1 Introduction

Survival analysis is a set of statistical methods used to study the time until an event of interest occurs and is commonly used in medical research to estimate life expectancy based on patient-specific data [21]. A pivotal aspect of survival analysis is estimating survival curves and comparing the probability of survival over time between different cohorts [4]. In biomedicine, we can relate the differences in survival to potential markers such as specific genes [2] or groups of genes [22], which can help distinguish patients who respond to treatments from those who do not (see Fig. 1) [27].

Fig. 1. Example of survival curves representing two conditions dependent on gene expression and associated with patient survival. A group of patients in the METABRIC dataset with a highly expressed FLT3 gene shows a noticeably higher survival function, as depicted by the survival curve on the right. This difference is less prominent and not significant in the case of the PLCE1 gene, as seen in the survival curve on the left. Part of survival analysis in data with gene expressions is to find markers, that is, genes and sets of genes, whose expression can characterize cohorts of patients with substantially different survival functions.

Rather than a single gene, intricate networks of gene interactions determine the complex nature of diseases such as cancer [26]. Identifying and characterizing these interactions is essential, as they offer critical insights into the onset and progression of a disease that may be overlooked when analyzing individual genes. Computational discovery of gene interactions is a well-researched area in genome-wide association studies (GWAS) [17] and gene expression-based phenotype categorization [5]. For the former, a notable approach for handling survival data is the adaptation of multifactor dimensionality reduction (MDR) [6, 15]. Authors also typically use Cox regression to analyze the interaction effects of candidate genes [28, 30]. Analyzing survival data is crucial in the clinical domain, which highlights the need for more systematic, data-driven methodologies to unravel intricate gene interactions linked to survival. Computational methods that specifically address gene interactions from survival data are, at best, scarce, and the recent abundance of survival data that includes gene expression creates a need for their development.

Here we report on a data-driven approach for identifying gene interactions significantly affecting survival rates. In the context of our study, gene interaction refers to the combined effect of two genes on survival, which may be substantially different from their individual effects. Our method aims to measure this interaction effect, quantified as the difference in restricted mean survival time (RMST) when considering the expression of both genes together compared to the expression of individual genes. We then rank gene pairs based on the significance of the difference in RMST. We use top-ranked gene pairs, cross-reference our findings with documented interactions, and synthesize complex literature findings using large language models, thus expanding the exploratory scope of our study.

In Sect. 2, we (1) introduce the data, (2) describe how we measured the effect on survival, (3) explain the measure of interaction and how we define different types of interactions, and (4) describe the use of large language models when cross-referencing our findings with existing literature. Section 3 briefly describes our analysis findings, followed by a discussion of limitations and possible future work in Sect. 4.

2 Methods

Our method focuses on two-gene interactions and unfolds through a four-step process. First, we separate samples into evenly sized groups according to the median gene expression value. Second, we estimate a survival curve for each group and compute its restricted mean survival time (RMST). Third, we quantify the difference in RMST between the groups. Lastly, we assess the interaction effect by evaluating how much the RMST difference of the interaction term (see Sect. 2.4) departs from that of the participating genes. We replicate this procedure for each gene pair in our data set during our discovery-driven analysis and rank the pairs based on their interaction effect. This ranked list gives biologists a starting point for interpretation and investigation. To aid this process, we implement literature mining and use large language models to distill complex biological knowledge for assistance and interpretation.

2.1 Data

In this study, we leveraged two sources of survival data:

  • TCGA. We procured RNA-Seq data, including gene expression matrices and corresponding survival endpoints, for various cancer types from The Cancer Genome Atlas via the GEO portal (GSE62944) [16]. Given the variability in sample size across different datasets, we included only those with more than 100 samples, resulting in 20 TCGA datasets.

  • METABRIC. We obtained the microarray gene expression matrix and patient survival data from The Molecular Taxonomy of Breast Cancer International Consortium through cBioPortal [3].

Across all datasets in our study, we applied a log transformation to each gene expression value after adding a pseudocount of 1. We then z-score normalized each gene across samples within a dataset, which standardizes the columns of the expression matrix. From the clinical metadata, we used each sample’s overall survival (OS) time and event status. OS time refers to the most recent date a patient was confirmed alive. The event is recorded when a patient dies due to the condition under study, in this case, cancer. If a patient’s status is unknown or death occurs due to unrelated causes, we classify the event status as censored. Note that sample sizes and event rates vary across datasets (Table 1).
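The two transformations above can be sketched as follows. This is a minimal illustration rather than the study's implementation: the function name `preprocess`, the toy matrix, and the choice of the natural logarithm are assumptions, since the base of the logarithm is not stated.

```python
import numpy as np

def preprocess(expr):
    """Log-transform with a pseudocount of 1, then z-score each gene (column).

    expr: samples x genes matrix of non-negative expression values.
    """
    logged = np.log1p(expr)        # log(x + 1); natural log is an assumption
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0            # constant genes stay at zero instead of dividing by zero
    return (logged - mean) / std

# Toy matrix: 3 samples x 2 genes (made-up values)
expr = np.array([[0.0, 10.0], [3.0, 20.0], [8.0, 40.0]])
z = preprocess(expr)
```

After this step every gene has zero mean and unit variance across samples, so expression values of different genes can be summed, subtracted, or multiplied on a common scale.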

Table 1. Statistics about censoring in obtained datasets. The table shows the number of samples and the ratio of censored events. We observe a high rate of censoring across datasets.

To limit our exploration scope, we have focused solely on a specific set of genes referred to as L1000 genes [23]. The L1000 gene set contains roughly one thousand landmark genes acting as proxies to infer the expression of other genes. Using this curated set of landmark genes, we significantly reduced the dimensionality of our search space to a set of 1058 genes. Additionally, we removed genes with low expression values to reduce noise before we proceeded with computation. We have disregarded genes with a 75th percentile expression value lower than 10.
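A minimal sketch of the low-expression filter is given below. The helper name `keep_expressed` and the toy values are hypothetical, and we assume here that the threshold is applied to the untransformed expression values; the text does not specify at which stage of preprocessing the filter is applied.

```python
import numpy as np

def keep_expressed(expr, threshold=10.0, q=75):
    """Boolean mask of genes (columns) whose q-th percentile reaches the threshold."""
    return np.percentile(expr, q, axis=0) >= threshold

# Toy matrix: 4 samples x 2 genes (made-up values)
expr = np.array([[0, 50], [2, 5], [4, 30], [6, 20]], dtype=float)
mask = keep_expressed(expr)   # first gene is dropped, second gene is kept
filtered = expr[:, mask]
```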

2.2 Summary Measure of Survival: Restricted Mean Survival Time

Restricted Mean Survival Time (RMST) is the average survival time up to a pre-specified time point, quantified as the area under the survival curve up to that point (see Fig. 2) [29]. Its primary benefits are that it is interpretable, provides a meaningful summary of survival data, and is considered more robust than measures of median survival time [7].
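The area-under-the-curve definition can be written compactly. The notation is illustrative shorthand and not taken from the cited works: \(S(t)\) is the survival function, \(\tau\) the pre-specified time point, and \(t_1< \dots < t_k\) the distinct event times before \(\tau\). The second expression assumes a step-function estimate \(\hat{S}\) such as the Kaplan-Meier curve.

```latex
\[
\mathrm{RMST}(\tau) = \int_{0}^{\tau} S(t)\,dt ,
\qquad
\widehat{\mathrm{RMST}}(\tau) = \sum_{j=0}^{k} \hat{S}(t_j)\,(t_{j+1} - t_j),
\quad t_0 = 0,\; t_{k+1} = \tau,\; \hat{S}(t_0) = 1 .
\]
```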

Fig. 2. Illustration of RMST for groups of patients distinguished by varying expression levels of FLT3. The group with low expression of FLT3 has an average survival time of around 137 months compared to 165 months for the other group if we consider the first 250 months of the study.

Building upon its intuitive nature, RMST has gained substantial traction for its versatile utility in comparing differences in survival between cohorts [19]. The difference in RMST is an alternative means to measure gains or losses in the event-free survival between different groups of patients (see Fig. 3). Unlike the log-rank test, which heavily relies on the assumption of proportional hazards and may be sensitive to instances of crossing survival curves, the difference in RMST presents a more flexible and reliable approach [25].

Fig. 3. RMST is a good metric for comparing two survival curves. The absolute difference in RMST represents the area between the curves. Here we illustrate this with survival curves from Fig. 2. The absolute difference in RMST can be considered the measure of time lost or gained between patients that were grouped by the expression values of gene FLT3. It provides a quantifiable measure that supports the visual interpretation of the difference in Fig. 1.

2.3 Interaction Scoring

We have devised a data-driven approach to identify interactions that reveal significant RMST differences. Such a difference implies that the combined influence of both features on survival differs considerably from the individual influence of each feature. While this technique broadly applies to various types of data, our primary focus here is on gene expression data, which we use to determine the combined influence of gene pairs on survival outcomes compared to their individual effects.

The steps, summarized in Algorithm 1, are as follows:

  1. First, we partition samples into two cohorts based on the median expression value of a particular gene. Each cohort represents a group of patients with either low or high gene expression values (line 2).

  2. For each cohort, we calculate its Kaplan-Meier survival curve (line 3). Next, we compute the RMST for each survival curve (line 4). We limit RMST computation to the 75th percentile of all survival times in the cohort to circumvent potential issues arising from uncertainty in survival estimates of long survivors and to ensure a fair comparison across different cohorts by consistently applying the same upper bound.

  3. We calculate the absolute difference in RMST between the two created cohorts (line 5). This difference effectively represents the area between the survival curves, providing a measure of the disparity in survival outcomes between the two groups (as shown in Fig. 3).

  4. To determine whether an interaction effect exists, we first calculate the RMST differences for the individual genes and their interaction (lines 6-8). We then compute the interaction measure as the absolute difference between the largest individual RMST difference and the difference in RMST for the interaction term (line 10).

Algorithm 1. Interaction measure between two genes
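The four steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the published implementation: all function names are hypothetical, the Kaplan-Meier estimator is written out by hand to keep the example self-contained, and we assume that the shared upper bound is the 75th percentile of survival times over all samples being compared.

```python
import numpy as np

def km_curve(time, event):
    """Kaplan-Meier estimate: distinct event times and survival just after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    times = np.unique(time[event])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return times, np.array(surv)

def rmst(time, event, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    times, surv = km_curve(time, event)
    keep = times < tau
    grid = np.concatenate(([0.0], times[keep], [tau]))
    steps = np.concatenate(([1.0], surv[keep]))
    return float(np.sum(steps * np.diff(grid)))

def rmst_difference(feature, time, event):
    """Median split on a feature, then |RMST_high - RMST_low|."""
    feature, time, event = map(np.asarray, (feature, time, event))
    high = feature > np.median(feature)
    tau = np.percentile(time, 75)  # assumption: one bound shared by both cohorts
    return abs(rmst(time[high], event[high], tau)
               - rmst(time[~high], event[~high], tau))

def interaction_measure(g1, g2, time, event, combine=np.add):
    """|RMST difference of the combined feature - largest single-gene difference|."""
    d1 = rmst_difference(g1, time, event)
    d2 = rmst_difference(g2, time, event)
    d12 = rmst_difference(combine(g1, g2), time, event)
    return abs(d12 - max(d1, d2))

# Toy data: 8 samples, all events observed (made-up values)
t = np.arange(1.0, 9.0)
e = np.ones(8, dtype=bool)
g1 = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
g2 = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
score = interaction_measure(g1, g2, t, e)
```

The `combine` argument selects how the two expression vectors are merged into a single feature before the median split; the default sum is only one possible choice.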

2.4 Interaction Types

We define three types of interactions between genes that correspond to different cohort formations (see Fig. 4). Using standardized gene expression values of two genes, we construct a new feature and create cohorts using the approach mentioned earlier. Gene interactions measured with this approach should be interpreted with respect to survival and not as physical interactions.

The first interaction is an additive (+) interaction, where standardized gene expression values of both genes are summed together. Such interactions are more common for genes of protein complex subunits.

The second interaction is a competing (-) interaction, where standardized gene expression values are subtracted. The cohorts represent which of the two genes was more expressed. Such interactions are more common for activator and inhibitor-type interactions, where both genes regulate the same process.

The last interaction is an XOR-type (\(\times \)) interaction, where we multiply standardized gene expression values. These interactions are more complex and are scarce in nature. They may result from the alternative signaling pathways to the same process influencing survival.
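The three cohort formations can be illustrated with a short sketch. The dictionary `COMBINE`, the function `cohorts`, and the toy standardized values are hypothetical and serve only to show how each combined feature is built and split at its median.

```python
import numpy as np

# One combined feature per interaction type, built from standardized expression
COMBINE = {
    "additive": np.add,        # z1 + z2
    "competing": np.subtract,  # z1 - z2
    "xor": np.multiply,        # z1 * z2
}

def cohorts(z1, z2, kind):
    """Boolean mask of the 'high' cohort after a median split of the combined feature."""
    feature = COMBINE[kind](np.asarray(z1), np.asarray(z2))
    return feature > np.median(feature)

# Toy standardized expression of two genes in four samples (made-up values)
z1 = np.array([-1.5, -0.5, 0.5, 1.5])
z2 = np.array([-1.0, 2.0, -2.0, 1.0])

additive = cohorts(z1, z2, "additive")    # samples with the larger summed expression
competing = cohorts(z1, z2, "competing")  # samples where the first gene dominates
xor = cohorts(z1, z2, "xor")              # samples where both genes deviate in the same direction
```

In the XOR-type case the product is positive when both genes are above or both are below their means, so the median split separates samples with concordant expression from those with discordant expression.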

Fig. 4. Interaction definition schema. Cohort formation (top row), RMST difference calculation (middle row), interaction significance according to absolute RMST difference (bottom row).

2.5 Discovering False Positives with Permutation Test

To identify potential false positive interactions, we performed a permutation test for every data set and interaction type: we randomly shuffled the survival endpoint and reran the experiment, repeating this 100 times. With 100 permutation runs per data set and interaction type, the computation was extensive because of the sheer volume of gene pairs to examine. From the resulting permutation distribution, we isolated the top 0.01% of interactions, which we deem non-random occurrences. In essence, we consider interactions exceeding the 99.99th percentile as potential interaction hits.
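A minimal sketch of such a permutation threshold is shown below. It makes two assumptions that should be read as such: scores from all permutation runs are pooled before taking the percentile, and `toy_scores` is a deliberately simple stand-in for the RMST-based interaction measure, used only to keep the example self-contained. All names and the simulated data are hypothetical.

```python
import numpy as np

def toy_scores(X, time, event):
    # Stand-in score, NOT the RMST-based measure: gap in mean follow-up time
    # between the high and low halves of each column. `event` is unused here.
    high = X > np.median(X, axis=0)
    return np.array([abs(time[h].mean() - time[~h].mean()) for h in high.T])

def permutation_threshold(score_fn, X, time, event, n_perm=100, q=99.99, seed=0):
    """q-th percentile of scores obtained after shuffling the survival endpoint."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(time))  # shuffle samples, keep (time, event) paired
        null.append(score_fn(X, time[idx], event[idx]))
    return float(np.percentile(np.concatenate(null), q))

# Simulated data: 40 samples, 20 candidate features (made-up values)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))
time = rng.exponential(50.0, size=40)
event = rng.random(40) < 0.4

observed = toy_scores(X, time, event)
threshold = permutation_threshold(toy_scores, X, time, event)
hits = int(np.sum(observed > threshold))  # features scoring above the permutation threshold
```

Shuffling time and event status together preserves the censoring pattern while breaking any association between expression and survival.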

2.6 Literature Mining

We propose using literature mining to explain the interactions where possible and to synthesize intricate biological knowledge by leveraging large language models. Specifically, we have used GPT-3.5 and GPT-4 developed by OpenAI. We focused on each data set’s top 100 ranked gene interaction pairs and interaction types. These were cross-referenced within the STRING [24] and BioGRID [14] databases to ascertain how many gene pairs appear in those interaction networks. We also determined the number of shortest paths and the shortest path length between gene pairs within the BioGRID interaction network, and we incorporated UniProt descriptions of all genes under investigation to supplement our analysis [1].
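The shortest path length and the number of shortest paths between two genes can be obtained with a breadth-first search, sketched below. The function name and the four-node toy network with placeholder gene labels are hypothetical; the sketch treats the interaction network as an unweighted, undirected graph stored as an adjacency dictionary, which is an assumption about how the network would be represented.

```python
from collections import deque

def shortest_paths(graph, source, target):
    """Return (length, count) of shortest paths between two nodes, or (None, 0)."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:                 # first visit fixes the distance
                dist[nb] = dist[node] + 1
                count[nb] = count[node]
                queue.append(nb)
            elif dist[nb] == dist[node] + 1:   # another equally short route
                count[nb] += count[node]
    return dist.get(target), count.get(target, 0)

# Toy undirected network with placeholder gene labels (not real interactions)
graph = {
    "G1": ["G2", "G3"],
    "G2": ["G1", "G4"],
    "G3": ["G1", "G4"],
    "G4": ["G2", "G3"],
}
length, n_paths = shortest_paths(graph, "G1", "G4")
```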

Having performed initial analyses, we then concentrated on the top 10 ranked gene pairs and on interactions previously reported in the literature. Using the language models, we sought to condense the complex biological context, prompting the models to extrapolate potential functional associations between these genes. The models’ prompts were informed by the UniProt functional descriptions of the gene pairs and of some genes found on the shortest path within the BioGRID interaction network.

3 Results

We applied our proposed approach to the TCGA and METABRIC datasets.

3.1 Analysis Reveals Potential Interactions

We overlay the interaction hits with the permutation test results described in Sect. 2.5. The average number of interactions above the threshold for permutations was always 55.9, equivalent to 0.01% of all tested interactions. The tail of the distribution beyond the 99.99th percentile of interactions is also visualized (see Fig. 5).
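The value of 55.9 is consistent with a simple count of gene pairs. The check below assumes that all 1058 genes of the reduced set enter the pairwise analysis, which is an assumption made for this illustration; data sets in which the low-expression filter removed genes would test fewer pairs.

```python
from math import comb

n_genes = 1058                  # size of the reduced gene set from Sect. 2.1
n_pairs = comb(n_genes, 2)      # unordered gene pairs: 1058 * 1057 / 2 = 559153
expected = n_pairs * 0.0001     # 0.01% of all pairs, roughly 55.9
```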

Fig. 5. Permutation test results for TCGA-HNSC dataset. Additive interaction hits against permutations (left), and competing interaction hits against permutations (right).

The number of additive and competing interaction hits overwhelmingly exceeded the threshold of 56 random interactions for almost all data sets (Table 2). The number of additive interactions is generally lower than the number of competing interactions for the same data set. On the other hand, XOR-type interactions are scarce and found in abundance in only one out of 21 data sets tested. Interestingly, there was no correlation between the number of interaction hits and the number of samples or events in the data set.

Table 2. Number of interaction hits for data sets with at least 100 events.

3.2 Cross-Referencing with Established Interaction Networks

We have cross-referenced the top 100 ranked gene interactions against known gene interaction networks in STRING and BioGRID. Our findings indicate that many of these interactions have some form of confirmation in these referenced databases. Additionally, we performed these steps using randomly selected pairs of genes instead of our top-ranked list and repeated this random sampling process a thousand times. As illustrated in Fig. 6, competing interactions from HNSC and KIRC emerge as interesting outliers. On average, the top additive and XOR interactions are more scarce in the databases than competing interactions.
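The random baseline can be sketched as follows. The function names, the placeholder gene labels, and the toy set of documented pairs are hypothetical and do not reproduce STRING or BioGRID content; the sketch only shows the sampling scheme of repeatedly drawing as many random pairs as there are top-ranked pairs and counting how many are documented.

```python
import random
from itertools import combinations

def documented(pairs, known):
    """Number of gene pairs that appear in a set of known (unordered) interactions."""
    return sum(frozenset(p) in known for p in pairs)

def random_baseline(genes, known, k=100, n_rep=1000, seed=0):
    """Counts of documented pairs among k randomly drawn pairs, repeated n_rep times."""
    rng = random.Random(seed)
    all_pairs = list(combinations(genes, 2))
    return [documented(rng.sample(all_pairs, k), known) for _ in range(n_rep)]

# Toy example with placeholder genes and an arbitrary set of "documented" pairs
genes = [f"G{i}" for i in range(30)]
known = {frozenset(p) for p in list(combinations(genes, 2))[:100]}
baseline = random_baseline(genes, known, k=20, n_rep=200)
```

Comparing the count for the top-ranked list against this distribution indicates whether the ranked pairs are documented more often than randomly chosen pairs from the same gene set.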

Given the surprisingly high number of documented interactions, even among randomly selected gene pairs, we hypothesize that this is because we are dealing with well-established genes, which enhances the likelihood of their documentation in high-throughput analyses. These analyses typically investigate thousands of genes simultaneously, and their results are then reported in databases like BioGRID.

Fig. 6. Number of interactions with confirmation in the literature for every dataset used in the analysis. Additive (blue), competing (red), and XOR-type (purple) interactions against 100 randomly selected interactions. (Color figure online)

3.3 Case Study: RHOA-CD44 Competing Interaction

We present one of the top three competing-type gene interaction hits from the kidney renal clear cell carcinoma (TCGA-KIRC) data set, with a confirmed interaction in both the STRING and BioGRID databases (see Fig. 7a). The competing interaction between the RHOA and CD44 genes yields an RMST difference between cohorts that is more than five months larger than that of either gene individually (see Fig. 7b).

Fig. 7. Case study of the RHOA-CD44 functional association in kidney renal clear cell carcinoma (TCGA-KIRC). a) permutation distribution tail, b) interaction confidence interval, c, d) Kaplan-Meier plots for RHOA and CD44 genes, e) cohort formation based on gene expression, f) Kaplan-Meier plot of the competing interaction.

The CD44 gene encodes a cell surface receptor that binds hyaluronan (HA) and is involved in cell-cell interactions, adhesion, and migration. It serves for signal transduction to different pathways, including cytoskeleton reorganization via the RhoA small GTPase [8]. Overexpression of CD44 was related to poor prognosis in glioblastomas [20] and renal cell carcinoma [12] but had no significant effect on breast cancer patient survival [18].

The RhoA gene encodes a small GTPase that functions as a molecular switch, mainly in cytoskeleton dynamics and cell migration [10]. Increased RhoA-ROCK activities mediate the upregulation of tumor suppressor p53 and induce G1 cell cycle arrest in kidney cell lines [13]. It has been shown that reduced RhoA expression enhances metastasis in breast cancer [9].

Observing Kaplan-Meier plots for both genes’ high and low expression cohorts confirms findings from the literature (see Fig. 7c,d). Our method reveals a competing interaction between the CD44 and RhoA genes. We interpret this as a competition between CD44 and RhoA-related biology, where the more highly expressed gene prevails. Note that we are comparing relative expressions according to the mean expression in the data set (see Fig. 7e). When RhoA is highly expressed, it inhibits the tumor suppression mechanism. Only when CD44 is more expressed than RhoA does it sufficiently activate downstream pathways to have a significant effect on survival over the effect of the RhoA gene (see Fig. 7f).

4 Discussion

Our results suggest a novel ability to identify interactions significantly affecting survival outcomes, thus unveiling insights into the complex landscape of gene interplay and disease prognosis. Even so, our methodology’s ranked gene interaction lists should be interpreted cautiously, serving primarily as an exploratory analysis. Due to the vastness of possible gene interactions, we expect some to arise purely by chance. Our preliminary work with permutation tests and literature mining provides only some supportive evidence that these findings are not chance occurrences. Our analysis identified several potential gene interactions affecting patient survival rates, providing a basis for further in-depth investigations. Particularly noteworthy is the abundance of XOR-type interactions in the HNSC dataset.

Our study also reveals an intriguing potential for large language models to summarize complex biological knowledge when fed with adequate context. By distilling intricate gene pair interactions and their associated functions as informed by resources like UniProt and interaction network databases, the models demonstrated their capacity to reason about known interactions, speculate on potential associations, and guide future exploratory directions (as illustrated with an example in Fig. 8). Although the present analysis should not be regarded as a definitive evaluation of interaction, it establishes an efficient pipeline to facilitate knowledge synthesis and accelerate the pace of scientific discovery, as demonstrated in the case study above.

Fig. 8. Example of seven HNSC dataset interactions, ranked by their RMST difference compared to non-interacting terms. The literature column reflects their documentation in public databases. We also display four large language model-generated summaries.

We also recognized noticeable differences in the quality of summaries generated by GPT-3.5 and GPT-4, indicating a trend of improved comprehension and representation of complex biological interactions with newer model iterations. This observation suggests a promising area for future research: customized language models, fine-tuned on recent, domain-specific literature, could serve as a more streamlined and context-aware alternative to the vast, generalized models currently accessed via APIs.

While our study presents interesting insights, several limitations present opportunities for future exploration and refinement. The choice of equally sized cohorts, achieved by splitting at the median, does not account for alternative cohort splits that might optimize the difference in RMST between cohorts. Additionally, we did not consider the potential influence of time limits on RMST calculations, which could significantly impact results and can be very study-specific. Lastly, our analysis was constrained by a low number of samples relative to the vast space of possible feature interactions, which may limit the generalizability of our findings. Future work is required to address these limitations and deepen the insights offered by our proposed methodology.

5 Conclusions

The prevalent nature of censored data and molecular fingerprints in clinical environments highlights the need for techniques to illuminate the biological processes regulating disease progression. Unraveling gene interactions is fundamental in understanding these processes, specifically their collective effects on phenotypes.

We report on our work to introduce a data-centric method for detecting gene interactions significantly affecting survival rates, leveraging restricted mean survival times. Using the proposed approach, we can identify possible novel gene interaction candidates on publicly available datasets. We further contextualize the hypothesized gene interactions through literature mining and using large language models to distill complex biological knowledge for assistance and interpretation. In a case study, we show the applicability of such an approach and its potential to uncover and explain potential new interactions.

We have made our method’s implementation and the accompanying data and scripts available on GitHub (see footnote 1) and archived them on Zenodo [11]. These resources include the extended results of permutation tests, summaries produced by the language models, and the prompt used to generate them.