Although tumours develop through a multistage process driven by the acquisition of genetic and epigenetic abnormalities, many of them become dependent on one or few genes to promote malignancy. As many of these genes were originally identified as oncogenes, this attribute was named oncogene addiction1. However, it is known that tumours can also become dependent on the normal cellular functions of certain genes, which themselves are not classical oncogenes, an attribute known as non-oncogene addiction (NOA)2.

In order to achieve uncontrolled proliferation, tumour cells rely upon the downregulation or epigenetic silencing of DNA Damage Repair (DDR) genes and the consequent increase in genetic and epigenetic instability3. Genome instability is indeed a fundamental hallmark of cancer4, possibly linked to oncogene-induced DNA damage5. Accordingly, it has been demonstrated that DDR genes alterations are prevalent in multiple human cancer types6,7. Despite the need for this genomic instability, tumours require some degree of DNA repair proficiency to survive the damage induced by genotoxic stress, uncontrolled proliferation, and treatments. Thus, the activation of the DNA-damage response in cancer can be considered as a type of NOA that critically supports tumour survival8. NOA genes are especially associated with stress maintenance functions such as DNA double-stranded break repair, chromatid segregation, and DNA replication regulation9. Accordingly, several DNA repair genes have been shown to exhibit Copy Number Variation (CNV) gain in cancer, which is positively correlated with upregulation of DDR genes10.

Given the importance of both oncogene and non-oncogene addiction, different groups have developed a variety of approaches to identify and prioritize cancer dependencies and vulnerabilities to exploit them therapeutically. The Cancer Dependency Map was initially created by systematically identifying genetic dependencies using RNAi-based loss-of-function genetic screens in 501 cancer cell lines11. Additionally, two large pan-cancer CRISPR-Cas9 screens were also independently performed with the same goal, containing data from over 1000 screens of more than 900 cell lines12,13. Based on both approaches, a combined CRISPR-shRNA dependency score was later developed, providing a more sensitive measure to identify essential genes14. These loss-of-function genetic screens have allowed investigating cellular drug mechanism-of-action in cancer cell lines15 as well as the identification of potential therapeutic targets like the DNA repair helicase WRN, a synthetic lethal vulnerability and promising target in cancers with microsatellite instability16.

In this study, I integrated data derived from The Cancer Dependency Map and transcriptomic data derived from The Cancer Genome Atlas (TCGA) to characterize the non-oncogene addiction to DNA repair across the pan-cancer scale. Next, I explored the relevance of DDR genes in terms of patient survival and potential therapeutic alternatives. Following this approach, a molecular signature of overexpressed DDR genes showing the weakest dependency scores was strongly associated with worse survival. The fact that cancer cells overexpress these genes despite the weak dependence on them (as growth promoters), suggests that they may act as late promoters of cell survival. Thus, as opposed to traditional cancer drivers, this supports the existence and importance of non-oncogenic addiction to DNA repair in cancer. This approach may facilitate the identification of prognostic biomarkers and therapeutic opportunities by targeting this non-oncogene addiction.

Results

Efficacy and selectivity of DNA repair genes

One of the few comprehensive lists of human DDR genes was published in 200517. In order to obtain an updated list, a literature search was done and the most comprehensive list found was used as a starting point6. This list comprised 276 DDR genes annotated in ten different DNA repair pathways, but genes involved in the nucleotide pool maintenance were excluded in this study as that is not truly a pathway. First, to identify the essentiality of these genes across different types of cancer cell lines, the efficacy and selectivity scores (θ = 0.6; CRISPR:shRNA = 60:40; see methods) were obtained from shinyDepMap14. These metrics are derived from a combined CRISPR-shRNA gene dependency score based on the Cancer Dependency Map dataset11. Efficacy scores represent the degree to which loss of a particular gene reduces cell growth in sensitive lines while selectivity represents the degree to which its essentiality varies across lines. After removing genes that were not found in the shinyDepMap dataset, 241 DDR genes (Supplementary Table 1) were grouped in nine different DNA repair pathways and the efficacy scores for each pathway were compared (p = 0.02, Kruskal–Wallis rank-sum test, Fig. 1a). Although it seems that some pathways contain more genes with strongly negative efficacy scores, no statistically significant differences were found after Dunn’s Kruskal–Wallis Multiple Comparisons test. The same trend was observed for selectivity (p = 0.0004, Kruskal–Wallis rank-sum test, Fig. 1b).

Figure 1
figure 1

Efficacy and selectivity of DNA repair pathways. (A) Comparison of the efficacy scores (the more negative, the more essential) of the 241 DNA repair genes according to their pathway annotation (some genes are included in more than one pathway). (B) Comparison of the selectivity scores of the 241 DNA repair genes according to their pathway annotation (the more positive, the more selective). Notice that some pathways behave differently in terms of efficacy and selectivity. DR Direct repair, TS translesion synthesis, BER Base Excision Repair, NHEJ Non-Homologous End Joining, FA Fanconi Anemia, OTHER DNA repair-associated genes, HR Homologous Recombination, MMR Mismatch Repair, NER Nucleotide Excision Repair.

Next, as some pathways showed different behaviour in terms of efficacy and selectivity, the association between both metrics was determined. Genes were also grouped based on pathway annotation but only NER showed a statistically significant positive correlation (R = 0.29, p = 0.045, Spearman’s correlation test, Fig. 2). Additionally, to identify genes that are commonly essential in cancer cell lines (negative selectivity and negative efficacy scores), an “essentiality threshold” was set based on previous findings analyzing the whole set of human genes14. Following this approach, genes with efficacy scores higher than − 0.5 and selectivity scores higher than 0.0 were excluded. From the 241 genes initially included, only 59 were classified as commonly essential following these criteria (quadrant III, Fig. 2).

Figure 2
figure 2

Correlation between selectivity and efficacy. The scatterplots show the correlation between efficacy and selectivity scores of the 241 genes grouped according to pathway annotation. Only NER showed a statistically significant positive correlation (R = 0.29, p = 0.045, Spearman’s correlation test). Genes with an efficacy score higher than − 0.5 and a selectivity score higher than 0.0 were discarded and only genes under the “essentiality threshold” (quadrant III) were selected and classified as “commonly essential genes” (n = 59). Notice that genes from quadrant II also have strongly negative efficacy scores but present high selectivity (e.g., HR/BER: WRN).

Dependency scores of commonly essential DDR genes

To explore deeper the essentiality of these 59 DDR genes, the combined dependency scores (θ = 0.6; CRISPR:shRNA = 60:40) across 423 cancer cell lines were also obtained from the shinyDepMap dataset (Supplementary Table 2). Moreover, the lineage-dependent essentiality of 17 cell lineages was also analyzed. Only genes with at least one dependent lineage were included. Differences were found between these 59 genes in terms of dependency (p-value < 2.2e−16, Kruskal–Wallis rank-sum test, Fig. 3a). Only 43 of these genes had at least one dependent lineage (Fig. 3b), meaning that all cell lines for that lineage must have dependency scores lower than the essentiality threshold (near − 0.5) obtained by analyzing all human genes14. Notice that as expected, genes showing strongly negative dependency scores have more dependent lineages.

Figure 3
figure 3

Combined (CRISPR:shRNA) dependency scores of 59 DDR commonly essential genes across 423 cancer cell lines. (A) The more negative dependency score, the more essential for a cell line. (B) When a gene has dependency scores beyond the threshold of essentiality in all the cell lines from a lineage, that lineage is dependent on that gene. All 17 lineages were dependent on 32 out of the 59 DDR genes.

Transcriptomic alterations of commonly essential DDR genes

Since these 59 commonly essential DDR genes can also be essential for normal cells, the expression patterns of these genes were analyzed. For this purpose, the TACCO (Transcriptome Alterations in CanCer Omnibus) web server18 was used to obtain the log2 fold change (log2 FC) of these genes from paired normal-tumour samples. TACCO includes the mRNAs and miRNA expression levels (Transcripts Per Million) of 26 types of cancer from the Broad GDAC Firehose. Gene expression data were obtained for 18 types of cancer (Supplementary Table 3) after establishing two conditions: (1) only genes with a statistically significant log2 fold change (after p-value adjustment) were included and (2) only cancer types in which at least half of DDR genes meet condition 1 were included. Differences were also found between the 59 genes in terms of log2 FC, with some genes being downregulated and others highly upregulated (p-value < 2.2e−16, Kruskal–Wallis rank-sum test, Fig. 4a). Next, to have a better visualization of the expression levels of these 59 DDR genes across the 18 types of cancer, a heatmap was elaborated with the log2 FC of each gene (Fig. 4b). Three groups are observed among the cancer types with at least 10 genes being highly upregulated in most of them. The majority of these genes are part of several pathways or participate in Homologous Recombination. For many genes either there is no significant fold change (grey colour), or they are downregulated.

Figure 4
figure 4

Log2 Fold Change of the 59 commonly essential DDR genes across 18 types of cancer. (A) Only genes with a log2 fold change with a statistically significant adjusted p-value were considered in each type of cancer. Notice that the more positive log2 FC, the greater the expression compared to normal tissue (B) Three clear groups are observed among all cancer types with at least 10 genes annotated in multiple pathways being highly upregulated (red colour) in most of them. No significant fold change is coloured in grey and downregulation in blue. Group A: READ: Rectum adenocarcinoma; COAD: Colon adenocarcinoma; THCA: Thyroid carcinoma; GBM: Glioblastoma multiforme. Group B: STAD: Stomach adenocarcinoma; ESCA: Esophageal carcinoma; BLCA: Bladder Urothelial Carcinoma; HNSC: Head and Neck squamous cell carcinoma; CHOL: Cholangiocarcinoma, LIHC: Liver hepatocellular carcinoma; UCEC: Uterine Corpus Endometrial Carcinoma; BRCA: Breast invasive carcinoma; LUAD: Lung adenocarcinoma; LUSC: Lung squamous cell carcinoma. Group C: KIRC: Kidney renal clear cell carcinoma; KIRP: Kidney renal papillary cell carcinoma; PRAD: Prostate adenocarcinoma; KICH: Kidney Chromophobe.

Since it was noticed that genes with strongly negative dependency scores (more important for cancer cell growth) were not always highly upregulated, all variables were analyzed together to further explore clustering patterns. The efficacy and selectivity scores were included as continuous variables. The dependency scores of the 59 genes (for the whole set of 423 cancer cell lines) and the log2 FC (for the 18 types of cancer) were transformed from continuous to categorical (ordinal) variables (see methods) (Supplementary Table 4). A distance matrix was created based on Gower’s similarity coefficient. Three clusters were generated using the Partitioning Around Medoids (PAM) and hclust algorithms. The t-Distributed Stochastic Neighbor Embedding (tSNE) algorithm (perplexity = 10, max. iteration = 1000) and Hierarchical clustering were used for dimensionality reduction. Figure 5 shows the three clusters generated based on the two clustering algorithms. Interestingly, cluster 1 consists of upregulated genes with strongly negative dependency and efficacy scores. On the contrary, cluster 3 consists of upregulated genes but they all have the weakest negative dependency and efficacy scores. Cluster 2 consists of genes with intermediate scores.

Figure 5
figure 5

Clustering of the 59 DDR commonly essential genes. Gower’s similarity coefficient was used to calculate the distance matrix based on efficacy, selectivity, dependency scores, and log2 FC of the 59 commonly essential DDR genes. (A) Partitioning Around Medoids (PAM) was used as a clustering algorithm and the t-Distributed Stochastic Neighbor Embedding (tSNE) algorithm was used for dimensionality reduction. Note that the size, distance, and shape of clusters in the tSNE analysis do not always convey a meaning as clusters may vary upon initialization, perplexity values, and iteration. (B) Hierarchical clustering for dimensionality reduction based on the distance matrix calculated using Gower’s distance and the hclust algorithm.

Commonly essential DDR genes survival analysis

To explore the clinical significance of the above findings, the GEPIA (Gene Expression Profiling Interactive Analysis) web server19 was used to perform survival analysis based on gene expression levels. GEPIA delivers fast and customizable functionalities using The Cancer Genome Atlas (TCGA) dataset20. The genes from each cluster were analyzed to determine if their expression was associated with worse or better overall survival in each of the three cancer groups (A, B, C) from the heatmap above (group B was subdivided into B1 and B2 as it was much larger and heterogeneous than A and C). Almost all 59 genes were statistically significant associated with worse or better survival at least in one cancer group (Supplementary Table 5). However, only genes whose higher expression was associated with a statistically significant log-rank P-value and statistically significant hazard ratio (HR > 2/HR < 1) were selected for further analysis (Fig. 6a).

Figure 6
figure 6

Survival analysis based on the expression levels of the 59 commonly essential DDR genes. (A) Genes whose higher expression was associated with better or worse overall survival (with a statistically significant log-rank P-value and statistically significant hazard ratio HR > 2/HR < 1) are indicated for each cancer group. Genes exclusively associated with the better/worse survival signature are indicated. (B) No statistically significant differences were found between the number of genes contributed by clusters 1 and 3 to each signature (p = 0.68, Fisher's exact test). (C) Prognostic value of the “better/worse” signatures in terms of overall survival (considering the 18 cancer types together).

As several genes have opposite effects for two or more cancer groups, genes that were exclusively associated with better survival and those exclusively associated with worse survival were analyzed as separate molecular signatures. Fourteen genes were classified as “better survival” (four from cluster 1 and five from cluster 3) whereas nineteen genes were classified as “worse survival” (five from cluster 1 and ten from cluster 3). Despite this trend, no statistically significant differences were found between the number of genes contributed by clusters 1 and 3 to each signature (p = 0.68, Fisher's exact test; Fig. 6b). The “better survival” gene signature was analyzed in terms of overall survival and as expected, it was associated with better survival across the 18 cancer types included in this study (p = 7.4e−09; HR 0.76; pHR = 8.3e−09, n = 3244; Fig. 6c). On the contrary, the “worse survival” gene signature was associated with worse overall survival (p = 0; HR 2.0; pHR = 0, n = 3244; Fig. 6c). Notice that p values of cero must be understood as “below machine precision” (see “Methods”).

To further understand these molecular signatures, the GEPIA2 web server21 (which is also based on TCGA dataset) was used to identify genes with similar expression patterns across the 18 cancer types. For each signature (better/worse), the Pearson’s correlation coefficient of 100 genes was obtained (range 0.65–0.80 for the “better survival” and 0.79–0.84 for the “worse survival”) (Supplementary Table 6). Remarkably, when these genes were used as a multi-gene signature to perform survival analysis based on their expression levels in the 18 cancer types, a higher expression of the “better 100 signature” was found to be also associated with better survival (p = 8.5e−06; HR 0.81; pHR = 8.8e−06; n = 3244; Fig. 7A). On the contrary, a higher expression of the “worse 100 signature” was associated with worse survival (p = 0; HR 2.1; pHR = 0; n = 3244; Fig. 7D).

Figure 7
figure 7

Characterization of the “Better/Worse 100 signature”. (A,D) Multi-gene signature survival analysis based on the expression of the 100 genes correlated with the “better” and “worse” survival signature (in all 18 cancer types). (B,E) Protein–protein interaction networks of each multi-gene signature. (C,F) Gene Ontology enrichment analysis and visualization for each multi-gene signature. GO terms with the lowest adjusted p-value are indicated.

To further characterize the genes involved in each signature, the STRING web server22 was used to plot their protein–protein interaction networks. The stringApp23, was used to import the STRING networks into Cytoscape24 for network analysis. Overall, the worse survival signature network showed a much higher network density and centralization, as well as lower network heterogeneity (Fig. 7B,E). Additionally, the g:Profiler web server25 was used for functional interpretation of these signatures according to their gene ontology (Fig. 7C,F). Interestingly, most of the genes from the “better survival 100 signature" are annotated in functions related to transcription regulation and mRNA metabolism. On the other hand, genes from the "worse survival 100 signature" are annotated in functions related to DNA replication and cell cycle (Supplementary Table 7).

Finally, considering that DDR genes have been good candidates for the synthetic lethality approach26 and considering the huge impact of the worse survival signature in terms of overall survival, the SynLethDB27 was used to search synthetic lethal partners for the 19 genes from this signature. SynLethDB harbours a large set of synthetic lethality gene pairs collected from a variety of sources, including biochemical assays, computational predictions, other related databases, and text mining. Sixteen genes presented at least one potential synthetic lethal partner (SLP) (statistic score ≥ 0.50). In 10 of these genes (ERCC3, BRIP1, POLQ, UBE2T, RFC2, RAD51D, MUS81, PCNA, RPA3, RFC4), NAE1 (NEDD8 Activating Enzyme E1 Subunit 1) was the SLP with the highest score, and in 6 of them, it was the only gene with a statistic score ≥ 0.50 (Fig. 8A). To further explore this finding, the correlation between the expression levels of these 6 genes and NAE1 was analyzed in GEPIA2. A statistically significant positive correlation was found in the 18 types of cancer (R = 0.52, p = 0, Spearman’s correlation test, Fig. 8B). In fact, higher expression of NAE1 was also found to be associated with a worse prognosis in all 18 cancer types (p = 9e−06.; HR 1.2; pHR = 9.2e−06; n = 3244; Fig. 8C).

Figure 8
figure 8

Synthetic lethality gene pairs analysis. (A) Synthetic Lethal partners (SLP) were obtained for the 19 genes from the “worse survival” signature. Although NAE1 was the SLP with the highest statistic score in ten on these genes, it was the only significant SLP of six genes. (B) The expression levels of these six genes are positively correlated with that of NAE1 in the 18 cancer types. (C) Higher expression of NAE1 is statistically significant associated with worse prognosis in the 18 cancer types.

Discussion

The analysis of cancer dependencies and vulnerabilities aims to identify and prioritize new potential therapeutic targets. In this work, a Pan-cancer analysis of non-oncogene addiction to DDR genes using data derived from the Cancer Dependency Map and TCGA was performed. Among the 59 commonly essential DDR genes, large differences were found between them in terms of dependency scores across the 423 cancer cell lines. However, it must be considered that dependency scores are measured for cells grown in culture and are unlikely to fully reflect the conditions of the in vivo tumour microenvironment. Thus, results must be interpreted with caution and should be further addressed through different approaches.

Since some of the common essential DDR genes in cancer cells may also be essential for normal cells, the implications of essentiality were analyzed at the transcriptomic level from paired normal-tumour samples and large differences were also found. Genes with strongly negative dependency scores (cluster 1), were upregulated in most cancer types suggesting that they might be essential for carcinogenesis. Some of them can even be considered as potential cancer drivers as previously suggested6. Despite this, it was surprising that higher expression of several genes from clusters 1 and 2 (e.g., PRPF19 and ATRIP) were strongly associated with better overall survival in some cancer groups. This was further confirmed when the 14 genes exclusively associated with better survival were considered a multi-gene signature and the higher expression of this signature was also associated with better survival across the 18 types of cancer. This is especially interesting considering that the lower expression and/or alterations in DDR genes have been associated with improved survival in some cancer types28,29,30. Nevertheless, a couple of studies have also reported that active DNA repair and higher expression of several DDR genes (including genes from clusters 1 and 2 from this study) were associated with better survival in gastric and ovarian cancer31,32. Taken together, this better survival signature might represent promising biomarkers for prognosis and further clinical studies should be performed to fully validate this.

Another surprising finding was that genes from cluster 3 (weakest dependency and efficacy scores) like UBE2T, POLQ, BRIP1, H2AFX, and RTEL1, were also highly upregulated. This could indicate that cancer cells do not depend on these genes for their growth but survival, especially considering the persistent DNA damage and replication stress in cancer cells2. We can refer to this as the timing-dependent addiction, which means that although cancer cells have weak dependency on these genes (as growth promoters), these genes may act as late promoters of cell survival (given the evident overexpression). For instance, among genes from cluster 3, POLQ (Pol θ) might be one of the best examples of non-oncogene addiction to DNA repair (low efficacy and dependency scores but highly upregulated). In fact, it has been demonstrated that cancer cells with mutations in POLQ synthetic lethal (DDR) genes tend to become addicted to Theta Mediated End Joining (TMEJ) for survival, suggesting that POLQ becomes essential upon increased levels of endogenous and unrepaired DNA damage33.

Remarkably, the present work also demonstrated that higher expression of UBE2T, RFC4, POLQ, BRIP1, and H2AFX is especially associated with worse overall survival (HR > 5) in several types of cancer (Group A and D). Additionally, when considered as part of the worse survival signature, these genes were associated with a worse prognosis in the 18 cancer types. This might be also linked to the correlation found between the expression levels of this signature and the "worse 100 signature", which is enriched with genes involved in DNA replication and cell cycle, probably suggesting a more aggressive phenotype. Previous findings have reported that genes involved in cell cycle phases and checkpoint genes had higher essentiality percentages both in embryonic stem cells and in cancer cell lines34. Additionally, it has been shown that the amplification and overexpression of several DDR genes across the pan-cancer scale are associated with reduced mutation burden, cell line drug resistance, and poor prognosis35. In contrast, a higher somatic tumour mutational burden has been associated with better overall survival in several types of cancer36. Taken together, the results from the present study support the notion that non-oncogene addiction to several DDR genes could reduce the tumour mutational burden and confer treatment resistance, suggesting an explanation for the worse overall survival associated with this signature.

The results discussed above also support previous findings reporting the key role of the DDR genes from cluster 3 in cancer. For instance, studying more than 10,000 TCGA pan-cancer tumours, Wu and colleagues found that tumours with amplifications in DDR genes like UBE2T (Ubiquitin Conjugating Enzyme E2 T) exhibited significantly reduced mutation burden, temozolomide resistance, and worse patient survival. In fact, UBE2T and BRIP1 are also in the top 10 amplified DDR genes in their study35. RFC4 (Replication factor C subunit 4) has been also identified as an upregulated DDR gene across the pan-cancer scale10. RFC4 protects colorectal cancer cells from X-ray-induced DNA damage and apoptosis through nonhomologous end joining (NHEJ)-mediated DNA repair37 and has also been associated with poorer prognosis in this malignancy38 and lung cancer39.

DNA polymerase theta (Pol θ) (which is encoded by the POLQ gene) is important in the repair of genomic double-strand breaks (DSBs) from many sources40. Upregulation of POLQ has been associated with poor clinical outcomes in breast and lung cancer41,42,43,44. Following the same line, the present study highlights the importance of POLQ as a prognostic factor associated with worse survival in different types of cancer (especially Group A and D). Recent reports have demonstrated that POLQ inhibitors selectively kill HR-deficient tumour cells in vitro and in vivo45,46. Additionally, it has been demonstrated that POLQ knockdown sensitizes several tumour cell lines to Ionizing Radiation and causes minimal effects on normal tissue radiosensitivity47.

RFC2, CHEK1, PCNA, and POLE2 were also associated with worse overall survival in several types of cancer (Group A and D). As they belong to cluster 1 (strongly negative dependency and efficacy scores and high expression levels), these genes might be more necessary for tumour growth and promotion. Upregulation of PCNA has been reported in several types of cancer and its inhibition is gaining more interest as a novel strategy in many cancers48,49,50. Similarly, the knockdown of POLE2 (DNA polymerase epsilon subunit 2) has been linked to reduced or suppressed tumorigenesis in different types of cancer51,52,53. CHEK1 (also known as Chk1 or checkpoint kinase 1) is essential to maintain cell viability in cancer cells54 and it has been shown that the combination of ATR and CHEK1 inhibitors results in cancer-specific synthetic lethality55. Notably, CHEK1 inhibition is synthetically lethal with loss of B-Family DNA Polymerases like POLE2 in lung cancer and colorectal cancer cells56. Taken together, inhibiting these genes from cluster 1 represents a promising therapeutic avenue in several types of cancer, especially focusing on the combined inhibition of CHEK1 and POLE2.

Finally, it was remarkable that NAE1 was identified as a synthetic lethal partner of ten genes from the worse survival signature. NAE1 is responsible for Neddylation, a post-translational modification that adds an ubiquitin-like protein (NEDD8) to multiple substrate proteins and it has been shown that its inhibition exerts anticancer effects mainly by triggering cell apoptosis, autophagy, and senescence57,58. In fact, Neddylation is beginning to be considered as a key factor of the DNA repair process59. As NAE1 inhibition has been proposed as a new approach to treating cancer60, these results support the potential of the combined inhibition of NAE1 and several DDR genes. More research will be needed to validate the implication of the molecular signatures described in this study and to what extent they apply to different subtypes of cancer and the complexity of their genetic backgrounds.

Methodology

All data processing steps, and statistical analyses were performed in the RStudio 1.4.1717 statistical environment (https://www.rstudio.com/).

Efficacy, selectivity, and dependency of DNA repair genes

The list of DNA repair genes and their respective pathway annotation was obtained from one of the latest and more complete reports so far6. The efficacy and selectivity scores for the 241 genes (θ = 0.6; CRISPR:shRNA = 60%:40%; where θ is the mixing ratio of the two scores) were obtained from shinyDepMap14 (https://labsyspharm.shinyapps.io/depmap). These values are derived from a combined CRISPR-shRNA gene dependency score based on the Cancer Dependency Map dataset11 (https://depmap.org/portal/). Additionally, for those genes with efficacy scores lower than -0.5 and selectivity scores lower than 0 (termed essential genes), the combined (θ = 0.6; CRISPR:shRNA = 60%:40%, where θ is the mixing ratio of the two scores) dependency scores across 423 cell lines were obtained. Likewise, the lineage-dependent essentiality of 17 cell lineages was also analyzed. For a lineage to be classified as dependent, all its cell lines must have dependency values that exceed the essentiality threshold obtained by analyzing all available genes.

Transcriptomic alterations of commonly essential DRGs

For transcriptomic analysis, the TACCO (Transcriptome Alterations in CanCer Omnibus) webserver18 (http://tacco.life.nctu.edu.tw/) was used to identify the patterns of expression of essential DNA repair genes (efficacy < − 0.5; selectivity < 0) in 18 cancer types. TACCO includes the mRNAs expression levels (Transcripts Per Million) for 26 cancer types from the Broad GDAC Firehose (https://gdac.broadinstitute.org/) (version stddata__2016_01_28). For this task, only the log2 fold change (log2_FC) of those genes with statistically significant (Benjamini-Hochberg) adjusted p values were included. For the heatmap elaboration, the ComplexHeatmap61 R package was used.

To find clustering patterns among DDR genes, all variables were analyzed together. The efficacy and selectivity scores were included for each gene as continuous variables. Based on the median dependency score of each gene (for the whole set of 423 cell lines), dependency scores were transformed from continuous to categorical: weakly negative (from 0 to − 0.5), moderate (from − 0.5 to − 1.0), and strongly negative (from − 1.0 to − 1.5). The log2 FC was also transformed to categorical: downregulated (from 0 to − 1.25), upregulated (from 0 to 1.25), and highly upregulated (from 1.25 to 2.5 +). A distance matrix was created based on Gower's similarity coefficient, as it is suggested for mixed-type variables. The clustering algorithm used was Partitioning Around Medoids (PAM) and the silhouette plot was used to determine the number of clusters. The t-Distributed Stochastic Neighbor Embedding (tSNE) algorithm was used for dimensionality reduction (perplexity = 10, max_iter = 1000). Hierarchical clustering was used as an additional algorithm for dimensionality reduction based on the same Gower's similarity coefficients calculated.

Survival analysis, protein networks, and gene ontology

The GEPIA (Gene Expression Profiling Interactive Analysis) webserver19 (http://gepia.cancer-pku.cn/) was used to compare the different gene clusters in terms of patient overall survival across the cancer groups included in this study. GEPIA is based on The Cancer Genome Atlas (TCGA)20 and The Genotype-Tissue Expression (GTEx) project62. The thresholds for high/low gene expression levels between cohorts were adjusted to 50%. Log-rank P values, hazard ratios (HR), and 95% confidence intervals (CI) were obtained for each gene. Notably, as GEPIA calculations are run in R, some p values might be reported as cero since R does not report values lower than 2.220446e-16 (.Machine$double.eps). Therefore, p values of cero must be understood as “below machine precision”.

The GEPIA2 web server21 (http://gepia2.cancer-pku.cn/#index) was used to identify genes that have a similar expression pattern for each signature across all cancer types. The Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING v11.0) web server22 (https://string-db.org/) was used to construct the protein–protein network associated with the better/worse 100 signature). The stringApp23, a Cytoscape app that makes it easy to import STRING networks into Cytoscape24 was used for network visualization and analysis. Additionally, the g:Profiler web server25 (http://biit.cs.ut.ee/gprofiler/) was used to find statistically significant Gene Ontology terms, pathways, and other gene functions giving a list of DDR genes.

Synthetic lethal pair analysis

The SynLethDB web server27 (http://synlethdb.sist.shanghaitech.edu.cn/v2/#/) was used to search synthetic lethal partners given a specific gene. SynLethDB harbours a large set of synthetic lethality gene pairs collected from a variety of sources, including biochemical assays, other related databases, computational predictions, and text mining.