Abstract
Protein ubiquitination regulates a wide range of cellular processes. The degree of protein ubiquitination is determined by the delicate balance between ubiquitin ligase (E3)-mediated ubiquitination and deubiquitinase (DUB)-mediated deubiquitination. Compared with E3-substrate interactions, DUB-substrate interactions (DSIs) remain insufficiently investigated. To address this challenge, we introduce TransDSI, a protein sequence-based ab initio method that transfers proteome-scale evolutionary information to predict unknown DSIs despite inadequate training datasets. An integrated explainable module suggests the protein regions that are critical for each predicted DSI. TransDSI outperforms multiple machine learning strategies in both cross-validation and independent testing. Two predicted DUBs (USP11 and USP20) for FOXP3 are validated by “wet lab” experiments, along with two predicted substrates (AR and p53) for USP22. TransDSI provides a new functional perspective on proteins by identifying regulatory DSIs, and offers clues for potential tumor drug target discovery and precision drug application.
Introduction
Ubiquitin, a 76-amino acid protein that is widely expressed and highly conserved in eukaryotes, is conjugated to substrate proteins through a tightly regulated cascade involving the ubiquitin activating enzyme (E1), the ubiquitin conjugating enzyme (E2), and the ubiquitin ligase (E3), and removed from the substrate by deubiquitinase (DUB)1. Protein ubiquitination is a highly prevalent post-translational modification (PTM) that regulates a wide range of cellular processes, including cell proliferation, survival, differentiation and cellular signal transduction2. Like most PTMs, ubiquitination is a dynamic and reversible process3. The degree of protein ubiquitination is determined by the delicate balance between specific E3-mediated ubiquitination and DUB-mediated deubiquitination4. Both E3s and DUBs have been elegantly leveraged for drug development in the forms of PROTAC5 (Proteolysis-targeting chimeras) and DUBTAC6 (Deubiquitinase-targeting chimeras for targeted protein stabilization) technology. Compared to E3s, the less-studied DUBs have been found to exert distinct functions such as oncogenic, tumor-suppressive or context-dependent roles in tumorigenesis, mainly by affecting the protein stability, enzymatic activity or subcellular localization of their substrates7,8. Studies of DSIs have shed light on the mechanisms of cancer therapy and may provide new avenues for drug design.
Several experimental methods have been developed for identifying DSIs, such as protein microarrays9, global protein stability profiling10, mass spectrometry11 and live phage display libraries12. However, because substrates are expressed at low levels and interact only weakly with their enzymes, these methods are often laborious, time-intensive, expensive and inefficient. As a result, although there are >100,000 ubiquitin sites on over 9,000 proteins in the ubiquitination site resource Ubisite13, fewer than 900 human DUB-substrate relationships are collected in the corresponding database14, which means that the regulating DUB is known for only a small proportion of ubiquitinated proteins. Therefore, there is an urgent need to identify proteome-wide DSIs through bioinformatics strategies.
In 2022, we proposed UbiBrowser 2.0 (UB2), a computational algorithm based on a naïve Bayesian classifier that predicts human DSIs by combining multiple types of heterogeneous biological features, including homology, enriched protein domain and function, protein-protein interaction (PPI) network topology, and inferred DUB recognition consensus motif14. UbiBrowser 2.0 is a popular publicly available bioinformatics tool capable of proteome-wide DSI prediction. However, this method relies on feature engineering, and its application at proteome scale is inevitably hindered by the lack of training datasets. To address these challenges, we introduce TransDSI, an explainable transfer learning architecture based on protein sequence only. TransDSI is pre-trained on a sequence similarity network between 20,398 proteins, and fine-tuned on 863 experimentally validated DSIs. In addition, TransDSI includes an explainable module that suggests the critical protein regions for DSIs while predicting them, partially capturing the protein structural basis of DSIs. We conducted proteome-wide scanning and generated a predicted DUB-substrate interaction dataset (PDSID). Two predicted DUBs (USP11 and USP20) for FOXP3, along with two predicted substrates (androgen receptor, AR, and cellular tumor antigen p53) for USP22, were validated by our “wet lab” experiments, contributing to tumor immune escape-related drug target discovery and the precise application of anti-tumor agents, respectively. TransDSI also provides a new perspective for disease omics data analysis by identifying regulatory DUBs for significantly dysregulated proteins in hepatocellular carcinoma (HCC). To facilitate the usage of TransDSI, we made the PDSID and the corresponding program code available on GitHub (https://github.com/LiDlab/TransDSI).
Results
An overview of TransDSI
The deep learning framework of TransDSI consists of four modules. First, given the primary structures of human proteins, the protein coding module uses the conjoint triad (CT) method15 to generate protein sequence features and BLAST16 to construct a sequence similarity network (SSN) (Fig. 1a). Next, the self-supervised learning module uses a variational graph autoencoder (VGAE)17 to process both inputs simultaneously and pre-train a Graph Convolutional Network (GCN)18 encoder, which compresses complex graph-structured data in non-Euclidean space into low-dimensional numerical vectors while preserving as much relevant information from the original input as possible (Fig. 1b; more details can be found in Methods). The DSI-Predictor module then adopts transfer learning: it initializes its parameters from the pre-trained GCN encoder and is fine-tuned using the DUBs and corresponding substrates in the gold standard dataset. Our gold standard positive dataset (GSP) was obtained by expert manual curation, and each interaction pair is supported by traceable literature evidence, while the gold standard negative dataset (GSN) was obtained by randomly sampling protein-protein pairs from the complement graph of the GSP while preserving its network topology (details in Supplementary Fig. 1 and Methods). Specifically, DSI-Predictor concatenates the embeddings of a DUB and a substrate and uses a multilayer perceptron (MLP)19 with four fully connected layers to predict whether a functional interaction exists between them (Fig. 1c). Finally, we developed PairExplainer, an explainable module that learns an optimized mask highlighting the contribution of different positions in the protein sequence to the prediction, which might partially explain the protein structural basis of a DSI (Fig. 1d).
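As an illustration of the sequence features produced by the protein coding module, the following is a minimal sketch of conjoint triad encoding. It assumes the commonly used seven-class residue grouping and a min-max scaling of triad counts; the function name `conjoint_triad`, the handling of non-standard residues and the scaling are illustrative assumptions and may differ from the implementation described in Methods.

```python
from itertools import product

# Seven residue classes commonly used for conjoint triad encoding,
# grouped by side-chain dipole and volume (assumed grouping).
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CT_CLASSES) for aa in group}

# Fixed ordering of the 7^3 = 343 possible class triads.
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}


def conjoint_triad(sequence):
    """Return a 343-dimensional, min-max scaled triad frequency vector."""
    counts = [0] * len(TRIAD_INDEX)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence.upper()]
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            # Assumption: windows containing non-standard residues are skipped.
            continue
        counts[TRIAD_INDEX[triad]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    # Assumption: min-max scaling to reduce the effect of sequence length.
    return [(c - lo) / (hi - lo) for c in counts]


features = conjoint_triad("MKKRK")
print(len(features))
```

Each protein is thus mapped to a fixed-length vector regardless of its sequence length, which is what allows such vectors to serve as node features of a similarity network.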
A key insight of our framework is that there might be many evolutionarily conserved protein regions in the proteome that contribute to DSIs. TransDSI aims to extract these conserved regions from proteome-scale, and exploit evolutionary information of known DSIs to predict unseen DSIs. Intuition tells us that these conserved regions are likely to contribute the most to the predicted results and can be further investigated using perturbation-based explainable methods. More details of the TransDSI can be found in Methods.
TransDSI has the ability to predict true DSIs
To test the effectiveness of the TransDSI framework, we first compared it with UbiBrowser 2.0 (ref. 14), developed by our group, which is a popular publicly available bioinformatics tool for proteome-wide DSI prediction. We also constructed five additional DSI prediction systems using random forest (RF), support vector machine (SVM), eXtreme gradient boosting (XGBoost), logistic regression (LR), and K-nearest neighbors (KNN), based on the same training dataset and protein sequence features as those of TransDSI. Additionally, to demonstrate the importance of using sequence features only, we established a variant of UB2 without PPI network topology and Gene Ontology (GO) term pair features (UB2 w/o NT and GO), in which only sequence features (domain pair and recognition consensus motif) are used.
We compared the performance of TransDSI, UB2, and other machine learning methods for DSI prediction on two distinct subsets within the GSP (Supplementary Data 1): (1) A 5-fold cross-validation dataset collected from a manual curation of PubMed prior to June 2018; (2) An independent test set derived from literature published between June 2018 and August 2021, which does not intersect with the 5-fold cross-validation dataset.
The performance of TransDSI was evaluated using both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). AUROC measures how well a model discriminates between positive and negative examples, whereas AUPRC is more informative about the identification of true positives, especially in imbalanced datasets20. The closer the AUROC/AUPRC is to 1.0, the better the overall performance. TransDSI achieved an AUROC of 0.83 (95% CI = 0.79–0.87) and an AUPRC of 0.95 (95% CI = 0.92–0.96) in 5-fold cross-validation, and an AUROC of 0.75 (95% CI = 0.71–0.80) and an AUPRC of 0.77 (95% CI = 0.70–0.82) in the independent test, outperforming all other methods (Fig. 2a–d). Notably, the performance of UB2 dropped dramatically after removing features such as homologous DUB-substrate interaction, GO term pair, and PPI network topology (Fig. 2a), which suggests that UB2 cannot achieve ideal prediction performance without prior knowledge of the two proteins involved in a DSI (Fig. 2b). This emphasizes the importance of using only protein sequence features for prediction in TransDSI, since protein sequences are the easiest to obtain. In real-world applications, higher specificity is desired to improve the success rate of experimental validation. In the independent test, TransDSI has higher coverage than UB2 in scenarios requiring high specificity. For example, when specificity is set to 0.90, TransDSI has a recall of 45.0%, much higher than that of UB2 (21.1%). This result suggests that, at this particular threshold, TransDSI recovers considerably more true DSIs. By contrast, UB2 sacrifices recall at this specificity and therefore yields reduced coverage.
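The comparison at a fixed specificity can be reproduced from any list of labels and prediction scores. The following minimal sketch shows one way to compute the best recall achievable at or above a target specificity; the function name `recall_at_specificity` and the toy inputs are hypothetical and are not taken from the TransDSI code base.

```python
def recall_at_specificity(labels, scores, target_specificity=0.90):
    """Best recall over thresholds whose specificity is >= the target.

    A pair is predicted positive when its score is >= the threshold.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    for threshold in sorted(set(scores)):
        # Specificity: fraction of negatives scored below the threshold.
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            # Recall: fraction of positives scored at or above the threshold.
            recall = sum(s >= threshold for s in positives) / len(positives)
            best = max(best, recall)
    return best


# Toy example (made-up scores): two positives and two negatives.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(recall_at_specificity(labels, scores, target_specificity=1.0))  # 0.5
```

Fixing specificity rather than the score threshold makes methods with differently calibrated scores directly comparable.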
In addition, sensitivity, specificity, F1-score, positive predictive value (PPV), and negative predictive value (NPV) were employed as evaluation metrics. AUROC and AUPRC are comprehensive metrics that consider all possible thresholds, whereas sensitivity, specificity, PPV, NPV, and F1-score are computed at a single optimal threshold, which was identified using the Youden index20,21. Across all these metrics, TransDSI consistently demonstrates superior predictive performance compared to other prediction methods (Supplementary Data 2).
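For reference, the Youden index is J = sensitivity + specificity - 1, and the optimal threshold is the one that maximizes J. The sketch below illustrates this selection and the threshold-dependent metrics; the function names and the toy data are hypothetical, and ties are resolved here in favor of the lowest threshold, which is an arbitrary choice.

```python
def confusion_metrics(labels, scores, threshold):
    """Threshold-dependent metrics; a score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)

    def ratio(num, den):
        # Return 0.0 for undefined ratios instead of raising.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def youden_threshold(labels, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        m = confusion_metrics(labels, scores, t)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:  # ties keep the lowest threshold (arbitrary choice)
            best_t, best_j = t, j
    return best_t, best_j


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
threshold, j = youden_threshold(labels, scores)
print(threshold, j, confusion_metrics(labels, scores, threshold))
```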
To further test the robustness and efficacy of the proposed TransDSI model to handle real-world scenarios, we performed 30 iterations of random sampling for the negative set construction (protein-protein interaction data). The AUROC and AUPRC of TransDSI on the independent test set exhibited remarkable stability, with standard deviations of only 0.017 and 0.025, respectively (Supplementary Data 2). Meanwhile, we examined the performance of TransDSI and five machine learning methods across diverse negative/positive ratios (1:1, 2:1, 5:1, and 10:1). TransDSI outperforms other methods against all ratios (Supplementary Data 3). These results demonstrate that TransDSI has satisfactory robustness and the potential for practical applications.
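The repeated negative-set construction at different negative/positive ratios can be sketched as follows. This is a simplified illustration, not the procedure in Methods: it assumes that "preserving network topology" means each DUB receives a number of negatives proportional to its number of known substrates, and the function name `sample_negatives` and the example protein lists are hypothetical.

```python
import random


def sample_negatives(positive_pairs, candidate_substrates, ratio=1, seed=0):
    """Sample (DUB, protein) pairs absent from the positive set.

    Assumption: each DUB gets `ratio` times as many negatives as it has
    known substrates, so DUB degrees mirror those of the positive network.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    degree = {}
    for dub, _ in positives:
        degree[dub] = degree.get(dub, 0) + 1
    negatives = []
    for dub in sorted(degree):
        # Complement graph: proteins not known to be substrates of this DUB.
        pool = [s for s in candidate_substrates
                if s != dub and (dub, s) not in positives]
        k = min(len(pool), ratio * degree[dub])
        negatives.extend((dub, s) for s in rng.sample(pool, k))
    return negatives


positives = [("USP7", "DNMT1"), ("USP7", "UHRF1"), ("USP22", "SIRT1")]
proteins = ["DNMT1", "UHRF1", "SIRT1", "P1", "P2", "P3", "P4"]  # P1-P4 are placeholders
# Different seeds give different negative sets, as in repeated sampling.
for seed in range(3):
    print(sample_negatives(positives, proteins, ratio=2, seed=seed))
```

Repeating the sampling with different seeds and re-evaluating the model each time gives the spread of AUROC and AUPRC values from which a standard deviation can be reported.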
TransDSI was then used to perform large-scale proteome-wide DSI scanning, resulting in a predicted DUB-substrate interaction dataset (PDSID) with 19,461 predicted interactions between 85 DUBs and 5,151 substrates (Supplementary Data 4). This predicted DUB-substrate interaction network presents a scale-free degree distribution (linear model fit, R² = 0.93)22.
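A scale-free degree distribution follows P(k) ∝ k^(-γ), which appears as a straight line on a log-log plot. One common way to quantify this is an ordinary least-squares fit of log10 P(k) against log10 k, reporting R². The sketch below illustrates that check under this assumption; the exact fitting procedure used for the PDSID is not stated here, and the function name `power_law_fit` and the toy degree list are hypothetical.

```python
import math
from collections import Counter


def power_law_fit(degrees):
    """Fit log10 P(k) = intercept + slope * log10 k; return (slope, R^2)."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    ks = sorted(counts)
    if len(ks) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(counts[k] / total) for k in ks]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, r2


# Toy degree list in which the number of nodes with degree k scales as k^-2.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, r2 = power_law_fit(degrees)
print(round(slope, 3), round(r2, 3))
```

The negative of the fitted slope estimates the exponent γ, and an R² close to 1 indicates that the distribution is well described by a power law.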
In addition, Gene Ontology (GO)23,24 enrichment analysis revealed that DUBs and their known substrates, as well as predicted substrates, tend to be associated with similar functional categories (Supplementary Data 5), such as protein deubiquitination (ERDUB = 61.11; ERknown SUB = 9.80; ERpredicted SUB = 9.80), non-recombinational repair (ERDUB = 8.64; ERknown SUB = 6.76; ERpredicted SUB = 3.80), and regulation of cytokine-mediated signaling pathways (ERDUB = 7.36; ERknown SUB = 5.76; ERpredicted SUB = 3.46).
We further tested GO term similarity between DUBs and their predicted substrates in terms of biological process (BP) and cellular component (CC). As shown in Fig. 2e, and consistent with known DSIs, DUBs and their predicted substrates tend to be involved in the same biological process and located in the same cellular component, compared to randomly sampled DUB-random protein interactions (DRIs). In addition, we found that DUBs and their predicted substrates tend to be located in tightly connected subgraphs within the PPI network. These findings are consistent with previous reports25 and support the reliability of the PDSID.
Motivated by the abundant and balanced dataset of E3-substrate interactions (ESIs) available in our UbiBrowser 2.0 (ref. 14; containing 4,068 ESIs), we further investigated the applicability of the TransDSI deep learning framework to ESI prediction (Supplementary Fig. 2). Interestingly, on an independent ESI test set, this deep learning framework outperforms other machine learning systems (Supplementary Fig. 3), which indicates that the framework generalizes to a related prediction task.
Explainable module of TransDSI provides partial insights into the protein structural basis of DSI
Some DUB-substrate interactions are mediated by the interacting protein domains and motifs26. We developed a perturbation-based explainable module PairExplainer, which allows us to identify critical protein sequence features.
Firstly, we froze the parameters of the fine-tuned DSI-Predictor module. Then, based on the query DSI, the relevant protein sequence features involved in subgraphs of the DUB and substrate were extracted and disturbed with optimizable feature masks. These masks were trained by minimizing the discrepancy between the disturbed prediction score (“Disturbed Score”) and the original one (“TransDSI Score”). The optimized mask highlights the contribution of different positions in the protein sequence to prediction (Fig. 1d).
For a given query DSI, the PairExplainer algorithm is equivalent to moving a sliding window of 3 residues along the protein sequence with a step size of 1 residue and performing an in silico “knockout” that removes all residues except the triad within the window. It observes the impact of each knockout on the “TransDSI Score” and assigns an importance score to each residue based on the magnitude of the effect across all three triads in which the residue is involved (Fig. 3a).
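The sliding-window view of PairExplainer can be sketched as follows. This is an illustration of the equivalent procedure only, not the mask-optimization code itself: `triad_effect` stands in for whatever effect on the “TransDSI Score” is measured when only one triad is kept, and averaging over the covering windows is an assumed aggregation rule. The toy effect function and the example sequence are hypothetical.

```python
def residue_importance(sequence, triad_effect, window=3):
    """Aggregate per-window effects into per-residue importance scores.

    `triad_effect` is a caller-supplied function (hypothetical here) that
    returns the effect on the prediction score when only the given window
    of residues is kept. Each residue receives the mean effect of all
    windows that contain it (assumed aggregation rule).
    """
    n = len(sequence)
    if n < window:
        raise ValueError("sequence is shorter than the window")
    # Window i covers residues i .. i + window - 1 (step size of 1).
    effects = [triad_effect(sequence[i:i + window]) for i in range(n - window + 1)]
    importance = []
    for pos in range(n):
        first = max(0, pos - window + 1)
        last = min(pos, n - window)
        covering = effects[first:last + 1]
        importance.append(sum(covering) / len(covering))
    return importance


def top_residues(importance, k=10):
    """Indices (0-based) of the k residues with the highest importance."""
    order = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    return order[:k]


def toy_effect(triad):
    # Toy stand-in for the model: lysine-rich triads have a larger effect.
    return float(triad.count("K"))


scores = residue_importance("AKGKA", toy_effect)
print(scores, top_residues(scores, k=2))
```

Residues in the interior of the sequence are covered by three windows and terminal residues by fewer, which is why the sketch averages rather than sums the window effects.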
We predicted the residue-level importance scores for DUBs and their substrates in the GSP and provided the top 10 residues with the highest contribution to the interaction for each protein (Supplementary Data 6). We collected all experimentally confirmed DSI binding sites in literature (9 sites, involving 1 DUB and 5 substrates) and used them to assess these identified residues (Supplementary Data 7). Interestingly, some of these protein features can partially explain the structural basis of DSI. For example, PairExplainer successfully captured the KxxxKxK motif on DNA (cytosine-5)-methyltransferase 1 (DNMT1) and Ubiquitin-like PHD and RING finger domain-containing protein 1 (UHRF1), which is known to bind to Ubiquitin carboxyl-terminal hydrolase 7 (USP7) and is one of only two known USP7 recognition motifs (Supplementary Data 7).
To further elucidate this finding, we selected an experimentally determined crystal structure of the USP7-DNMT1 complex as an example27. Figure 3b illustrates the importance of each residue on the DNMT1 sequence for the interaction. A concentration of red, indicating high importance, can be observed on the lysine residues (K1111/K1113/K1115) in the KG repeat zone (residues 1,109–1,119) of DNMT1. We projected the heatmap onto surface representations of the major interfaces of the DNMT1–USP7 complex (determined by Cheng et al.27 using X-ray crystallography), highlighting the critical residues involved in the interactions through stick representation (Fig. 3c). These findings are in line with previous experimental studies showing that the interaction between USP7 and its substrate DNMT1 is primarily mediated by the acidic pocket of USP7 and the lysine residues in the KG repeat zone of DNMT1 (ref. 27).
Furthermore, PairExplainer can also identify the MATH structural domain mediating the competitive binding of USP7 to p53 and MDM2 (Supplementary Data 7). This domain was reported to play a crucial role in the regulation of the p53-MDM2 signaling pathway, which has implications for understanding the mechanism of tumor suppression and the development of new anticancer drugs28.
Experimental validation of predicted DSIs and their application in oncology research
To verify whether TransDSI can accurately predict potential DSIs, we selected several high-ranking DSIs of biological significance for experimental validation. A series of biochemical experiments were conducted to verify these interactions and the regulatory effect of DUBs on the ubiquitination of substrates. These validated interactions may expand the understanding of tumor development mechanisms from the perspective of deubiquitination-regulated protein homeostasis, contributing to the refinement of cancer patient stratification and the development of personalized treatment strategies.
Potential DUBs that regulate FOXP3. Forkhead box protein P3 (FOXP3), a major transcription factor, mediates the suppressive effect of regulatory T (Treg) cells on antitumor immune responses29. The induction of FOXP3 transcription results from synergy between TGF-β receptor (TGFβR)-activated SMAD3/4 and T cell receptor (TCR)-activated nuclear factor of activated T-cells (NFAT, Fig. 4a)30. Elevated levels of FOXP3 in multiple tumor types have been reported to be associated with worse overall survival31,32,33. However, due to the critical role of FOXP3 in regulating autoimmunity, it cannot be directly targeted for therapy34. Identifying DUBs that regulate both FOXP3 and its upstream regulators will provide valuable insights for the development of potential therapeutic targets related to FOXP3 regulation.
Utilizing TransDSI, we have predicted seven potential DUBs for FOXP3. We selected five DSIs with relatively high ranking for validation (USP18, score: 0.955; UCHL1, score: 0.954; UCHL3, score: 0.954; USP11, score: 0.947; USP20, score: 0.946). We observed that both USP11 and USP20 can act as DUBs to deubiquitinate FOXP3 (Fig. 4a). USP11 has been reported to be able to enhance TGF-β-mediated TH17 cell differentiation and stabilize FOXP3 expression, thereby maintaining the suppressive capacity of Tregs35. However, the underlying molecular mechanism is still unclear. We found that USP11 can bind FOXP3 directly and remove the ubiquitin conjugation on FOXP3, enhancing the stability of FOXP3 (Fig. 4b, c). Considering tumor-infiltrating FOXP3+ Treg cells may promote the immune escape of cancer cells, our findings suggest that USP11 could serve as a potential therapeutic target for the treatment of tumors with FOXP3-induced immune evasion. USP20, on the other hand, has been shown to play important roles in various signaling pathways, such as enhancing the Wnt signaling pathway to promote tumor growth by deubiquitinating β-catenin36. However, no association between USP20 and autoimmune diseases has been reported so far. We found that USP20 can bind and deubiquitinate FOXP3 (Fig. 4d, e). This indicates that USP20 may play a role in regulating the autoimmune process and may be a potential target for inhibiting FOXP3-induced tumor immune escape.
In addition, we predicted the DUBs of some upstream regulators of FOXP3 in the TGF-β pathway. We predicted two DUBs that regulate SMAD3 (UCHL5, score: 0.846) and SMAD4 (USP17, score: 0.881). A literature review showed that these predictions were validated by independent studies37,38 (Fig. 4a).
Candidate substrates of USP22. Ubiquitin-specific protease 22 (USP22) has been implicated in the regulation of multiple signaling pathways (such as the SIRT1/AKT/MRP1 signaling pathway) associated with the development and progression of HCC through its DUB activity39. Elevated USP22 expression is correlated with poor prognosis in HCC patients40, positioning it as a potential therapeutic target for HCC intervention. The peptide hD1 has been identified as a specific inhibitor of USP22 (ref. 41). However, the clinical applicability of USP22 as a drug target in patients remains to be elucidated. Clinical practice with certain targeted drugs, such as PD-1/PD-L1 inhibitors, has demonstrated that the efficacy of these agents is intricately linked to the cellular regulatory network context of their targets42. Identifying the substrates of USP22 holds significant implications for understanding the pathogenesis of tumors and assessing the applicability of potential anticancer agents such as hD1 (ref. 41).
Utilizing TransDSI, we predicted 268 candidate substrates for USP22. We selected five DSIs with relatively high ranking for validation (CLSPN, score: 0.992; p53, score: 0.982; MDM4, score: 0.979; MDM2, score: 0.979; AR, score: 0.875). We observed that USP22 can act as a DUB to deubiquitinate both AR (androgen receptor) and p53 (Fig. 5a). Both interactions were subsequently confirmed via exogenous co-immunoprecipitation (Co-IP) experiments, and USP22 decreased both AR (Fig. 5b, c) and p53 (Fig. 5f, g) ubiquitination in cells.
To further elucidate the clinical significance of the identified deubiquitinating regulatory role of USP22 on AR, we classified 159 HCC patients with HBV infection into four distinct subgroups (G-I, G-II, G-III, and G-IV) based on USP22 and AR protein abundance in the tumor tissues (Fig. 5d; refer to Methods for further details). Notably, we found that subgroups exhibiting high USP22 expression did not present significant prognostic differences in comparison to those with low USP22 expression (G-I and G-II, log-rank P = 0.53; G-III and G-IV, log-rank P = 0.92; Fig. 5e). Intriguingly, a significant disparity in prognosis was observed within USP22 high-expression subgroups (G-II and G-III, log-rank P = 0.043; Fig. 5e). This demonstrates the significance of identifying USP22 substrates for achieving more precise subgrouping and promoting personalized treatment. Specifically, in G-II patients, USP22 might stabilize AR through deubiquitination, subsequently inhibiting HCC progression. Consequently, inhibition of USP22 may result in the downregulation of AR, attenuating its suppressive effect on HCC and rendering therapeutic strategies targeting USP22 unsuitable for this subgroup. In G-III patients, by contrast, AR stabilization was not mediated by USP22, suggesting that USP22-targeted therapy would not elicit AR-related side effects in this population.
In addition, the identified USP22-p53 interaction is of potential biological significance. p53 is a key tumor suppressor that inhibits excessive cell growth and division43. Our current understanding of the relationship between USP22 and p53 is limited to indirect associations. For example, Lin et al. (2012) proposed that USP22 can suppress TP53 transcriptional activation by deubiquitinating SIRT1 (ref. 39). Our findings demonstrate for the first time that USP22 can directly deubiquitinate p53, providing a new perspective on the complexity of p53 regulation.
Use cases of TransDSI in the analysis of disease omics data
Numerous types of high-volume omics datasets especially proteomic data (such as TCGA and CPTAC) are increasing exponentially with the advancement of high-throughput experimental techniques44. In-depth analysis of omics data often begins with exploring key molecules of interest identified from differential expression or survival analysis, to investigate their associations with disease prognosis45,46. In fact, the regulators of these key molecules are also very important for the study of disease mechanisms and expanding the scope of omics analysis for discovering disease biomarkers and potential drug targets. TransDSI can facilitate the identification of potential regulators of these key molecules.
Specifically, we analyzed a cohort of 159 Chinese HCC patients with HBV infection (CHCC-HBV) from Fan’s study47. We found that the M2 isoform of pyruvate kinase (PKM2), a key enzyme in the glycolytic pathway, is significantly upregulated in tumor compared to non-tumor samples (adj. P-value = 1.3e-13, logFC = 0.78). Next, we aimed to associate PKM2 with known tumor-related DUBs, so as to better understand the role of deubiquitination-mediated regulation of PKM2 in the mechanism of liver cancer48. Among the predicted DUBs for PKM2, ubiquitin carboxyl-terminal hydrolase L1 (UCHL1) emerged as a candidate due to its reported tumor-suppressive activity49 and a high confidence score of 0.823 (Fig. 6a). In fact, recent independent studies have confirmed this prediction in Parkinson’s disease. Ham et al. found that UCHL1 can stabilize PKM2 by mediating its deubiquitination, and that loss of UCHL1 can reduce oxidative stress and alleviate neuronal damage by suppressing glycolysis50. However, the role of PKM2 deubiquitination in cancer development and progression remains elusive. Therefore, we next explored the association between UCHL1-mediated PKM2 deubiquitination and tumor prognosis as well as overall survival.
Our analysis on CHCC-HBV revealed a significantly positive correlation between the expression of UCHL1 and PKM2 across all HCC samples (Fig. 6b, R = 0.39, P-value = 5.4e-07). Furthermore, through consensus-clustering analysis based on the abundance of UCHL1 and PKM2, we identified two major protein subgroups among the 159 HCC tumors, with 95 and 64 cases assigned to subgroups G-I and G-II, respectively (Supplementary Fig. 4a and c). The G-II subgroup is characterized by synergistic high expression of UCHL1 and PKM2. The enzymes upstream of PKM2 in the glycolytic pathway, HK and PFKP, are both highly expressed in the G-II type (Supplementary Fig. 4b). Clinical evaluation of these protein subgroups showed that patients in the G-II subgroup had a higher frequency of tumor thrombus (P-value = 1.2e-04), higher AFP level (P-value = 0.011), and advanced TNM stages (P-value = 0.003) compared to those in the G-I subgroup (Kruskal-Wallis test; Fig. 6c). Figure 6d shows that the G-II subgroup had a significantly lower overall survival rate compared to the G-I subgroup (log-rank test P-value = 4.5e-04). These findings suggest that overexpression of UCHL1 may stabilize PKM2 and result in metabolic dysregulation and poor prognosis, and that inhibiting UCHL1 in the G-II subgroup may help improve patient outcomes by downregulating the over-expressed substrate PKM2.
In this study, we report a finding that contributes to understanding the mechanism through which UCHL1 may exert its oncogenic function in HCC. Our results suggest that UCHL1 may contribute to tumor progression in HCC by stabilizing PKM2 via deubiquitination, thereby providing important insights into the potential molecular mechanism of HCC pathogenesis. UCHL1, as the regulatory molecule for PKM2 proposed by TransDSI, may also serve as a potential drug target for the treatment of resistant tumors. This case study shows how TransDSI can be used in the analysis of disease omics datasets, aiding the discovery of potential disease biomarkers or drug targets.
Discussion
We established a protein sequence-based ab initio strategy, TransDSI, for predicting deubiquitinase-substrate interactions. This study transfers proteome-scale evolutionary information to predict potential DSIs. TransDSI outperforms multiple machine learning strategies in both cross-validation and independent testing. Both bioinformatic analysis and experimental validation demonstrate the effectiveness of our strategy. In addition, as a general-purpose transfer learning model, the computational framework of TransDSI can alleviate the problem of insufficient training data and has the potential to predict other less-studied categories of protein-protein interactions, such as those involved in SUMOylation.
Nevertheless, deep learning models constructed on relatively small training datasets remain prone to overfitting, because the model may capture noise in the training data that is not present in the test data. As a result, the model may perform well on the training data but poorly on an independent test set51,52. To avoid possible overfitting, we adopted multiple strategies: (1) Protein embedding: We use a self-supervised learning module that is completely independent of the DSI prediction task to learn protein feature representations53; (2) Neural network training: We use a variety of regularization techniques to prevent overfitting in the neural networks of the protein embedding module (Fig. 1b) and the DSI-Predictor (Fig. 1c), including batch normalization and dropout54; (3) Model evaluation: To simulate the real-world use case, we use an independent test set to evaluate the performance of the DSI model. All DSIs contained in this independent test set were discovered after June 1, 2018, and none of them were used for training55; (4) Gold standard negative dataset construction: The negative dataset we construct has a network topology similar to that of the positive dataset, which helps prevent overfitting56,57; (5) ESI prediction task: We adopted the same TransDSI deep learning framework to successfully establish a prediction system for ubiquitin ligase E3-substrate interactions (Supplementary Figs. 2 and 3), which demonstrates that the framework generalizes to a related prediction task.
Our previous UbiBrowser 2.0 (ref. 14) relies on hand-crafted discriminant features or rules for proteins, so its scalability is hindered by the bottleneck of feature engineering. UB2 cannot be applied if the required features of the proteins (homology, enriched protein domain and function, PPI network topology, and inferred DUB recognition consensus motif) are not available, which greatly limits its application. In TransDSI, we use only protein sequence information, which is the most basic property of proteins; therefore, TransDSI can be easily implemented at the proteome level. Specifically, two protein sequence-based features (CT-encoded protein feature vectors and an SSN based on sequence similarity) are integrated into TransDSI. Removing either the SSN or the CT-encoded sequence information significantly reduced the model’s predictive performance (Supplementary Fig. 5), which suggests that the two features synergistically enhance prediction performance by capturing distinct aspects: CT-encoded features might capture local sequence features of proteins, while the SSN might capture evolutionary correlations between diverse proteins.
The PairExplainer module within TransDSI can identify protein sequence features that reflect associations between DUBs and substrates, and some of these features provide partial insights into the structural basis of DSIs, contributing to ubiquitin-proteasome system (UPS)-related drug design and cancer treatment. However, TransDSI was specifically designed for constructing a proteome-wide DSI network; not all features from PairExplainer are decisive factors for binding, and some DSIs might involve a complex interplay between multiple protein regions. Additionally, due to the limited amount of reported experimentally validated DSI binding site data (currently only 9 sites, involving 1 DUB and 5 substrates), predicting DSI binding sites remains a major challenge. We also tested the possibility of incorporating rich ubiquitination site datasets from mass spectrometry into DSI prediction. However, our analysis did not reveal any association between ubiquitination sites and DSI binding sites, which suggests that this information may not sufficiently aid the accurate differentiation between DSIs and general PPIs (Supplementary Fig. 6). In the future, a deeper understanding of DSIs might enable the use of protein sequence features surrounding ubiquitination sites for DSI prediction.
To elucidate the value of TransDSI, we analyzed several prediction results from multiple perspectives, including drug target discovery and personalized cancer treatment, and disease omics data analysis. Firstly, regarding drug target discovery and personalized treatment, Huang, X. et al. posited that protein levels in the human body are stringently regulated by the UPS, and that its dysregulation may lead to diseases such as cancer, making the UPS a potential drug target for personalized cancer treatment58. Moreover, many oncogenes cannot be directly targeted, rendering the identification of DUBs that regulate them essential for cancer therapy. In this paper, we predicted and experimentally validated two DUBs (USP11 and USP20) for FOXP3, along with two substrates (AR and p53) for USP22. Among them, USP22-AR provides new insights for precise subgrouping and individualized treatment strategies for HCC patients. The deubiquitination of p53 by USP22, as revealed in our study, offers a new perspective on the complexity of p53 regulation. Meanwhile, USP11/USP20-FOXP3 presents a potential therapeutic target for inhibiting FOXP3-induced tumor immune evasion, as FOXP3 itself cannot be targeted due to its important role in autoimmunity34. Secondly, in the context of disease omics data analysis, Barabási, A. L. et al. advocated for a network-based approach to investigate human diseases, emphasizing the inter-regulation of biomolecules to reveal the complexity and diversity of diseases, as opposed to solely focusing on individual dysregulated proteins in omics data59. Inspired by this idea, we predicted that UCHL1 could function as a DUB for PKM2, which exhibits significant upregulation in the HCC protein expression profile. Patient survival information also suggested a relationship between the UCHL1-PKM2 interaction and tumor development, providing new insights into the development of HCC.
To enhance the application of our strategy, we have compiled all the predicted DSIs as a dataset, which contains 19,461 DSIs between 85 DUBs and 5,151 potential substrates. These DSIs, along with some key regions on them and all the code of TransDSI, are available on GitHub (https://github.com/LiDlab/TransDSI). This resource can serve as a valuable tool for biologists to identify candidate DUBs and substrates for deubiquitination studies, and contribute to the understanding of the mechanism of UPS-regulated protein homeostasis.
Methods
Gold standard positive dataset
A human DSI dataset, the Gold Standard Positive dataset (GSP), was sourced from our UbiBrowser 2.0 database14 (downloaded on March 1, 2022). This gold standard dataset was obtained by strict manual curation, and each interaction pair is supported by traceable literature evidence. It was constructed as follows. Firstly, we collected from PubMed all the literature published from 1982 to 2021 that may involve DUB-substrate interactions, using the following keyword combination: (“deubiquitinase” OR “DUB”) AND (“substrate” OR “substrates”). Then, we established a panel of three experienced experts to manually review these papers. Potential DUB-substrate interactions were manually filtered and verified based on the following textual patterns: “D deubiquitylates S…”, “D mediates the deubiquitination of S…”, “D targets S for deubiquitination…”, “D stabilizes S…”, “D suppresses the ubiquitination of S…”, “D plays a crucial role in the deubiquitination of S…”, “S is the substrate of D…”, “S is deubiquitinated and stabilized by D…”, where D is a DUB and S is a substrate. Finally, the GSP consists of 865 manually curated DSIs, each of which is annotated with supporting evidence for the deubiquitination relationship between the DUB and substrate (Supplementary Data 1). The dataset used for 5-fold cross-validation comprised 616 DSIs that had been validated by literature up to June 2018. The independent test set comprised 249 DSIs identified from June 2018 to August 2021.
Gold standard negative datasets
We constructed the Gold Standard Negative dataset (GSN) based on the protocols of Zahiri et al.57 and Yu et al.60, with some modifications. Firstly, we obtained human physical PPIs from BioGRID61 (released 25 January 2022). Next, we randomly selected a negative set with the same number of nodes as that of the positive set from the complement graph of the known DSI network. The connectivity distribution of DUBs in the negative set network is the same as that in the positive set (Supplementary Fig. 1). All interactions in the negative set are PPIs from BioGRID, and none of the interactions in the GSN are present in the GSP.
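The sampling constraints described above (negatives drawn from known PPIs, excluded from the GSP, and matched to the DUB connectivity of the positive set) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `sample_negatives`, the pair-of-strings input layout, and the fallback used when a DUB has too few candidate partners are all assumptions.

```python
import random

def sample_negatives(positives, ppis, seed=0):
    """Sample degree-matched negative DUB-protein pairs (illustrative sketch).

    positives: set of (dub, substrate) pairs, i.e. the GSP.
    ppis: set of (dub, partner) physical interactions, e.g. from BioGRID.
    For each DUB, draw as many non-GSP PPI partners as it has known substrates.
    """
    rng = random.Random(seed)

    # Number of known substrates per DUB in the positive set.
    degree = {}
    for dub, _ in positives:
        degree[dub] = degree.get(dub, 0) + 1

    negatives = set()
    for dub, k in sorted(degree.items()):
        # Candidates: PPI partners of this DUB that are not known substrates.
        candidates = sorted(p for d, p in ppis if d == dub and (d, p) not in positives)
        # Assumption: if too few candidates exist, take all of them,
        # so the degree match becomes approximate for that DUB.
        chosen = rng.sample(candidates, min(k, len(candidates)))
        negatives.update((dub, p) for p in chosen)
    return negatives
```

Sampling per DUB, rather than over all pairs at once, is one simple way to keep the DUB degree distribution of the negative network equal to that of the positive network.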
Protein sequence processing
In this study, protein sequences are derived from UniProt62 (accessed on October 4, 2022) and encoded using the conjoint triad (CT) method15, a widely applied method for encoding amino acid sequences in related fields. The CT method clusters 20 different amino acids into 7 classes based on the dipoles and volumes of the side chains, and the mapping between classes and amino acids is presented in Supplementary Data 8. The CT encoding captures the properties of a single amino acid and its adjacent amino acids by treating any three consecutive amino acids as a unit and counting their occurrence frequency within the protein sequence. This results in a fixed-dimension representation of the amino acid sequence with a dimensionality of 343 (7 \(\times\) 7 \(\times\) 7).
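A minimal sketch of the CT encoding is given below. The seven-class grouping is the one commonly quoted for the conjoint triad method and is assumed here, since the article's own mapping is in Supplementary Data 8; normalizing counts to relative frequencies and skipping windows that contain non-standard residues are likewise assumptions, and `ct_encode` is a hypothetical name.

```python
from itertools import product

# Seven amino-acid classes commonly used for the conjoint triad method
# (assumed; the article's mapping is given in Supplementary Data 8).
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CT_CLASSES) for aa in group}

# All 7 x 7 x 7 = 343 class triads, each mapped to a fixed vector position.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}

def ct_encode(sequence):
    """Return a 343-dimensional vector of class-triad frequencies."""
    counts = [0] * len(TRIADS)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues over the sequence.
    for triad in zip(classes, classes[1:], classes[2:]):
        if None in triad:  # window contains a non-standard residue
            continue
        counts[TRIAD_INDEX[triad]] += 1
    total = sum(counts)
    # Assumption: normalize counts to relative frequencies.
    return [c / total for c in counts] if total else [0.0] * len(TRIADS)
```

For example, `ct_encode("AGVAGV")` places all of its weight on the first triad, because A, G and V fall into the same class.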
To construct a sequence similarity network (\({SSN}\)), we used BLASTp (version 2.13.0+) to compare all reviewed human protein sequences in UniProt and screened for protein pairs with an E-value < 1e-4. The \({SSN}\) is represented as an undirected graph, \({SSN}=(V,E,X)\), where \(V\) represents the set of proteins in the \({SSN}\) and \({e}_{{ij}}=\langle {v}_{i},{v}_{j}\rangle \in E\) represents an interaction in the \({SSN}\). The adjacency matrix \({SSM}\) is used to represent the topological structure of the \({SSN}\), with the sequence identity score between the proteins serving as weights: \({{SSM}}_{{ij}}\) = sequence identity between \({v}_{i}\) and \({v}_{j}\) if \({e}_{{ij}}\in E\), otherwise \({{SSM}}_{{ij}}=0\) (assuming that each node is not connected to itself and setting the diagonal elements to 0).
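The construction of \({SSM}\) from pairwise BLASTp hits might look like the sketch below. The input layout (tuples of query, subject, percent identity and E-value), the rescaling of identity to the interval [0, 1], and keeping the larger identity when a pair is hit in both directions are assumptions made for illustration; `build_ssm` is a hypothetical helper.

```python
def build_ssm(hits, evalue_cutoff=1e-4):
    """Build a symmetric, zero-diagonal similarity matrix from BLASTp hits.

    hits: iterable of (query, subject, percent_identity, evalue) tuples
    (hypothetical input layout, e.g. parsed from tabular BLAST output).
    Returns (proteins, matrix) with matrix[i][j] = identity scaled to [0, 1].
    """
    # Keep significant hits and drop self-hits (the diagonal stays 0).
    kept = [(q, s, pid) for q, s, pid, ev in hits if ev < evalue_cutoff and q != s]
    proteins = sorted({p for q, s, _ in kept for p in (q, s)})
    index = {p: i for i, p in enumerate(proteins)}

    n = len(proteins)
    ssm = [[0.0] * n for _ in range(n)]
    for q, s, pid in kept:
        i, j = index[q], index[s]
        # BLAST hits are directional; keep the larger identity so that
        # the resulting graph is undirected (an assumption of this sketch).
        weight = max(ssm[i][j], pid / 100.0)
        ssm[i][j] = ssm[j][i] = weight
    return proteins, ssm
```

Proteins without any significant hit do not appear in this small matrix; in a proteome-wide setting they would be kept as isolated nodes.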
To preserve node-specific information and balance the contribution of node neighbors and the node itself in feature extraction, we normalized the \({SSM}\) to obtain the normalized sequence similarity matrix \(A\), which serves as the input to the GCN encoder.
Here, \({D}_{1}\) and \({D}_{2}\) are the weighted degree matrices of \({SSM}\) and \(\widetilde{A}\), respectively.
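The displayed formulas for this normalization are not preserved in this text. One reading that is consistent with the description is sketched below; it assumes a symmetric normalization of \({SSM}\), the addition of an identity matrix \(I\) as self-loops to retain node-specific information, and a second symmetric normalization. These choices are assumptions, not the confirmed formulas.

```latex
% Assumed form; D_1 and D_2 are diagonal weighted degree matrices.
\begin{aligned}
(D_1)_{ii} &= \sum_j SSM_{ij}, &
\widetilde{A} &= D_1^{-1/2}\, SSM \, D_1^{-1/2} + I, \\
(D_2)_{ii} &= \sum_j \widetilde{A}_{ij}, &
A &= D_2^{-1/2}\, \widetilde{A}\, D_2^{-1/2}.
\end{aligned}
```

Under this reading, the identity term gives each protein a fixed self-weight relative to its normalized neighbors, which matches the stated aim of balancing the contribution of a node and its neighbors.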
Protein sequence feature embedding (variational graph autoencoder model)
The TransDSI model is composed of three main components: a protein sequence feature embedding module (variational graph autoencoder model), a deubiquitinase-substrate interaction prediction module (DSI-Predictor), and an explainable module (PairExplainer).
VGAE is a machine learning technique for unsupervised feature extraction that generates latent representations that incorporate both network structure and node features17. The method trains an encoder and decoder in parallel.
The graph encoder is built using GCN and aims to project the protein sequence features X onto the latent features \(Z\), leveraging network evolutionary information represented in the graph-structured \({SSN}\) matrix. The resulting numerical matrix of protein embeddings, \(Z\), serves as an interpretable latent representation for undirected graphs learned using VGAE. This representation is used to effectively compress complex graph structure data in non-Euclidean space into simple, low-dimensional numerical vectors while preserving as much relevant information from the original input as possible.
We defined a spectral convolution function \({f}_{{\rm{gcn}}}\)17:
Here, we utilized the input \({Z}^{(l)}\) in a convolutional operation, yielding the output \({Z}^{(l+1)}\). The normalized sequence similarity matrix \(A\) serves as the kernel for this calculation. In our work, \({Z}^{(0)}\) is initialized as X. The individual layers of our graph convolutional network can be defined as follows:
where \({W}^{(l)}\in {{\mathbb{R}}}^{{d}_{l}\times {d}_{l+1}}\). \({d}_{l}\) is the dimension of input for convolution, \({d}_{l+1}\) is the dimension of output after convolution. Our graph encoder consists of two GCN layers, and we let the prior over the latent variables Z be the centered isotropic multivariate Gaussian63:
The encoder outputs the mean \(\mu\) and standard deviation \(\sigma\) of the approximate posterior over the latent variables \(Z\), from which each latent vector is sampled using the reparameterization trick, \({z}_{i}={\mu }_{i}+{\sigma }_{i}\odot {\epsilon }_{i}\),
where \(\odot\) is element-wise multiplication and \({\epsilon }_{i}\sim {\mathcal{N}}(0,1)\).
Next, we define a basic inner product decoder that aims to reconstruct \(A\) using the learned latent variable \(Z\):
Finally, to maximize the similarity between the reconstructed sequence similarity \(\hat{A}\) and the normalized sequence similarity matrix \(A\), we optimized the model by minimizing the following loss function:
In this study, the Kullback-Leibler (KL) divergence, \({\rm{KL}}\left[{\rm{q}}(\cdot )\parallel {\rm{p}}(\cdot )\right]\), is employed to quantify the dissimilarity between the distributions \({\rm{q}}(\cdot )\) and \({\rm{p}}(\cdot )\)64. As \({\rm{p}}\left(Z\right)\) is assumed to follow a normal distribution with mean 0 and standard deviation 1 (i.e., \({\rm{p}}(Z)\sim {\mathcal{N}}(0,1)\)), the cost function represents the capability of the model in reconstructing the input network and aligning the latent variables with \({\rm{p}}(Z)\). The optimization of the cost function with respect to the parameters of the encoder is performed using stochastic gradient descent.
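The displayed equations of this subsection are not preserved in this text. For orientation, the standard variational graph autoencoder formulation of Kipf and Welling17, written with the symbols used here, is summarized below. The choice of ReLU as activation and the exact form of the reconstruction term for the weighted matrix \(A\) are assumptions and may differ from the authors' implementation.

```latex
\begin{aligned}
% GCN layer, with Z^{(0)} = X and A the normalized sequence similarity matrix
Z^{(l+1)} &= f_{\mathrm{gcn}}\left(Z^{(l)}, A\right)
           = \mathrm{ReLU}\left(A\, Z^{(l)}\, W^{(l)}\right), \\
% Centered isotropic Gaussian prior over the latent variables
p(Z) &= \prod_i \mathcal{N}\left(z_i \mid 0, I\right), \\
% Inner product decoder (sigma is the logistic sigmoid)
\hat{A} &= \sigma\left(Z Z^{\top}\right), \\
% Loss: negative evidence lower bound
\mathcal{L} &= -\,\mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right]
             + \mathrm{KL}\left[q(Z \mid X, A)\,\|\,p(Z)\right].
\end{aligned}
```

The first term of the loss rewards accurate reconstruction of the similarity network, and the KL term pulls the approximate posterior towards the prior, which matches the two roles attributed to the cost function in the text.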
The deubiquitinase-substrate interaction prediction module (DSI-Predictor)
The DSI-Predictor module comprises two components, a GCN encoder and an MLP. The GCN encoder is identical to the GCN encoder in the VGAE module. The MLP component is described in further detail below.
MLP is widely recognized as a powerful and prevalent method for supervised learning65. In our study, protein embeddings obtained from the DUBs and their substrates are concatenated and input into a 4-layer fully connected neural network. We defined the function \({f}_{{\rm{mlp}}}\) as follows:
Here, \({P}^{(l)}\) serves as the input and \({P}^{(l+1)}\) represents the output of each MLP layer. The matrix of filter parameters, \({W}^{(l)}\), and the bias of each layer, \({b}^{(l)}\), are learned during the training process. The operation performed at each layer of the MLP can be defined as follows:
The MLP in this study consists of four layers, with batch normalization and dropout implemented between each layer. Here, \({W}^{(l)}\in {{\mathbb{R}}}^{{d}_{l}\times {d}_{l+1}}\), where \({d}_{l}\) is the dimension of the input and \({d}_{l+1}\) is the dimension of the output.
The probability \({P}^{\left(4\right)}\) obtained from the MLP is transformed into a “TransDSI Score” for confidence assessment through the following sigmoid function.
Here, the temperature parameter T was set to 2 to fine-tune the smoothness of the score distribution, and enhance the final TransDSI score’s discriminative power66.
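The displayed score formula is not preserved in this text. A temperature-scaled sigmoid of the kind described can be sketched as follows, assuming that \({P}^{(4)}\) is the raw output of the last MLP layer and that the score takes the form \(1/(1+{e}^{-{P}^{(4)}/T})\); the function name and this exact form are assumptions.

```python
import math

def transdsi_score(logit, temperature=2.0):
    """Map a raw MLP output to (0, 1) with a temperature-scaled sigmoid.

    Assumed form: score = 1 / (1 + exp(-logit / temperature)).
    A larger temperature flattens the curve and spreads scores out.
    """
    x = logit / temperature
    # Numerically stable evaluation for large positive or negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

With T = 2, an MLP output of 2.0 maps to about 0.73 instead of about 0.88 for T = 1, so high outputs are compressed less sharply towards 1.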
The explainable module of TransDSI (PairExplainer)
Graph Neural Networks (GNNs) are neural network models that incorporate the dependencies within a graph through message passing between the nodes of the graph18. However, explaining their predictions is challenging because these models combine both graph structure and node information. Several explanation methods for GNNs have been proposed to address this issue, but they mainly focus on explaining predictions made by GNNs for node classification and graph classification tasks67,68.
In this work, we propose PairExplainer, a general approach to provide an explanation of node features for link predictions made by any GNN-based model. The underlying principle of PairExplainer is that retaining important node features should result in a prediction that is similar to the original prediction made by the GNN-based model.
Let G denote a graph with edges E and nodes V, where each node is associated with a d-dimensional feature representation. Given a node pair \(\left({v}_{i},{v}_{j}\right)\) and a GNN model \(\Phi\), let \({N}_{i}\) and \({N}_{j}\) denote the h-hop neighbors of nodes \({v}_{i}\) and \({v}_{j}\), respectively. Our aim is to extract the neighborhood features \({X}_{{N}_{i}}=\left\{\left.{x}_{k}\right|{v}_{k}\in {N}_{i}\right\}\), \({X}_{{N}_{j}}=\left\{\left.{x}_{k}\right|{v}_{k}\in {N}_{j}\right\}\) and the corresponding subgraphs \({G}_{{N}_{i}}\) and \({G}_{{N}_{j}}\) that are crucial for the prediction result of the link between \({v}_{i}\) and \({v}_{j}\). These features and subgraphs impact the feature representation of \({v}_{i}\) and \({v}_{j}\) during the GNN aggregation process.
In the task of predicting DSIs, the specific DUB and substrate are denoted as \({v}_{i}\) and \({v}_{j}\), respectively. The protein sequence features of these enzymes and substrates are encoded using the CT method and are represented by the feature vectors \({x}_{i}\) and \({x}_{j}\). G corresponds to the sequence similarity network (SSN).
Our model DSI-Predictor, denoted as \(\Phi\), has been fine-tuned on a set of gold standard datasets. The goal of PairExplainer is to learn two feature masks \({M}_{{v}_{i}},{M}_{{v}_{j}}\in {{\mathbb{R}}}^{1\times d}\), which intuitively assess the importance of each feature in the prediction process. If a feature is not important, then masking it should not significantly decrease the prediction probability. The prediction probabilities after masking the features are presented as follows:
The element-wise multiplication is denoted as \(\odot\) and the sigmoid function, which maps the mask to the interval [0, 1], is denoted as \(\sigma\). The model is optimized using the loss function defined as follows:
Here, \(\alpha\) and \(\beta\) are hyper-parameters that balance the loss function by controlling the contribution of each term in the optimization process, and \({m}_{k}\in {M}_{{N}_{i}}\cup {M}_{{N}_{j}}\).
TransDSI model hyperparameters setting
The protein sequence feature embedding module employs a VGAE model, which consists of a two-layer GCN encoder17. The input and output dimensions of each GCN layer are set to 343. The optimization algorithm used is Adam with an initial learning rate of 1e-4, and a dropout rate of 0.1. The VGAE model is trained for 100 epochs.
The DSI-Predictor module consists of both a GCN encoder and an MLP. The parameters of the GCN encoder are transferred from the VGAE model, and the hyperparameters are set to the same values as those in the VGAE. The four layers of the MLP have input and output dimensions of 686, 512, 256, 64 and 1, respectively. The optimization algorithm is Adam with an initial learning rate of 1e-4, a dropout rate of 0.4, and a batch size of 64. The DSI-Predictor module is trained for 100 epochs.
In the PairExplainer module, the optimization algorithm is Adam with an initial learning rate of 0.01, and the model is trained for a total of 10,000 epochs. The hyper-parameters α and β are set to 1 and 0.5 respectively for the loss calculation.
Determining the optimal classification threshold
The Youden index21 was used to determine the optimal classification threshold for TransDSI score (0.701) by balancing the sensitivity and specificity of the model (Youden index = sensitivity + specificity − 1). We calculated the Youden index against different thresholds, and we chose the threshold that maximizes the Youden index as the recommended cutoff. DSIs with scores > 0.701 are considered to be of high reliability. TransDSI score is a score between 0 and 1, with higher scores indicating higher reliability. Meanwhile, we also provide those predicted DSIs with scores smaller than the recommended cutoff in the Supplementary Data 4. Users can fine-tune the threshold based on their needs: lowering it for increased sensitivity (more clues) or raising it for enhanced specificity (more accurate results).
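The threshold search described above can be illustrated with a short sketch. It scans the observed scores as candidate cutoffs and calls a pair positive when its score is strictly greater than the cutoff, following the "scores > 0.701" convention in the text; the function name and the restriction of candidates to observed scores are assumptions, and both classes must be present in `labels`.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: predicted scores in [0, 1]; labels: 1 for positives, 0 for negatives.
    A pair is predicted positive when its score is strictly above the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        sensitivity = tp / n_pos
        specificity = 1 - fp / n_neg
        j = sensitivity + specificity - 1
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j
```

Lowering the returned threshold trades specificity for sensitivity, and raising it does the opposite, which is the adjustment offered to users in the text.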
Five-fold cross-validation
To assess the overall performance of the various models, a 5-fold cross-validation protocol was used. The DSI dataset used for 5-fold cross-validation comprised 616 DSIs that had been validated by literature up to June 2018. This positive dataset and the corresponding negative dataset were randomly divided into five approximately equal subsets. The model was then trained and tested five times, with each fold serving as the test set once and the remaining four folds used for training. Finally, we summed the numbers of TP (true positive), FP (false positive), TN (true negative), and FN (false negative) obtained from the five tests against all possible thresholds to calculate the TP/FP ratio, the sensitivity (TP/T) and specificity (1-FP/F) for the ROC curve, where T and F denote the total numbers of positive and negative samples, and the precision (TP/(TP + FP)) and recall (TP/(TP + FN)) for the PR curve. Then, we used the Youden index method to find the optimal threshold for all methods21, and based on this threshold we calculated NPV (TN/(TN + FN)), PPV (TP/(TP + FP)), and F1-score to evaluate the predictive performance of the model. In each training session, the training set and the validation set are independent of each other, so the validation set is not used for performance tuning.
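The pooling of confusion counts across folds and the derived metrics can be written compactly as below. This is an illustrative sketch for a single threshold; the function name and the input layout (one `(TP, FP, TN, FN)` tuple per fold) are assumptions, and the counts in the usage note are invented for illustration only.

```python
def pooled_metrics(fold_counts):
    """Sum per-fold (TP, FP, TN, FN) counts and derive summary metrics.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one pair is predicted in each class.
    """
    tp, fp, tn, fn = (sum(counts[i] for counts in fold_counts) for i in range(4))
    sensitivity = tp / (tp + fn)      # TP / T, identical to recall
    specificity = 1 - fp / (fp + tn)  # 1 - FP / F
    ppv = tp / (tp + fp)              # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }
```

For instance, five folds that each yield 8 TP, 2 FP, 8 TN and 2 FN pool to 40, 10, 40 and 10, giving a value of 0.8 for every metric. Repeating the calculation over a grid of thresholds yields the points of the ROC and PR curves.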
Bioinformatics analysis on DSIs
The Gene Ontology (GO)23,24 enrichment analysis is conducted utilizing the R package clusterProfiler, encompassing DUBs, all known substrates, and all predicted substrates. Subsequently, an in-depth investigation is performed on the GO terms exhibiting significant enrichment for both DUB and its known or predicted substrates (Supplementary Data 5).
We use GO semantic similarity to measure the functional association between DUB and substrate. To quantify the semantic similarity between the GO terms of two proteins, we utilized the R package GOSemSim69,70.
A protein-protein interaction (PPI) network is constructed using the interactions recorded by BioGRID (version: BIOGRID-ALL-3.4.159)71. To measure the neighborhood similarity between two proteins in the PPI network, we calculated the P-value from a Fisher’s exact test on the overlap of their network neighbors:
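The displayed formula is not preserved in this text. A one-sided Fisher's exact test on neighbor overlap is equivalent to the upper tail of a hypergeometric distribution, which can be sketched with the standard library as follows. Treating the test as one-sided (enrichment of shared neighbors) and using the total number of proteins in the network as the background are assumptions, and the function name is hypothetical.

```python
from math import comb

def neighbor_overlap_pvalue(neighbors_a, neighbors_b, n_proteins):
    """One-sided Fisher's exact test P-value for shared network neighbors.

    Returns the probability of observing at least the actual overlap if
    the neighbors of protein B were drawn at random from n_proteins
    proteins (hypergeometric upper tail).
    """
    a, b = set(neighbors_a), set(neighbors_b)
    overlap = len(a & b)
    total = comb(n_proteins, len(b))
    tail = sum(
        comb(len(a), i) * comb(n_proteins - len(a), len(b) - i)
        for i in range(overlap, min(len(a), len(b)) + 1)
    )
    return tail / total
```

A smaller P-value indicates that two proteins share more neighbors than expected by chance, so a transformed value such as its negative logarithm can serve as a neighborhood similarity score.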
HCC protein subgrouping analysis based on predicted DSIs
We obtained an HCC proteomics dataset (n = 159) from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) of the National Cancer Institute47. The data is normalized using median standardization across all proteins to account for sample loading differences. Consensus clustering is performed using the R package ConsensusClusterPlus47. The protein abundance of the predicted DSI (UCHL1-PKM2) is subjected to k-means consensus clustering with the following parameters: 1,000 bootstrap repetitions, pItem = 0.8 (resampling 80% of the samples), pFeature = 1 (resampling all proteins), and k-means clustering with up to 6 clusters. Euclidean distance is used as the measure of sample clustering. The number of clusters is determined based on three criteria: the average pairwise consensus matrix within consensus clusters, the delta plot of the relative change in the area under the cumulative distribution function (CDF) curve, and the average silhouette distance for consensus clusters. The clustering results obtained using UCHL1-PKM2 indicated that k = 2 or k = 3 clusters are the optimal solutions for clustering, as evidenced by the average silhouette width, the consensus CDF, and the delta plot. The k = 2 cluster consensus matrix is deemed to have the cleanest separation among clusters and the lowest proportion of ambiguous clustering (PAC)72. Therefore, the HCC proteomic data is clustered into 2 groups (Supplementary Fig. 4a). Similarly, for USP22-AR, the k = 4 solution was chosen as the final clustering result using PAM consensus clustering with the Canberra distance metric (Supplementary Fig. 4c). In the CHCC-HBV cohort, the expression of AR was not detected in the samples of five patients (T385, T387, T391, T393, T395), so we removed these patients’ samples when performing subgrouping analysis based on USP22-AR protein abundance. Survival analysis of patient stratification in different subgroups is performed using the consensus clustering results.
The Log-rank test is used to compare survival outcomes between the two subgroups generated by proteomics clustering, and Kaplan–Meier survival curves are plotted using the ggsurvplot function of the R package survminer. Results with P-values < 0.05 are considered statistically significant. Log2(hazard ratio) of each protein is calculated using Cox proportional hazards regression analysis. The association between clinical information and protein subgroups is examined using the Kruskal-Wallis test for categorical data versus continuous data47. The Pearson correlation test is used to calculate the correlation coefficient and P-value.
Experimental validation protocols
Cell lines: Human embryonic kidney HEK293T cells were purchased from ATCC. HEK293T cells were cultured in Dulbecco’s modified Eagle’s medium (DMEM, Gibco) supplemented with fetal bovine serum (FBS, Gibco) and penicillin/streptomycin at 37 °C in a humidified 5 % CO2 incubator.
Plasmids, antibodies and cell transfection: The human DUB library was purchased from OriGene Technologies. Full-length AR, TP53 and FOXP3 were cloned into pFlag-CMV-2 vectors (MLCC, L3435) as indicated. The antibodies used in this study are as follows: anti-Myc (MBL, Cat# M047-3, RRID:AB_591112, 1:2000 for IB, 1:500 for IP), anti-Flag (MBL, Cat# M185-3, RRID:AB_10950447, 1:1000 for IB, 1:500 for IP), and anti-HA (MBL, Cat# M180-3, RRID:AB_10951811, 1:1000 for IB). As the secondary antibody, goat anti-mouse IgG (H + L) was used (Jackson, Cat# 115-035-003, RRID: AB_10015289, 1:4000 for IB), and detection was done with the SuperSignal™ West Pico PLUS chemiluminescent Detection Reagent (ThermoFisher, 34577). Cells were transfected with various plasmids using TurboFect reagent (ThermoFisher, R0534) according to the manufacturer’s protocol.
Immunoprecipitation: Cells were lysed with TNTE 0.5 % (50 mM tris-HCl (pH 7.5), 150 mM NaCl, 1 mM EDTA, and 0.5 % Triton X-100) with protease inhibitor (MCE). Lysates were incubated with the indicated primary antibody for 3 h and then with protein A/G agarose beads (Santa Cruz) overnight at 4 °C; the beads were then washed with TNTE 0.5 % buffer three times. The lysates and immunoprecipitates were analyzed by western blotting with the indicated antibodies.
Ubiquitination assay: For the in vivo ubiquitination assay, HEK293T cells were transfected with various plasmids and treated with the proteasome inhibitor MG132 (20 μM; Sigma) for 8-10 h before collection. Cells were lysed in RIPA buffer (50 mM Tris-HCl (pH 7.5), 1 % NP-40, 1 % sodium deoxycholate, 10 % glycerol, 150 mM NaCl, 5 mM EDTA, and 0.1 % SDS) and then incubated with the indicated primary antibody for 3 h and protein A/G agarose beads (Santa Cruz) overnight at 4 °C. After washing three times, the lysates and immunoprecipitates were analyzed by western blotting with the indicated antibodies.
The uncropped scans of blots in Fig. 4 and Fig. 5 have been included in the source data file (“Fig. 4b–e”, “Fig. 5b, c”, “Fig. 5f, g” sheets). The areas that have been cropped are indicated by dashed lines in these uncropped scans.
Construction of TransESI model for predicting ubiquitin E3 ligase-substrate interactions
We constructed a model (TransESI) for predicting E3-substrate interactions (ESIs) following the same protocol as the TransDSI framework. Firstly, we constructed a gold standard positive dataset of ESIs using a manual curation approach. We split the gold standard ESI dataset into a training set and an independent test set. The training set comprised 2,367 ESIs identified before June 2018, while the independent test dataset comprised 472 ESIs identified from June 2018 to August 2021. Next, we constructed a gold standard negative dataset of ESIs based on PPIs from BioGRID (released 25 January 2022). We randomly selected a negative set of the same size as the positive set from the complement of the known ESI network. During this process, we ensured that the connectivity distribution of E3s in the negative set was consistent with that in the positive set. Then, we designed the same deep transfer learning framework for ESIs (Supplementary Fig. 2), trained the TransESI model on ESI data using a similar approach to TransDSI, and tested it on the independent test dataset.
Statistics and reproducibility
All statistical analyses were performed using R (versions 4.0.2 and 4.1.3). ROC curves were plotted and smoothed, and the area under the curve (AUROC) and its 95% confidence interval were calculated. To determine whether there were nonrandom associations between two categorical variables, the one-tailed Wilcoxon test and the hypergeometric test were used, with statistical significance considered at P-value < 0.05. The Kruskal-Wallis test was used to analyze the clinical data. All survival analyses among the proteomic subtypes used the Kaplan–Meier method; P-values were calculated using the Log-rank test. The hazard ratio (HR) was calculated from Cox proportional hazards regression analysis. P-value < 0.05 was considered statistically significant. Five machine learning models (RF, XGBoost, SVM, LR and KNN) were implemented in scikit-learn v1.0.1. The experiments in this study were independently repeated at least three times, and similar results were obtained.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All relevant data supporting the key findings of this study are available within the article and its Supplementary Information files. The DUB-substrate interaction (DSI), protein-protein interaction (PPI), protein sequence data, GO annotation data, and HCC proteomics data used in this study are available in the Zenodo database under accession code https://zenodo.org/records/10468044. These datasets were sourced from publicly available databases as follows: the Human DSI dataset from UbiBrowser 2.0 [http://ubibrowser.ncpsb.org.cn/], the Human PPI dataset from BioGRID [https://thebiogrid.org/], and the protein sequence data and GO annotation data from UniProt [https://www.uniprot.org/]. The HCC proteomic data used in this study were sourced from the CPTAC research and are accessible in the PDC database at https://pdc.cancer.gov/pdc/study/PDCO00198. The PDB entry 4YOC, which is utilized for visualizing the USP7-DNMT1 complex, can be accessed from the Protein Data Bank at [https://doi.org/10.2210/pdb4yoc/pdb]. The data generated in this study are provided in the Supplementary Data files. Source data are provided with this paper.
Code availability
The source code of TransDSI is packaged as π-TransDSI and available at GitHub (https://github.com/LiDlab/TransDSI) and Zenodo73 (https://zenodo.org/records/10866136).
References
Pickart, C. M. Mechanisms underlying ubiquitination. Annu. Rev. Biochem. 70, 503–533 (2001).
Popovic, D., Vucic, D. & Dikic, I. Ubiquitination in disease pathogenesis and treatment. Nat. Med. 20, 1242–1253 (2014).
Song, L. & Luo, Z. Q. Post-translational regulation of ubiquitin signaling. J. Cell Biol. 218, 1776–1786 (2019).
Sun, T., Liu, Z. & Yang, Q. The role of ubiquitination and deubiquitination in cancer metabolism. Mol. Cancer 19, 146 (2020).
Bekes, M., Langley, D. R. & Crews, C. M. PROTAC targeted protein degraders: the past is prologue. Nat. Rev. Drug Discov. 21, 181–200 (2022).
Lange, S. M., Armstrong, L. A. & Kulathu, Y. Deubiquitinases: From mechanisms to their inhibition by small molecules. Mol. Cell. 82, 15–29 (2022).
Zheng, Q. et al. Dysregulation of ubiquitin-proteasome system in neurodegenerative diseases. Front. Aging Neurosci. 8, 303 (2016).
Deng, L., Meng, T., Chen, L., Wei, W. & Wang, P. The role of ubiquitination in tumorigenesis and targeted drug discovery. Signal Transduct. Target Ther. 5, 11 (2020).
Loch, C. M. & Strickler, J. E. A microarray of ubiquitylated proteins for profiling deubiquitylase activity reveals the critical roles of both chain and substrate. Biochim Biophys. Acta 1823, 2069–2078 (2012).
Yen, H. C. & Elledge, S. J. Identification of SCF ubiquitin ligase substrates by global protein stability profiling. Science 322, 923–929 (2008).
Yumimoto, K., Matsumoto, M., Oyamada, K., Moroishi, T. & Nakayama, K. I. Comprehensive identification of substrates for F-box proteins by differential proteomics analysis. J. Proteome Res. 11, 3175–3185 (2012).
Guo, Z., Wang, X., Li, H. & Gao, Y. Screening E3 substrates using a live phage display library. PLoS One 8, e76622 (2013).
Huang, C. H. et al. UbiSite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. BMC Syst. Biol. 10, 49–61 (2016).
Wang, X. et al. UbiBrowser 2.0: a comprehensive resource for proteome-wide known and predicted ubiquitin ligase/deubiquitinase-substrate interactions in eukaryotic species. Nucleic Acids Res. 50, D719–D728 (2022).
Shen, J. et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl Acad. Sci. USA 104, 4337–4341 (2007).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Kipf, T. N. & Welling, M. Variational graph auto-encoders. arXiv https://doi.org/10.48550/arXiv.1611.07308 (2016).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv https://doi.org/10.48550/arXiv.1609.02907 (2017).
Pal, S. K. & Mitra, S. Multilayer perceptron, fuzzy sets, and classification. IEEE Trans. Neural Netw. 3, 683–697 (1992).
Davis, J. & Goadrich, M. The relationship between precision-recall and ROC curves. In Proc. 23rd International Conference on Machine Learning. 6, 233–240 (2006).
Ruopp, M. D., Perkins, N. J., Whitcomb, B. W. & Schisterman, E. F. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. 50, 419–430 (2008).
Barabasi, A. L. Scale-free networks: a decade and beyond. Science 325, 412–413 (2009).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet. 25, 25–29 (2000).
Gene Ontology, Consortium et al. The gene ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Chen, D. et al. An integrative pan-cancer analysis of biological and clinical impacts underlying ubiquitin-specific-processing proteases. Oncogene 39, 587–602 (2020).
Komander, D., Clague, M. J. & Urbe, S. Breaking the chains: structure and function of the deubiquitinases. Nat. Rev. Mol. Cell Biol. 10, 550–563 (2009).
Cheng, J. et al. Molecular mechanism for USP7-mediated DNMT1 stabilization by acetylation. Nat. Commun. 6, 7023 (2015).
Sheng, Y. et al. Molecular recognition of p53 and MDM2 by USP7/HAUSP. Nat. Struct. Mol. Biol. 13, 285–291 (2006).
Li, Z., Li, D., Tsun, A. & Li, B. FOXP3+ regulatory T cells and their functional regulation. Cell. Mol. Immunol. 12, 558–565 (2015).
Shen, Z., Chen, L., Hao, F. & Wu, J. Transcriptional regulation of Foxp3 gene: multiple signal pathways on the road. Med Res Rev. 29, 742–766 (2009).
Merlo, A. et al. FOXP3 expression and overall survival in breast cancer. J. Clin. Oncol. 27, 1746–1752 (2009).
Winerdal, M. E. et al. FOXP3 and survival in urinary bladder cancer. BJU Int. 108, 1672–1678 (2011).
Hinz, S. et al. Foxp3 expression in pancreatic carcinoma cells as a novel mechanism of immune evasion in cancer. Cancer Res. 67, 8344–8350 (2007).
Triulzi, T., Tagliabue, E., Balsari, A. & Casalini, P. FOXP3 expression in tumor cells and implications for cancer progression. J. Cell Physiol. 228, 30–35 (2013).
Istomine, R., Alvarez, F., Almadani, Y., Philip, A. & Piccirillo, C. A. The deubiquitinating enzyme ubiquitin-specific peptidase 11 potentiates TGF-β signaling in CD4+ T cells to facilitate Foxp3+ regulatory T and TH17 cell differentiation. J. Immunol. 203, 2388–2400 (2019).
Wu, C. et al. USP20 positively regulates tumorigenesis and chemoresistance through β-catenin stabilization. Cell Death Differ. 25, 1855–1869 (2018).
Nan, L. et al. Ubiquitin carboxyl-terminal hydrolase-L5 promotes TGFbeta-1 signaling by de-ubiquitinating and stabilizing Smad2/Smad3 in pulmonary fibrosis. Sci. Rep. 6, 33116 (2016).
Song, C., Liu, W. & Li, J. USP17 is upregulated in osteosarcoma and promotes cell proliferation, metastasis, and epithelial-mesenchymal transition through stabilizing SMAD4. Tumor Biol. 39, 1010428317717138 (2017).
Ling, S. et al. USP22 mediates the multidrug resistance of hepatocellular carcinoma via the SIRT1/AKT/MRP1 signaling pathway. Mol. Oncol. 11, 682–695 (2017).
Tang, B. et al. High USP22 expression indicates poor prognosis in hepatocellular carcinoma. Oncotarget 6, 12654–12667 (2015).
Morgan, M., Ikenoue, T., Suga, H. & Wolberger, C. Potent macrocycle inhibitors of the human SAGA deubiquitinating module. Cell Chem. Biol. 29, 544–554.e544 (2022).
Miao, D. et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359, 801–806 (2018).
Ozaki, T. & Nakagawara, A. Role of p53 in cell death and human cancers. Cancers 3, 994–1013 (2011).
Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat. Rev. Genet. 14, 333–346 (2013).
Chen, B. et al. Harnessing big ‘omics’ data and AI for drug discovery in hepatocellular carcinoma. Nat. Rev. Gastroenterol. Hepatol. 17, 238–251 (2020).
Crow, M., Lim, N., Ballouz, S., Pavlidis, P. & Gillis, J. Predictability of human differential gene expression. Proc. Natl Acad. Sci. USA 116, 6491–6500 (2019).
Gao, Q. et al. Integrated proteogenomic characterization of HBV-related hepatocellular carcinoma. Cell 179, 561–577.e522 (2019).
Hyduke, D. R., Lewis, N. E. & Palsson, B. Ø. Analysis of omics data with genome-scale models of metabolism. Mol. Biosyst. 9, 167–174 (2013).
Yu, J. et al. Epigenetic identification of ubiquitin carboxyl-terminal hydrolase L1 as a functional tumor suppressor and biomarker for hepatocellular carcinoma and other digestive tumors. Hepatology 48, 508–518 (2008).
Ham, S. J. et al. Loss of UCHL1 rescues the defects related to Parkinson’s disease by suppressing glycolysis. Sci. Adv. 7, eabg4574 (2021).
Hawkins, D. M. The problem of overfitting. J. Chem. Inf. Comput. Sci. 44, 1–12 (2004).
Geman, S., Bienenstock, E. & Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992).
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. 32nd Int. Conf. Mach. Learn. ICML 2015 37, 448–456 (2015).
Lever, J., Krzywinski, M. & Altman, N. Model selection and overfitting. Nat. Methods 13, 703–704 (2016).
Zhang, S., Vasishtan, D., Xu, M., Topf, M. & Alber, F. A fast mathematical programming procedure for simultaneous fitting of assembly components into cryoEM density maps. Bioinformatics 26, i261–i268 (2010).
Zahiri, J., Yaghoubi, O., Mohammad-Noori, M., Ebrahimpour, R. & Masoudi-Nejad, A. PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics 102, 237–242 (2013).
Huang, X. & Dixit, V. M. Drugging the undruggables: exploring the ubiquitin system for drug development. Cell Res. 26, 484–498 (2016).
Barabasi, A. L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011).
Yu, J. et al. Simple sequence-based kernels do not predict protein–protein interactions. Bioinformatics 26, 2610–2614 (2010).
Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006).
UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv https://doi.org/10.48550/arXiv.1312.6114 (2014).
Hershey, J. R. & Olsen, P. A. Approximating the Kullback Leibler divergence between Gaussian mixture models. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP) 4, 317–320 (2007).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv https://doi.org/10.48550/arXiv.1503.02531 (2015).
Ying, R., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. GNNExplainer: generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 32, 9240–9251 (2019).
Zhang, Z., Liu, Q., Wang, H., Lu, C. & Lee, C. ProtGNN: Towards self-explaining graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence. 36, 8 (2022).
Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C. F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).
Chatr-Aryamontri, A. et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45, D369–D379 (2017).
Senbabaoglu, Y., Michailidis, G. & Li, J. Z. Critical limitations of consensus clustering in class discovery. Sci. Rep. 4, 6207 (2014).
Liu, Y. et al. A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions. Zenodo https://zenodo.org/records/10866136 (2024).
Acknowledgements
This research was funded by the National Key Research and Development Program of China (2022YFC3401500 to C.C.; 2023YFF1204600, 2020YFE0202200 and 2021YFA1301603 to Dong Li) and the National Natural Science Foundation of China (32271518 and 32088101 to Dong Li).
Author information
Contributions
D.L. (Dong Li) and C.C. directed and designed research; Y.L. (Yuan Liu) and D.L. (Dianke Li) designed, implemented and evaluated the algorithm; Y.L. (Yuan Liu), D.L. (Dianke Li), S.X., Y.Q., Y.L. (Yang Li) and X.L. performed bioinformatics analysis and HCC protein subgrouping analysis; X.Z., L.Z. and C.C. performed experimental validation of predicted DSIs; D.L. (Dong Li), C.C., Y.L. (Yuan Liu), D.L. (Dianke Li) and X.K. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Tzong-Yi Lee, Han Liang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Li, D., Zhang, X. et al. A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions. Nat Commun 15, 4519 (2024). https://doi.org/10.1038/s41467-024-48446-3
DOI: https://doi.org/10.1038/s41467-024-48446-3