Introduction

The human genome comprises approximately 19,890 protein-coding genes, yet not all these proteins serve as suitable drug targets1,2,3. The druggable proteome refers to the subset of proteins capable of binding to an antibody or small molecule with the requisite chemical properties and affinity4. Druggability describes the feature of a target molecule, wherein it induces a favorable clinical response upon interaction with a drug-like compound5. It is noteworthy that an estimated 60% of small molecule drug discovery projects falter during the hit-to-lead phase due to the target’s lack of druggability5,6. Predicting a target’s druggability early in drug discovery is thus crucial. Only about 10% of the human genome consists of druggable targets, and merely half of these are disease-relevant7.

According to Gashaw et al., for a drug target to be considered ideal, it should possess specific properties: an unimpeded operation characterized by the absence of competitive binding, the presence of a biomarker that facilitates monitoring its efficacy, differential expression throughout the body to enable precise targeting, minimal interference with physiological conditions, the ability to alter a disease, and suitability for high-throughput screening7,8.

Of the protein-coding genes in the human genome, roughly 3,000 are believed to be part of the druggable genome. However, drugs approved by the US Food & Drug Administration (FDA) target only about twenty percent of these proteins9. More specifically, the FDA has approved 672 drugs, each classified based on its protein class: enzymes (260; 39%), transporters (149; 22%), G-protein coupled receptors (98; 15%), CD markers (71; 11%), voltage-gated ion channels (49; 7%), and nuclear receptors (24; 4%), to name a few10,11. Drugs that render the protein target inactive are termed antagonists, while those that stimulate the protein target are termed agonists. In terms of the cellular locations of the targets for these FDA-approved drugs, various prediction methods for transmembrane and signal proteins suggest that 250 (37%) were integral to the membrane, 201 (30%) were intracellular, 101 (15%) existed as single-pass transmembrane proteins, 83 (12%) were secreted, 28 (4%) appeared as combined membrane-bound and secreted isoforms, and 9 (1%) were simultaneously integral to the membrane and exhibited a single-pass membrane structure4,10,11.

The limited number of drugs approved to date can be attributed to several factors, including the intricacies of experimenting with all proteins and nucleic acid fragments, a lack of information related to ethnicity, and a limited understanding of many diseases at the molecular level12,13. Given these challenges, there is a significant demand for computational models that can accurately predict drug targets on a genome-wide scale, ensuring both high sensitivity and specificity5. Furthermore, leveraging extensive data sources, such as metabolic and gene regulatory networks, protein–protein interactions, multi-omics datasets, and gene expression profiles, in conjunction with data mining tools like machine learning (ML), can aid in constructing predictive models. These models can discern biologically relevant patterns that indicate druggability in potential drug targets14.

Several classification models have been developed for predicting protein activities, including anti-angiogenic15, anti-cancer16, enzyme classes15, epitopes17, signaling18, lectins19, antioxidants20, and druggability14,21,22,23,24,25,26,27. Thus, the main aim of our study was to build an effective ML classifier to forecast the druggability of cancer-driving proteins, validate them through integrated multi-omics approaches, propose potential druggable proteins per cancer type, and propose potential targeted drugs and metabolites.

Methods

Machine learning prediction model

Figure 1 presents the general flow chart of the proposed methodology to obtain a classifier for druggable proteins. Firstly, we constructed a database with druggable proteins and ‘hard-to-drug’ proteins. Secondly, three families of protein composition descriptors were calculated using Rcpi (R package)28: 20 amino acid composition (AC), 400 di-amino acid composition (DC), and 8,000 tri-amino acid composition (TC). In the next step, Jupyter notebooks with Python scikit-learn29 were used to test 13 types of ML classifiers by combining the 3 families of descriptors (AC, DC, TC) with five different feature selection methods and with different parameters. The employed classifiers include Gaussian Naive Bayes (GNB)30, k-nearest neighbors algorithm (KNN)31, linear discriminant analysis (LDA)32, support vector machine (SVM) both linear and non-linear based on radial basis functions (RBF)33, logistic regression (LR)34, multilayer perceptron (MLP) or neural network with 20 neurons in one hidden layer35, decision tree (DT)36, random forest (RF)37, XGBoost (XGB), an optimized distributed gradient boosting library38, Gradient Boosting for classification (GB)39, AdaBoost classifier (AdaB)40, and Bagging classifier41. The feature selection methods utilized were principal component analysis (PCA)42, feature selection based on a percentile of the highest scores with f_classif (ANOVA F-value between label/feature for classification tasks), a feature selector removing features with variance below a threshold, linear support vector classification, and the extra-trees classifier.
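To make the three descriptor families concrete, the following minimal sketch computes k-mer composition for k = 1, 2, 3 in plain Python. It is not the Rcpi implementation: the function name `kmer_composition`, the toy sequence, the normalization by the number of sliding windows, and the handling of non-standard residues are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return {k-mer: relative frequency} over all 20**k possible k-mers.

    Assumption: frequencies are counts divided by the number of sliding
    windows (len(sequence) - k + 1); windows containing non-standard
    letters are counted as windows but match no descriptor.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = max(len(sequence) - k + 1, 0)
    for i in range(windows):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    if windows == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: c / windows for kmer, c in counts.items()}


toy = "MHMESSSHTD"  # hypothetical toy sequence, not a real protein
ac = kmer_composition(toy, 1)  # 20 AC descriptors
dc = kmer_composition(toy, 2)  # 400 DC descriptors
tc = kmer_composition(toy, 3)  # 8,000 TC descriptors
print(len(ac), len(dc), len(tc))
```

The descriptor counts (20, 400, 8,000) follow directly from 20^k, which is why TC yields far more features than there are proteins in the dataset.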

Figure 1

Flow chart of methodology for druggable cancer-driving protein prediction. AC, amino acid composition; DC, di-amino acid composition; TC, tri-amino acid composition; SMOTE, generation of synthetic data to balance the classes in the dataset; SD, standard deviation; AUROC, area under the receiver operating characteristic curve, the metric used to evaluate model performance; CV, cross-validation.

GNB is a probabilistic classifier based on Bayes' theorem that assumes all features are conditionally independent given the class30. KNN is a non-parametric classifier that assigns an unclassified sample to the majority class among its k nearest samples in the training set (k = 3)31. LDA is a fundamental linear classifier that fits class-conditional densities to the input features and applies Bayes’ rule32. The linear SVM separates the classes with a maximum-margin hyperplane in the original feature space33, while for nonlinear problems the SVM employs a Gaussian radial basis function kernel, which implicitly maps the input features into a higher-dimensional space. LR, another linear classifier, estimates binary response probabilities from a weighted combination of the input features34. The MLP is a neural network of artificial neurons, here with a single hidden layer, capable of integrating both linear and nonlinear activation functions35. DT derives decision rules from the input features, with classification rules defined as paths from the root to a leaf36. RF, an ensemble method, combines decision trees trained in parallel; the individual trees have low bias and high variance, and averaging many weakly correlated trees reduces the variance of the ensemble37. XGB, an ensemble method, builds weak trees sequentially to improve classification performance38. GB is a boosting method for classification that fits weak classifiers sequentially39. AdaB is a meta-estimator that first fits a classifier on the original dataset and then fits additional copies of the classifier with increased weights for misclassified instances40. The Bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and aggregates their predictions41.

The ML prediction model was constructed using two protein sets. The positive set comprised 666 druggable proteins with FDA-approved drugs, as per the DrugBank database (www.drugbank.ca)43 and the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing)44. In contrast, the negative protein set consisted of 219 ‘hard-to-drug’ protein phosphatases, which were previously referred to as ‘undruggable’ targets45. As noted by Xie et al., kinases are classic examples of druggable targets that play a significant role in modulating cell motility. Conversely, phosphatases act as their counterparts, critically regulating cellular dynamics by removing phosphate from proteins, including serine, threonine, and tyrosine residues46,47,48. Detailed information on the gene symbol and gene ID for all druggable and ‘hard-to-drug’ proteins can be found in Supplementary Tables 1 and 2, while Supplementary Tables 3 and 4 provide the FASTA sequences of all proteins analyzed in this study. Lastly, the final ML prediction model was applied to scan 2,339 cancer-driving proteins sourced from the Network of Cancer Genes49 (Supplementary Table 5).

After computing the amino acid composition descriptors, the datasets comprised 885 proteins. Proteins in the druggable class were labeled as 1, while those in the ‘hard-to-drug’ class were labeled as 0. Due to the imbalance in the datasets, we employed the synthetic minority over-sampling technique (SMOTE) as described by Chawla et al.50. We used a threefold cross-validation (CV) approach to construct the ML classifiers. For each fold, a sequential pipeline was executed: (a) Scaling: the training set was standardized using the StandardScaler, and the test set was transformed to match the same scale. (b) Feature Selection or Dimension Reduction: the dimensionality of the training set was either reduced using a feature selection method, such as LinearSVC, or through a dimension reduction technique like Principal Component Analysis (PCA). (c) Cross-Validation Evaluation: The cross_val_score method was employed to compute the area under the receiver operating characteristic (AUROC) scores across the 13 ML methods for all splits. (d) Mean values and standard deviations (SD) of the AUROC scores for each ML classifier were calculated and displayed for the test subset51.
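Steps (c) and (d) reduce each classifier to a mean AUROC and its SD across folds. The sketch below shows that summary in plain Python rather than with scikit-learn; the function `auroc`, the three toy folds, and the use of the population SD (which matches NumPy's default, an assumption about how the SD was computed) are illustrative only.

```python
from statistics import mean, pstdev


def auroc(labels, scores):
    """AUROC as the probability that a random positive (label 1)
    receives a higher score than a random negative (label 0);
    tied scores count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and classifier scores for three folds.
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]),
    ([1, 0, 1, 0], [0.7, 0.7, 0.6, 0.2]),
]

fold_aucs = [auroc(y, s) for y, s in folds]
mean_auc = mean(fold_aucs)
sd_auc = pstdev(fold_aucs)  # population SD, assumed to mirror numpy.std
print(fold_aucs, round(mean_auc, 3), round(sd_auc, 3))
```

A small SD across folds indicates that the ranking quality of the classifier is stable with respect to which proteins land in the held-out split.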

The best model to be used for predictions was chosen using criteria such as mean AUROC, SD of AUROC, the number of features, and the type of model features (original or transformed). All the results obtained can be reproduced by using the scripts available at the GitHub repository: https://github.com/muntisa/machine-learning-for-druggable-proteins.

In addition, the importance of the features for the best model was analyzed using a function that calculates the permutation feature importance. This is done by randomly shuffling each feature and measuring the decrease in the model's performance. This process was repeated 10 times for each feature, and the average importance value was calculated. The result was a list of feature importances, which can be used to identify the most important features in the model. The importance values were normalized (values between 0 and 1), and the top 10% most important features were highlighted. Additionally, an extra analysis of the single amino acid frequencies in all selected features of the best model was conducted, with values also normalized between 0 and 1.
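The permutation procedure described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the study's repository: the toy model, the toy data, the use of accuracy as the performance metric, and min-max scaling as the normalization are all hypothetical choices.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def min_max_normalize(values):
    """Rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def toy_model(row):
    # Hypothetical classifier that looks only at feature 0.
    return 1 if row[0] > 0.5 else 0


X = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]  # two toy features
y = [toy_model(row) for row in X]
raw = permutation_importance(toy_model, X, y)
norm = min_max_normalize(raw)
print(raw, norm)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance is exactly zero, whereas shuffling feature 0 degrades accuracy.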

Target-disease evidence score

Open Targets (https://www.targetvalidation.org) is a platform that provides comprehensive data integration, enabling access to and visualization of potential drug targets associated with cancer52. ChEMBL (https://www.ebi.ac.uk/chembl/) is a database that catalogs bioactive molecules with drug-like properties53. The ChEMBL evidence score denotes a target-disease relationship that is supported by an FDA-approved drug or a clinical candidate drug targeting the gene product in question and indicated for cancer treatment54. In this study, to validate the significance of our previously predicted druggable proteins, we compared the ChEMBL evidence scores of druggable cancer-driving proteins and those of ‘hard-to-drug’ proteins. The scores were then statistically analyzed using the Bonferroni correction test, with a significance threshold set at P < 0.001.
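As a reminder of how the Bonferroni correction interacts with the P < 0.001 threshold, the sketch below multiplies each raw p-value by the number of comparisons. The group labels and p-values are hypothetical and do not come from the study.

```python
from itertools import combinations


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


groups = ["A", "B", "C", "D"]            # hypothetical protein groups
pairs = list(combinations(groups, 2))    # all pairwise comparisons
raw_p = [0.4, 0.0001, 0.00002, 0.0005, 0.00005, 0.3]  # hypothetical values

adjusted = bonferroni_adjust(raw_p)
significant = [p < 0.001 for p in adjusted]
for pair, p_adj, sig in zip(pairs, adjusted, significant):
    print(pair, p_adj, sig)
```

Note that the fourth comparison has a raw p-value below 0.001 but is no longer significant after correction, which is the conservative behavior the correction is designed to provide.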

In our pursuit to identify and prioritize the most critical druggable cancer-driving proteins, we retrieved target-disease evidence scores provided by ten distinct bioinformatic tools. This analytical effort spanned proteins currently undergoing clinical trials as well as those not yet under examination in clinical trials. Regarding these tools, Open Target Genetics (https://genetics.opentargets.org) specializes in identifying trait-causal genes from significant loci in genome-wide association studies (GWAS)55. ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) serves as a repository detailing the relationships between human germline or somatic variants and their associated phenotypes. Its evidence score is based on the clinical significance of a genetic variant56,57. The Genomics England PanelApp (https://panelapp.genomicsengland.co.uk) merges crowdsourced expertise with curation to establish target-cancer relationships58. Cancer Gene Census (https://cancer.sanger.ac.uk/census), part of the Wellcome Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC), aims to catalog genes containing mutations causally linked to cancer59. IntOGen (https://www.intogen.org) offers a methodology to pinpoint potential cancer driver genes, using large-scale mutational data from sequenced tumor samples60. The Cancer Biomarkers database, part of the Cancer Genome Interpreter (http://www.cancergenomeinterpreter.org), features biomarkers relevant to drug sensitivity, resistance, and toxicity for drugs targeting specific cancer entities61. SLAPenrich (https://saezlab.github.io/SLAPenrich/) introduces a novel statistical approach to recognize significantly mutated pathways at the population level across large cohorts of cancer patients62. The Reactome Knowledgebase (https://reactome.org) delineates molecular aspects of various cellular functions like signal transduction and metabolism, identifying reaction pathways impacted by specific cancer types63. 
Phenotype comparisons for DIsease Genes and Models (PhenoDigm) (http://www.sanger.ac.uk/resources/databases/phenodigm), an algorithm provided by the International Mouse Phenotyping Consortium (IMPC), offers insights into gene–disease associations by analyzing phenotype data64. Lastly, an overall score was calculated by integrating the information from all bioinformatic approaches to identify and prioritize essential druggable cancer-driving proteins.

Drugs involved in late-phase clinical trials

The Open Targets platform (https://www.targetvalidation.org), enhanced with ChEMBL annotations, provides an integrated data framework. This enables access and visualization of potential drug targets associated with cancer52,54,65. Furthermore, the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing) is a curated collection of drugs approved by the Food and Drug Administration (FDA). This hub aided us in discerning the mechanisms of action for drugs employed in cancer treatments44. As a result, we analyzed the druggable cancer-driving proteins with a ChEMBL evidence score of > 0.9 to map out the therapeutic landscape for drugs in phase III and IV clinical trials.

Distance score of shortest pathways to cancer hallmark phenotypes

CancerGeneNet (https://signor.uniroma2.it/CancerGeneNet/) is a curated bioinformatics resource provided by the SIGnaling Network Open Resource (SIGNOR 3.0)66,67. This platform uses experimental annotations to bridge two interaction layers crucial to cell physiology, connecting proteins influenced by cancer drivers with proteins that impact on the hallmarks of cancer66,68. To elucidate the dynamics of these interactions, the procedure for calculating the distance score for the shortest pathways is outlined as follows: a) initiate a path query between two nodes; b) within the path string, each step is characterized by a pair of nodes and an edge, representing the nature of the interaction (e.g., activation or inhibition); c) the ‘distance’ parameter calculates the path length, incorporating the reliability of each step. Each step's reliability score, denoted as 'r', is derived from supporting evidence extracted from the STRING database69. This score is converted into a distance using the equation: \(d=1-r\). The final path score, represented as \({D}_{path}={\sum }_{rel=1}^{N} \left(1-{r}_{rel}\right)\), is the sum of each step distance, with 'N' standing for the total number of steps in a path67,70.
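The distance score can be illustrated with a small shortest-path search in which each edge is weighted by d = 1 − r. The sketch below uses Dijkstra's algorithm from the Python standard library; it is not the CancerGeneNet or igraph implementation, and the node names and reliability values are hypothetical.

```python
import heapq


def shortest_distance(edges, source, target):
    """Dijkstra over directed edges (u, v, r) weighted by d = 1 - r.

    Returns (D_path, path), where D_path is the sum of step distances.
    """
    graph = {}
    for u, v, r in edges:
        graph.setdefault(u, []).append((v, 1.0 - r))
    best = {source: 0.0}
    queue = [(0.0, source, [source])]
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, d in graph.get(node, []):
            new_dist = dist + d
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical signaling edges: (source, target, reliability r).
edges = [
    ("PROT_A", "PROT_B", 0.9),
    ("PROT_B", "PROLIFERATION", 0.8),
    ("PROT_A", "PROLIFERATION", 0.5),
    ("PROT_A", "PROT_C", 0.95),
    ("PROT_C", "PROLIFERATION", 0.9),
]

d_path, path = shortest_distance(edges, "PROT_A", "PROLIFERATION")
print(round(d_path, 3), path)
```

In this toy network the two-step route through PROT_C (0.05 + 0.10) beats the direct but poorly supported edge (0.50), showing that a longer path of reliable steps can score as "shorter" than a single unreliable one.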

Iannuccelli et al. implemented a programmatic approach to calculate the shortest distance scores, or paths, between specific proteins and cancer phenotypes using the ‘shortest path’ function from the igraph R package. Our primary aim was to probe the signaling nexus between druggable cancer-driving proteins (not yet involved in clinical trials and with a ChEMBL evidence score = 0) and the hallmarks of cancer71.

Within this framework, we determined the shortest paths for both positive and negative regulations of druggable cancer-driving proteins linked to angiogenesis, immortality, inflammation, metastasis, proliferation, cell death, differentiation, DNA repair, and glycolysis. We then carried out multiple comparison tests, employing the Bonferroni correction (P < 0.001, 95% confidence interval), to compare the distance scores of druggable cancer-driving proteins across different cancer phenotypes. Lastly, we ranked these druggable proteins with the shortest paths to each cancer hallmark phenotype.

Chemistry-based score

canSAR (http://cansar.icr.ac.uk) is a comprehensive knowledgebase dedicated to drug discovery. It integrates data from genomics, proteomics, pharmacology, drugs, and chemicals with structural proteins and protein networks72. This bioinformatic resource encompasses the complete human proteome (20,375 sequences) sourced from the UniProt Swiss-Prot database73. Additionally, canSAR provides an extensive structure-based ligandability assessment, covering more than 4.5 million cavities72. The chemistry-based score is categorized into four levels: low (0–24%), suggesting the protein is less likely to be a successful drug target; moderate (25–49%), indicating a moderate probability of druggability; high (50–74%), suggesting the protein has a good probability of being druggable; and very high (75–100%), indicating the protein is very likely to be druggable and is often considered a high priority for drug development due to its high probability of successful binding with drugs72. Using this data, we retrieved the chemistry-based score to validate our machine learning prediction method and prioritize key druggable cancer-driving proteins for each cancer type.
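The four score levels map directly onto a small lookup function. The sketch below is illustrative only; the function name is hypothetical, and the treatment of non-integer values between the stated ranges (for example 24.5%) is an assumption.

```python
def chemistry_score_level(score_percent):
    """Map a chemistry-based score (0-100%) to its level.

    Assumption: fractional values fall into the lower level until the
    next stated lower bound is reached (e.g. 24.5 -> "low").
    """
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent < 25:
        return "low"
    if score_percent < 50:
        return "moderate"
    if score_percent < 75:
        return "high"
    return "very high"


for value in (10, 25, 74, 75, 100):
    print(value, chemistry_score_level(value))
```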

A pathology atlas for human cancer

The Human Pathology Atlas, available at (https://www.proteinatlas.org/humanproteome/pathology), is an integral component of the Human Protein Atlas project. This atlas explores the prognostic relevance of druggable cancer-driving genes/proteins across 17 The Cancer Genome Atlas (TCGA) PanCancer types in almost 8,000 patients74,75. Anchored in transcriptomics and antibody profiling, the atlas emerges as an essential tool for tailoring cancer treatments based on precision oncology74. Immunohistochemistry (IHC) stands as the gold standard method for in situ protein expression analysis in tissue samples. The combination of IHC and tissue microarray (TMA) technology allows simultaneous analysis of hundreds of tissue samples with an unprecedented degree of experimental standardization76.

The Atlas provides staining profiles for proteins in human tumor tissues, generated through the synergy of IHC and TMA techniques. This is complemented by Kaplan–Meier analysis, linking mRNA expression levels to patient survival. Patient samples were classified into two expression groups and the correlation between expression level and patient survival was examined. Using Kaplan–Meier survival estimators, the prognosis of different patient cohorts was determined. Log-rank tests were employed to compare these results. Genes/proteins with marked correlations to detrimental outcomes (log-rank P-values < 0.001) in the Kaplan–Meier evaluations were pinpointed as unfavorable prognostic indicators across TCGA PanCancer types77.
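For readers unfamiliar with the Kaplan–Meier estimator, the following minimal sketch computes the survival curve for one expression group in plain Python. It is not the Human Pathology Atlas pipeline: the follow-up times and event flags are hypothetical, and the log-rank comparison between groups is not included.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    events: 1 = death observed, 0 = censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical high-expression group: follow-up in years, event flags.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients leave the risk set without lowering the curve, which is why the survival estimate changes only at the times where a death is observed.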

Functional enrichment analysis

Functional enrichment analysis provides researchers with curated insights and a deeper understanding of protein sets derived from omics-scale experiments. For our study, we focused on druggable cancer-driving proteins that have not yet entered clinical trials, as indicated by a ChEMBL evidence score of 0. These proteins also show an unfavorable prognosis across various TCGA PanCancer types. To evaluate enrichment, we employed g:Profiler version e101_eg48_p14_baf17f0, accessible at (https://biit.cs.ut.ee/gprofiler/gost)78. Our objective was to pinpoint significant annotations, following the Benjamini–Hochberg FDR q < 0.001 criteria, related to Gene Ontology (GO) biological processes (http://geneontology.org/)79 and Reactome signaling pathways (https://reactome.org/)63. The results of the functional enrichment analysis were visualized using a Manhattan plot, and significant terms associated with cancer hallmark phenotypes were manually curated80.
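The Benjamini–Hochberg adjustment behind the q < 0.001 criterion can be sketched as follows. This is a generic textbook implementation, not the g:Profiler code, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


raw_p = [0.0001, 0.0004, 0.03, 0.5]  # hypothetical enrichment p-values
q_values = benjamini_hochberg(raw_p)
significant = [q < 0.001 for q in q_values]
print(q_values, significant)
```

Unlike the Bonferroni correction, each p-value is scaled by m divided by its rank, so the procedure controls the expected proportion of false discoveries rather than the chance of any single false positive.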

The oncogenic variome of key druggable cancer-driving proteins

Identifying the oncogenic variome of druggable cancer-driving proteins encompassed two primary steps. First, we obtained 22,320 single nucleotide and insertion/deletion variants from the 23 key druggable cancer-driving proteins. This data was retrieved from 76,156 genomes belonging to the Genome Aggregation database (gnomAD v3.2.1) (https://gnomad.broadinstitute.org/), and using the GRCh38/hg38 human reference genome1,81,82. Second, we performed the oncodriveMUT and boostDM methods integrated within the Cancer Genome Interpreter (CGI) platform (https://www.cancergenomeinterpreter.org) to evaluate the tumorigenic potential of the acquired genomic variants. This approach enabled us to categorize driver variants into known, predicted, and passenger classifications based on the Catalog of Validated Oncogenic Mutations61,83. OncodriveMUT is a rule-based strategy that analyzes genomic features, such as regions depleted by germline variants, gene mechanisms of action, gene signals of positive selection, and clusters of somatic mutations. Conversely, boostDM is a machine learning strategy that assesses the oncogenic potential of mutations in human tissues by employing in silico saturation mutagenesis of cancer genes61,83.

Deleteriousness of the oncogenic variome

The Combined Annotation-Dependent Depletion (CADD) tool, version 1.4 (https://cadd.gs.washington.edu/), is a bioinformatic resource that assesses the deleterious effects of diverse gene mutations within the human genome. It integrates over 60 genomic features to evaluate the impact of single nucleotide and insertion/deletion variants84. The CADD framework analyzes multiple annotations by comparing natural selection against simulated mutations, using the GRCh38/hg38 human reference genome85. For this study, we performed CADD to determine the deleteriousness of both known and predicted oncogenic variants associated with pivotal druggable cancer-driving genes. Lastly, the CADD deleteriousness scores were categorized as very high (30–50), high (25–30), medium (15–25), low (10–15), and very low (0–10).
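The deleteriousness categories share their boundary values (for example, 30 ends one range and starts the next), so any implementation has to pick a convention. The sketch below assumes that each lower bound is inclusive and that scores above 50 are also treated as very high; both choices, and the function name, are assumptions rather than the study's stated rule.

```python
def cadd_category(phred_score):
    """Map a CADD PHRED-scaled score to a deleteriousness category.

    Assumptions: lower bounds are inclusive (30 -> "very high",
    25 -> "high", ...), and scores above 50 remain "very high".
    """
    if phred_score < 0:
        raise ValueError("CADD PHRED scores are non-negative")
    if phred_score >= 30:
        return "very high"
    if phred_score >= 25:
        return "high"
    if phred_score >= 15:
        return "medium"
    if phred_score >= 10:
        return "low"
    return "very low"


for score in (3.2, 12.0, 22.5, 27.1, 34.0):  # hypothetical variant scores
    print(score, cadd_category(score))
```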

Artificial intelligence prediction of drugs and metabolites

To investigate the potential interactions of current drugs (for drug repurposing) and metabolites with the best-predicted proteins, an artificial intelligence (AI)-based tool called Protein–Ligand Binding Affinity Prediction Using Pretrained Transformers (PLAPT) was used to predict the interaction affinity (or negative log10 affinity)86. PLAPT (https://github.com/trrt-good/WELP-PLAPT/) predicts the binding affinity of ligand–protein complexes using the SMILES code of ligands and the sequence of proteins. The model employs pre-trained transformers such as ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model.

Using the PLAPT Python package, all ligand–protein affinities were calculated. Two families of ligands were used: 2,466 ChEMBL approved drugs with masses between 100 and 500 (downloaded via a Python script) for potential drug repurposing against multiple predicted druggable proteins, and 217,776 molecules from the Human Metabolome Database (HMDB) (https://hmdb.ca/downloads) against a single important predicted druggable protein (due to computational limits)87.

Results and discussion

Machine learning prediction model

The current study introduces innovative classification models designed to predict new druggable proteins that drive cancer. These predictions are based on three sets of protein sequence descriptors (amino acid composition, di-amino acid composition, and tri-amino acid composition), calculated using Rcpi. These descriptors were chosen for their proven ability to capture essential information about protein sequences that are critical for predicting druggability88.

AC effectively represents a protein's primary structure by highlighting the frequency of each amino acid within the sequence, helping to identify general trends and patterns associated with druggable proteins. DC captures the local interactions between pairs of amino acids, providing insight into the secondary structure and local folding patterns, which are crucial for understanding functional regions and binding sites. TC considers interactions between triplets of amino acids, offering a more detailed view of the amino acid sequence, which is essential for accurately predicting protein interactions with drugs and other ligands30,89.

Focusing on these features ensures computational efficiency and reduces the risk of overfitting, which can occur with an excessive number of features. Our comprehensive benchmarking demonstrated that these descriptors consistently provided robust performance across various machine learning classifiers. While the inclusion of additional features, such as secondary structure elements or solvent accessibility, might offer incremental benefits, the chosen descriptors strike an optimal balance between model performance, computational feasibility, and biological relevance. This balance allows for effective and interpretable predictions while maintaining the practicality of the computational framework. Furthermore, the identified amino acid sequence patterns will inform future studies on protein properties.

Subsequently, we utilized Jupyter notebooks built on Python and scikit-learn to construct 13 types of ML classifiers (GNB, KNN, LDA, SVM linear, SVM, LR, MLP, DT, RF, XGB, GB, AdaB, and Bagging), along with five types of feature selection methods with various parameters (Fig. 1). All scripts used the mean AUROC values from threefold cross-validation to quantify classification performance. We tested models using 20, 100, 200, and 400 features30.

Figure 2 illustrates the AUROC values for classifiers using only 20 features: AC descriptors without feature selection, DC descriptors with LinearSVC feature selection (DC-LinearSVC20), PCA features from DC (DC-PCAn20), TC descriptors selected by SelectPercentile(f_classif, percentile = 0.25) (TC-Percn20), and TC descriptors selected with LinearSVC (TC-LinearSVC20). Notably, using only the 20 AC descriptors with SVM yielded an AUROC of 0.926. The best performance was achieved using SVM (RBF) with 20 PCA components from 400 DC descriptors, resulting in an AUROC of 0.958. Additional results can be found in Supplementary Table 6.

Figure 2

Mean AUROC values for classifiers obtained with 20 selected features (threefold CV). GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, linear support vector machine; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; AC, amino acid composition; DC, di-amino acid composition; TC, tri-amino acid composition.

Figure 3 displays AUROC values for classifiers using 100 features: PCA transformation of 400 DC descriptors (DC-PCAn100), TC descriptors selected with SelectPercentile(f_classif, percentile = 1.25) (TC-Perc1.25), TC descriptors selected with LinearSVC (TC-LinearSVC100), and 100 features selected by LinearSVC from 200 PCA components of 8,000 TC descriptors (TC-PCA200LinearSVC100). Increasing the number of features from 20 to 100 (a fivefold increase) improved the AUROC to 0.976 for TC-PCA200LinearSVC100 with the same SVM (RBF) classifier30.

Figure 3

Mean AUROC values for classifiers based on 100 selected features (threefold CV). GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, linear support vector machine; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; DC, di-amino acid composition; TC, tri-amino acid composition.

Figure 4 shows the AUROC values for classifiers using 200 selected features (double the number from 100): PCA transformation of 400 DC descriptors (DC-PCAn200), DC descriptors selected with SelectPercentile (DC-Perc50), 200 PCA components of 8,000 transformed TC descriptors (TC-PCAn200), TC descriptors selected with SelectPercentile(f_classif, percentile = 2.5) (TC-Perc2.5), and TC descriptors with LinearSVC (TC-LinearSVC200). The combination of PCA and SVM for DC-PCAn200 resulted in the best classifier, achieving an AUROC of 0.981 (Supplementary Table 6). Further, using all 400 DC descriptors with SVM, the mean AUROC reached 0.982 ± 0.0021. Additionally, with all 8,000 untransformed TC descriptors and linear SVM, the mean AUROC was 0.992 ± 0.0028. However, a classification model should avoid having more input features than data instances, and the 8,000 TC descriptors far exceed the 885 proteins in the dataset. We also sought to prioritize untransformed descriptors over PCA transformations. As a compromise, we selected the following as the best model for subsequent predictions on cancer-driving proteins: 200 TC descriptors selected with LinearSVC, combined with a non-linear SVM classifier, which achieved an AUROC of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The list of the 200 selected features is available in the Jupyter notebooks.
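The percentile settings quoted for SelectPercentile follow from simple arithmetic on the descriptor counts, as the short check below shows. The helper name is hypothetical, the reading of "DC-Perc50" as a 50th-percentile setting is an assumption, and the calculation ignores how tied scores are handled by scikit-learn.

```python
def features_kept(n_features, percentile):
    """Approximate number of features retained when keeping the given
    top percentile (in percent) of n_features scored features."""
    return round(n_features * percentile / 100)


settings = {
    "TC, percentile = 0.25": (8000, 0.25),
    "TC, percentile = 1.25": (8000, 1.25),
    "TC, percentile = 2.5": (8000, 2.5),
    "DC, percentile = 50 (assumed)": (400, 50),
}
kept = {name: features_kept(n, pct) for name, (n, pct) in settings.items()}
for name, count in kept.items():
    print(name, "->", count, "features")
```

These values match the 20-, 100-, and 200-feature budgets used to compare the feature selection methods on an equal footing.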

Figure 4

Mean AUROC of classifiers based on 200 input features (threefold CV). GNB, Gaussian Naive Bayes; KNN, k-nearest neighbors algorithm; LDA, linear discriminant analysis; SVM linear, linear support vector machine; LR, logistic regression; MLP, multilayer perceptron; DT, decision tree; RF, random forest; XGB, XGBoost; GB, gradient boosting; AdaB, AdaBoost classifier; Bagging, Bagging classifier; DC, di-amino acid composition; TC, tri-amino acid composition.

Selected features analysis

The following is the list of selected features for the best model: NRA, QRA, INA, MCA, YEA, THA, CSA, VYA, KNR, WDR, TER, PQR, YGR, EHR, LIR, VSR, ERN, MDN, SDN, LHN, YIN, FFN, RSN, QSN, FWN, ACD, WCD, MED, CHD, SHD, MLD, SMD, WPD, SSD, HTD, DWD, VYD, KNC, NDC, IHC, VHC, GYC, MCE, NHE, ALE, HME, LPE, AWE, EYE, QYE, GVE, FVE, SAQ, FNQ, MDQ, PCQ, WEQ, RQQ, NGQ, HLQ, RMQ, DFQ, GPQ, DSQ, YSQ, AWQ, RVQ, QRG, HGG, TGG, KLG, NKG, FPG, SSG, RTG, PTG, IVG, CDH, FDH, PDH, TQH, KHH, FHH, IFH, NSH, WSH, FWH, WRI, NDI, EDI, FEI, WEI, WQI, MGI, PMI, AAL, EKL, IKL, FKL, GPL, ESL, DVL, MVL, VVL, GNK, HNK, HDK, HCK, EQK, DHK, QLK, EKK, SMK, FFK, QSK, EWK, AVK, WRM, WNM, REM, WQM, SHM, LLM, SMM, NFM, TSM, RWM, GYM, KYM, VYM, HVM, IVM, LDF, YQF, NGF, HGF, FWF, FAP, FNP, PEP, SQP, QGP, VHP, PLP, HKP, NPP, QPP, STP, TTP, KWP, YWP, SRS, HDS, WDS, HCS, LES, DHS, SHS, PSS, SSS, LWS, LAT, DRT, GRT, IRT, INT, VQT, NLT, CLT, KKT, YTT, QWT, FYT, KCW, QGW, VGW, MIW, IKW, RFW, DFW, HVW, KVW, NRY, CHY, DMY, YPY, YAV, SRV, ENV, HNV, GEV, QGV, HGV, TGV, WHV, LLV, IMV, DSV, TSV, QYV. The normalized importance for the 10% selected features is presented in Table 1. The most important amino acid patterns for this classification are HME, NSH, SSS, HTD, DHK, ERN, NDI, DRT, VYD, FFN, SHM, NDC, RFW, WRI, GYC, MGI, PEP, GVE, DSQ, and LLV. The HME pattern is the most important feature for druggable proteins, while the NSH pattern has only half the importance of HME.

Table 1 Feature importance for 10% of the selected features of the best classification model.

In Table 2, the frequencies of the amino acids in all selected features demonstrate the importance of H (histidine), S (serine), D (aspartic acid), and Q (glutamine) in classifying druggable proteins. Additionally, H and S appear in the first five most important tri-amino acid patterns. The biological significance of the amino acids in these patterns is outlined below: (a) HME (histidine–methionine–glutamic acid): Histidine is essential for protein synthesis and enzyme catalysis, methionine is the initiator amino acid for protein translation, and glutamic acid is involved in neurotransmission and protein folding; (b) NSH (asparagine–serine–histidine): Asparagine is crucial for glycoprotein synthesis and serine is involved in phosphorylation and protein structure; (c) SSS (serine–serine–serine): serine is essential for cell signaling, protein synthesis, and metabolism; (d) HTD (histidine–threonine–aspartic acid): Threonine is important for protein stability and immune function, and aspartic acid contributes to protein structure and function; and (e) DHK (aspartic acid–histidine–lysine): Lysine is essential for protein synthesis and collagen formation5,90,91.

Table 2 Frequencies of the amino acids in the selected tri-amino acids groups for the best classification model.
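As an illustration of how residue frequencies of this kind can be tallied, the short Python sketch below counts the amino acids in the 20 most important tri-amino acid patterns named in the text. Because it covers only these 20 patterns rather than all 200 selected features, its counts are not the values reported in Table 2.

```python
from collections import Counter

# The 20 most important tri-amino acid patterns listed in the text.
top_patterns = [
    "HME", "NSH", "SSS", "HTD", "DHK", "ERN", "NDI", "DRT", "VYD", "FFN",
    "SHM", "NDC", "RFW", "WRI", "GYC", "MGI", "PEP", "GVE", "DSQ", "LLV",
]

# Concatenate the patterns and count each one-letter residue code.
residue_counts = Counter("".join(top_patterns))

for residue, n in residue_counts.most_common(5):
    print(residue, n)
```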

Cancer-driving proteins

We transformed 2,339 cancer-driving proteins into molecular descriptors, converting each protein sequence into the 200 selected TC descriptors, and used the best model to predict their druggability. As a result, 2,080 (88.9%) of these cancer-driving proteins were predicted to have druggable activity (Fig. 5A and Supplementary Table 5). For validation, we compared the ChEMBL evidence scores of proteins involved in clinical trials54, distinguishing among the positive set of druggable proteins (mean score = 0.712), druggable cancer-driving proteins (class 1, mean score = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean score = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean score = 0.414). As expected, comparisons with the Bonferroni correction revealed no significant difference between the positive set and druggable cancer-driving proteins, nor between the negative set and ‘hard-to-drug’ cancer-driving proteins. They did reveal a significant difference between druggable cancer-driving proteins (class 1) and ‘hard-to-drug’ cancer-driving proteins (class 0) (P < 0.001) (Fig. 5B). This indicates that the predicted druggable cancer-driving proteins have stronger clinical-trial evidence as targets than the predicted ‘hard-to-drug’ ones. These findings support the ability of the prediction model to distinguish druggable targets from those that are more challenging to target therapeutically, highlighting its potential utility in the drug discovery process.
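For readers unfamiliar with the Bonferroni correction, the following Python sketch shows the adjustment itself: each raw P-value is multiplied by the number of comparisons and capped at 1. The raw P-values are placeholders, not results from this study, and treating the four groups as six pairwise comparisons is an assumption about the test design.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Four groups give six pairwise comparisons (assumed design).
groups = ["positive set", "class 1", "class 0", "negative set"]
n_comparisons = len(list(combinations(groups, 2)))

# Placeholder raw p-values, one per comparison; NOT values from the study.
raw_p = [0.5, 0.02, 0.004, 0.0001, 0.2, 0.009]
adjusted_p = bonferroni(raw_p)

for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw = {raw:<7} adjusted = {adj:.4g}  significant = {adj < 0.05}")
```

The correction is conservative: a comparison that is nominally significant before adjustment can lose significance once the number of tests is taken into account.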

Figure 5

Target-disease evidence score for predicted druggable cancer-driving proteins. (A) A bean plot illustrating the distribution of prediction scores (mean = 0.796) for 2,339 (100%) cancer-driving proteins. Out of these, 2,080 (88.9%) proteins were classified as druggable (class 1), while 259 (11.1%) were predicted as ‘hard-to-drug’ (class 0). (B) Bean plots present the distribution of ChEMBL evidence scores (https://www.ebi.ac.uk/chembl)53 for various categories: the positive set of druggable proteins (mean = 0.712), druggable cancer-driving proteins (class 1, mean = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean = 0.414). These plots show the distribution of ChEMBL evidence scores that represent the involvement of proteins in clinical trials. (C) A heat map displaying druggable cancer-driving proteins in clinical trials with ChEMBL evidence scores exceeding 0.9. The map also incorporates the target-disease evidence scores from ten unique bioinformatic tools: Open Targets Genetics (https://genetics.opentargets.org)55, ClinVar (germline) and ClinVar (somatic) (https://www.ncbi.nlm.nih.gov/clinvar/)56,57, Genomics England PanelApp (https://panelapp.genomicsengland.co.uk)58, Cancer Gene Census (https://cancer.sanger.ac.uk/census)59, IntOGen (https://www.intogen.org)60, Cancer Biomarkers (http://www.cancergenomeinterpreter.org)61, SLAPenrich (https://saezlab.github.io/SLAPenrich/)62, Reactome (https://reactome.org)63, and IMPC (http://www.sanger.ac.uk/resources/databases/phenodigm)64. (D) Another heat map showcases druggable cancer-driving proteins not yet participating in clinical trials with ChEMBL evidence scores equal to 0. It too incorporates target-disease evidence scores from the aforementioned bioinformatic tools. (E) Box plots provide a ranking of bioinformatic tools based on their mean target-disease scores. This analysis focuses on druggable cancer-driving proteins that have not been part of clinical trials and have ChEMBL scores of 0.

Following the prediction and validation of the 2,080 druggable cancer-driving proteins, we extracted the target-disease evidence scores from the Open Targets platform. This was done to prioritize the most relevant druggable cancer-driving proteins already involved in late-stage clinical trials (ChEMBL score > 0.9) and those not yet involved in clinical trials (ChEMBL score = 0)52,53,54. The target-disease evidence score encompassed data from various bioinformatic tools, including Open Targets Genetics55, ClinVar (covering germline and somatic variants)56,57, Genomics England PanelApp58, Cancer Gene Census59, IntOGen60, the Cancer Biomarkers database61, SLAPenrich62, the Reactome Knowledgebase63, and PhenoDigm64. This overall score, derived from an integration of these bioinformatic approaches, enabled us to identify proteins strongly associated with cancer traits. Of these, 52 were druggable cancer-driving proteins involved in late-phase clinical trials (Fig. 5C and Supplementary Tables 7 and 8), and 296 were druggable cancer-driving proteins not yet involved in clinical trials (Fig. 5D and Supplementary Tables 7 and 9). Furthermore, the five bioinformatic approaches yielding the highest target-disease evidence scores for the 296 druggable proteins not yet in clinical trials were Cancer Gene Census (mean = 0.90), SLAPenrich (0.88), Reactome (0.84), Genomics England PanelApp (0.79), and Cancer Biomarkers (0.77) (Fig. 5E).

Drugs involved in late-phase clinical trials

Figure 6 presents an update on phase III and IV clinical trials involving drugs that target cancer-driving proteins, as cataloged by the Open Targets Platform52. The Sankey plot in the figure reveals a total of 257 clinical trial events, involving 94 drugs with 38 different mechanisms of action, which target 52 key cancer-driving proteins across 26 types of cancer (Supplementary Table 10). The most frequently involved drugs in these late-phase clinical trials were regorafenib, binimetinib, pazopanib, and sorafenib. The mechanisms of action most common in these trials included FGFR inhibitors, FLT3 inhibitors, MEK inhibitors, and EGFR inhibitors. The cancer-driving proteins most frequently targeted in the trials were GABRB2, MAP2K1, and MAP2K2. Additionally, the cancer types most commonly evaluated in these late-phase clinical trial events were liver cancer, lung cancer, breast cancer, leukemia, and colorectal cancer. This comprehensive therapeutic landscape has enabled us to identify key patterns and trends in cancer treatment research.

Figure 6

Panoramic landscape of the druggable cancer-driving proteins and the drugs currently in phase III and IV clinical trials. The Sankey plot displays the 257 late-stage clinical trial events. These encompass 52 druggable cancer-driving proteins (with ChEMBL evidence score exceeding 0.9) that are targeted by 94 distinct drugs. These drugs operate through 38 different mechanisms of action and are tested across 26 cancer types. The proteins with the most clinical trial events are GABRB2 (n = 14), MAP2K1 (n = 10), and MAP2K2 (n = 10). The drugs most frequently involved in trial events are regorafenib (n = 24), binimetinib (n = 10), and sorafenib (n = 10). The mechanisms of action most represented in trials are FGFR inhibitors (n = 37), FLT3 inhibitors (n = 24), and MEK inhibitors (n = 20). The cancer types with the highest number of clinical trial events were liver cancer (n = 40), lung cancer (n = 36), and breast cancer (n = 31). Data of clinical trials and mechanisms of action were taken from the Open Targets Platform (https://platform.opentargets.org/)52, and the Drug Repurposing Hub (https://clue.io/repurposing)44. Lastly, Sankey plots were designed using the SankeyMATIC software (https://sankeymatic.com/ and https://github.com/nowthis/sankeymatic).

Shortest pathways to cancer hallmark phenotypes

After identifying 296 druggable proteins not yet involved in clinical trials, we conducted multi-omics analyses to prioritize the most relevant cancer-driving proteins as potential therapeutic targets across various cancer types70,80,92,93,94. In this context, we employed the CancerGeneNet software and found that 184 (62%) of these proteins showed distance scores indicative of their involvement in the shortest pathways leading to cancer hallmark phenotypes66,67, as detailed in Supplementary Table 11. Figures 7A and B illustrate these druggable proteins and their shortest paths to cancer hallmarks. The top three hallmarks are cell proliferation (with a mean distance score of 1.27 and 154 proteins involved), cell differentiation (1.51; 160), and resistance to cell death (1.55; 157) (Supplementary Table 12). Utilizing the Bonferroni correction test, we observed that these druggable proteins had significantly shorter paths to these cancer hallmark phenotypes (P < 0.001). These findings are highly relevant because the prioritized druggable proteins in this analysis could be crucial targets for focusing new therapeutic strategies on processes such as cell proliferation or resistance to cell death.

Figure 7

Prioritization of key druggable cancer-driving proteins through multi-omics analyses. (A) Box plots display the mean distance scores of the shortest pathways associated with each cancer hallmark phenotype. Additionally, the Bonferroni correction, a method for multiple comparison testing (P < 0.001), was employed to highlight significant differences among the cancer phenotypes. Analysis of the shortest paths to cancer hallmark phenotypes reveals that 184 druggable proteins are closely associated with cell proliferation, cell differentiation, resistance to cell death, glycolysis, metastasis, inflammation, genome instability, immortality, and angiogenesis. (B) The analysis further indicates that out of these druggable proteins, 64 (34.8%) have the shortest paths to nine cancer hallmarks, 63 (34.2%) to eight hallmarks, 29 (15.8%) to seven hallmarks, 17 (9.2%) to six hallmarks, 6 (3.3%) to five hallmarks, 4 (2.2%) to four hallmarks, and 1 (0.5%) to one hallmark. These shortest paths to cancer hallmark phenotypes were analyzed using data from CancerGeneNet (https://signor.uniroma2.it/CancerGeneNet/)66. (C) A box plot shows the chemistry-based score (as a percentage) for the 184 druggable cancer-driving proteins. The ligandability analysis reveals that 79 (43%) of these proteins have high scores (> 69.9%). This chemistry-based score was analyzed using data from canSAR (http://cansar.icr.ac.uk)72. (D) This dot plot highlights the prioritization of 23 key druggable cancer-driving genes/proteins, identified based on a prediction of druggability higher than 0.7, a chemistry-based score above 70%, and unfavorable prognostic significance (significant log-rank P-value < 0.001) across 16 TCGA PanCancer types, according to data from the Human Protein Atlas platform (https://www.proteinatlas.org/)74. (E) Functional enrichment analysis of these 23 key druggable cancer-driving proteins is visualized through a Manhattan plot. This analysis demonstrates the most significant (Benjamini–Hochberg method, FDR q-value < 0.001) biological processes and Reactome signaling pathways involved in cancer. The enrichment analysis was conducted using g:Profiler software (https://biit.cs.ut.ee/gprofiler/gost)78.

Chemistry-based score

canSAR is a comprehensive knowledgebase dedicated to drug discovery and offers an extensive structure-based ligandability assessment72. Consequently, we retrieved the chemistry-based scores for the previously prioritized 184 proteins. The mean chemistry-based score of these 184 proteins was 69.9%. In our analysis, we considered all proteins with a ligandability score higher than the mean (cutoff > 69.9%), encompassing all proteins with the very high scores and the best proteins with high scores. This analysis enabled us to identify 79 (43%) druggable cancer-driving proteins with the highest ligandability, as shown in Fig. 7C and Supplementary Table 13. Ligandability analysis refers to a protein’s ability to bind efficiently to a drug. High ligandability helps identify and prioritize proteins that can be effective targets for new drugs, thereby increasing the specificity of the drug’s action and reducing the time and cost associated with pharmaceutical development95.
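The mean-based cutoff described above can be sketched in a few lines of Python. The protein names and scores below are hypothetical placeholders rather than canSAR values; only the selection rule (keep proteins scoring strictly above the mean) is taken from the text.

```python
from statistics import mean

# Hypothetical chemistry-based scores in percent; NOT values from canSAR.
scores = {
    "PROT_A": 92.0,
    "PROT_B": 71.5,
    "PROT_C": 55.0,
    "PROT_D": 61.5,
}

# The cutoff is the mean score across all proteins under consideration.
cutoff = mean(scores.values())

# Keep only proteins whose score is strictly above the mean.
highly_ligandable = sorted(name for name, s in scores.items() if s > cutoff)

print(f"cutoff = {cutoff:.1f}%")
print(highly_ligandable)
```

A mean-based threshold adapts to the score distribution of the proteins being ranked, so the number of proteins retained depends on that distribution rather than on a fixed percentage.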

A pathology atlas for human cancer

We explored the Human Pathology Atlas, developed by the Human Protein Atlas program, and subsequently conducted a Kaplan–Meier analysis to examine the correlation between mRNA and protein expression and patient survival74,75,76,77. This analysis aimed at determining the prognostic significance of 79 highly ligandable, druggable cancer-driving genes/proteins (Supplementary Table 14). Our findings underscore the effectiveness of large-scale systems biology projects that utilize publicly available resources. In this study, we identified 23 key druggable cancer-driving genes/proteins that demonstrated unfavorable prognostic significance (significant log-rank P-value < 0.001) across 16 TCGA PanCancer types. These genes/proteins were CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4 (Fig. 7D and Supplementary Table 15).

Functional enrichment analysis

We conducted a functional enrichment analysis of the 23 key druggable cancer-driving proteins using g:Profiler software78. The Manhattan plot enabled us to identify 64 GO biological processes79 and 2 Reactome signaling pathways63 (Fig. 7E and Supplementary Table 16). The most significant annotations, adjusted with the Benjamini–Hochberg correction and an FDR q-value < 0.001, included cell cycle, cell communication, phosphorylation, immune system process, programmed cell death, cell differentiation, cellular senescence, endocrine resistance, G1 phase, and cyclin D events in G1. Notably, these 23 key druggable cancer-driving proteins are involved in biological processes associated with various therapeutic strategies. These strategies include the inhibition of cellular proliferation96, the inhibition of phosphorylation97, cancer immunotherapy98, activation of programmed cell death99, regulation of senescence100, and evasion of endocrine resistance101.
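The Benjamini–Hochberg adjustment used for these annotations can be illustrated with a short, self-contained Python sketch. It is a generic implementation of the step-up procedure, not g:Profiler's internal code, and the raw P-values are placeholders rather than enrichment results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Placeholder raw p-values for five annotation terms; NOT study results.
raw_p = [0.0001, 0.0003, 0.02, 0.03, 0.5]
q = benjamini_hochberg(raw_p)

# Terms passing the FDR threshold quoted in the text (q < 0.001).
significant = [i for i, value in enumerate(q) if value < 0.001]
print(q)
print(significant)
```

Unlike the Bonferroni correction, which controls the probability of any false positive, this procedure controls the expected proportion of false positives among the reported terms, which suits enrichment analyses that test many annotations at once.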

The oncogenic variome of key druggable cancer-driving proteins and their deleterious effects

Figure 8A presents the analysis of 22,320 variants using OncodriveMUT and boostDM to determine the oncogenic variome in the 23 key druggable cancer-driving genes. This analysis identified 1,598 oncogenic variants, with 11 (1%) being previously known and 1,578 (99%) newly predicted. The analysis of deleteriousness scores revealed that 252 (16%) of these oncogenic variants had very high CADD scores, 788 (49%) had high CADD scores, and 506 (32%) had medium CADD scores. The most common types of genetic alterations were missense variants (81%), followed by frameshift (6%) and stop-gained variants (5%). Figure 8B displays box plots that illustrate the deleteriousness scores of the oncogenic variants according to their consequence types. Stop-gained variants exhibited the highest mean CADD score (37.2), followed by splice donor (31.2), splice acceptor (30.9), missense (25.9), frameshift (25.8), start lost (21.2), stop lost (17.7), inframe deletion (17.4), splice region (16.8), and inframe insertion variants (16.7). Lastly, Fig. 8C presents bean plots that rank the key druggable cancer-driving genes based on the highest number of oncogenic variants and their deleteriousness scores (Supplementary Table 17).

Figure 8

Oncogenic variome. (A) Identification of oncogenic variants in the 23 key druggable cancer-driving genes using the OncodriveMUT and boostDM machine learning methods, followed by an examination of their CADD deleteriousness scores and consequence types. (B) Ranking of the consequence types based on the highest mean CADD scores. (C) Bean plots illustrate the cancer-driving genes that possess the highest number of oncogenic variants, along with the CADD deleteriousness scores associated with these genes. The analysis of the oncogenic variome was conducted using the CGI platform (https://www.cancergenomeinterpreter.org)61,83, while the assessment of their deleteriousness was carried out using the CADD tool (https://cadd.gs.washington.edu/)84.

Identifying oncogenic variants in cancer-driving genes is crucial for developing targeted therapies102,103,104,105. These therapies are specifically designed to inhibit or modify the function of proteins produced by mutated genes, offering more effective treatment options with potentially fewer side effects compared with traditional chemotherapy106. Moreover, this approach enables personalized precision medicine. By understanding specific genetic and epigenetic alterations in a patient’s tumor, treatments can be tailored to target these changes107,108,109,110. In this context, the identification of oncogenic variants in druggable cancer-driving genes is a fundamental aspect of modern oncology, influencing everything from individual patient treatment to broader aspects of cancer research, ethnicity, and public health initiatives106,111,112.

This integrative approach has identified 23 key druggable cancer-driving proteins (CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4), setting the stage for improved therapeutic targets that could significantly boost the efficacy of clinical trials.

Testing the model’s limitations

Like any model, ours has limitations when used for prediction. Due to the limited data on druggable proteins, all 666 druggable proteins were used as class 1 to train the model. Consequently, no external dataset of druggable proteins is available to confirm the predictive power of the best model. One way to test the model's limitations is to plot the best protein predictions within the space of the selected features, alongside the druggable proteins and hard-to-drug proteins. Since plotting in 200 dimensions (the number of selected features in the best model) is impractical, we approximate by transforming these 200 dimensions into just 2 PCA components for visualization. Class 0 descriptors (hard-to-drug proteins), class 1 descriptors (druggable proteins), and the descriptors corresponding to the 23 key druggable cancer-driving proteins (predicted proteins) were converted into standard units, as in the original dataset for TC descriptors, and transformed into 2 PCA components for visualization in Fig. 9. In the figure, druggable proteins are shown in blue, hard-to-drug proteins in red, and the best predicted proteins in green. The plot indicates that even though the negative class (class 0) contains phosphatase proteins, there is no clear separation between the training classes 1 and 0 within the space of the selected TC descriptors, indicating a complex descriptor space.

Figure 9

Principal component analysis to test the model’s limitations. This figure shows a plot of the best protein predictions within the space of the selected features, alongside the druggable proteins (class 1) and hard-to-drug proteins (class 0). Descriptors for class 1, class 0, and the 23 key druggable cancer-driving proteins (predicted proteins) have been converted into standard units, as in the original dataset for TC descriptors, and transformed into 2 PCA components. Druggable proteins are shown in blue, hard-to-drug proteins in red, and the best-predicted proteins in green.
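The standardize-then-project procedure behind this figure can be sketched in dependency-free Python as follows. This is an illustration only: the study's notebooks presumably rely on a library PCA implementation, whereas this sketch substitutes power iteration with deflation, and the four-feature random data stand in for the 200 TC descriptors. All function names are hypothetical.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit variance (standard units)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(0)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    z = standardize(rows)
    cov = covariance(z)
    l1, v1 = top_eigenvector(cov)
    d = len(cov)
    # Deflation removes the first component so the second becomes dominant.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    _, v2 = top_eigenvector(deflated)
    return [[sum(a * b for a, b in zip(r, v1)),
             sum(a * b for a, b in zip(r, v2))] for r in z]


# Toy data: 30 samples with 4 partly correlated features (not TC descriptors).
rng = random.Random(42)
data = []
for _ in range(30):
    x = rng.gauss(0, 1)
    data.append([x + rng.gauss(0, 0.3), 2 * x + rng.gauss(0, 0.3),
                 rng.gauss(0, 1), -x + rng.gauss(0, 0.3)])

points = pca_2d(data)  # one (PC1, PC2) pair per sample, ready for a scatter plot
print(points[0])
```

Two components retain only part of the variance of a high-dimensional descriptor space, so overlap between classes in such a plot does not by itself imply that the classes overlap in the full feature space.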

Prediction points that fall within regions containing mixed points (both class 1 and class 0 points) may be the most trustworthy. In these regions, the model has been exposed to a more diverse dataset, enabling it to learn to better distinguish the patterns and characteristics that differentiate the two classes. Consequently, predictions in these regions are more likely to be accurate and reliable, as the model has learned more robust and generalizable features for data classification. The majority of the predicted proteins are located in these mixed regions, suggesting they have a higher potential to be future drug targets. From the supplementary material, a researcher can choose another model with a mean AUROC value greater than a specific cutoff (e.g., 0.9), with fewer features and possibly a better PCA representation of the predictions. Future studies should use artificial intelligence and docking tools to predict a list of potential current drugs or new ligands.

Repurposing drugs and metabolites

An additional step to confirm the 23 key druggable cancer-driving proteins involves predicting interactions with ChEMBL-approved drugs (2,466 molecules with masses between 100 and 500) through drug repurposing113,114. Using pairs of drug SMILES codes and protein sequences as inputs, a deep learning model called PLAPT evaluated the binding affinity (or negative log10 affinity)86. The model employs pre-trained transformers like ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model. Supplementary Tables 18 and 19, along with the GitHub file titled Supplementary_interactions_gene-drug(byPLAPT), present the affinity values for each drug-protein pair. The mean affinity values (minimum affinities or maximum negative log10 affinities) for all 23 proteins indicate that the top drugs clinically relevant to cancer treatment that can interact with these proteins include: mifepristone (targeting CASP8), pentostatin (BCL10, CASP8, CCNE1, and CDKN2A), afatinib (ACVR1, CDKN2C, and HRAS), alitretinoin (ACVR1, CDKN2C, HRAS, and PREX2), talazoparib (ACVR1, CDKN2C, and HRAS), alpelisib (ACVR1, CDKN2C, HRAS, NBN, PREX2, and SMARCA4), ulipristal acetate (ACVR1, ASXL1, CDKN2C, HRAS, NBN, PREX2, RB1, and SMARCA4), lorlatinib (ACVR1, ASXL1, ATG7, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), piflufolastat (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), pyrvinium pamoate (ASXL1, ATG7, BUB1B, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), and tepotinib hydrochloride (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1) (Fig. 10).
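To make the relationship between an affinity and its negative log10 value concrete, the Python sketch below converts between the two and ranks drugs by their mean predicted value across proteins. It does not call PLAPT; the drug names, protein names, and numbers are hypothetical, and the molar unit is an assumption.

```python
from collections import defaultdict
from statistics import mean


def to_molar(neg_log10_affinity):
    """Convert a negative log10 affinity (assumed molar) to a concentration."""
    return 10 ** (-neg_log10_affinity)


# Hypothetical predictions keyed by (drug, protein); NOT values from the study.
predictions = {
    ("drug_a", "PROT_1"): 8.0,
    ("drug_a", "PROT_2"): 6.0,
    ("drug_b", "PROT_1"): 5.0,
    ("drug_b", "PROT_2"): 6.0,
}

# Collect the predicted values of each drug across all proteins.
by_drug = defaultdict(list)
for (drug, _protein), value in predictions.items():
    by_drug[drug].append(value)

# A higher negative log10 affinity means tighter predicted binding.
mean_affinity = {drug: mean(values) for drug, values in by_drug.items()}
ranked = sorted(mean_affinity, key=mean_affinity.get, reverse=True)

print(ranked)
print(to_molar(9.0))  # a value of 9 corresponds to roughly nanomolar affinity
```

Because the scale is logarithmic, a difference of one unit corresponds to a tenfold difference in affinity, which is why maximum negative log10 values and minimum affinities identify the same drug-protein pairs.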

Figure 10

Drug repurposing. The Sankey plot shows the best interactions between the 23 key druggable cancer-driving proteins and the clinically relevant ChEMBL-approved drugs for cancer treatment (https://www.ebi.ac.uk/chembl/)113. The interactions are based on binding affinity values greater than 0.9 for each drug-protein pair. Lastly, Sankey plots were designed using the SankeyMATIC software (https://sankeymatic.com/ and https://github.com/nowthis/sankeymatic).

Mifepristone, a progesterone receptor antagonist, has been explored for its potential in treating glioblastoma, breast cancer, and uveal melanoma due to its ability to act on multiple receptor types, including glucocorticoid and androgen receptors115,116,117. Pentostatin is a chemotherapy drug primarily used for treating hairy cell leukemia and T-cell prolymphocytic leukemia. It is a purine analog that works by inhibiting the enzyme adenosine deaminase, crucial for DNA synthesis and cell replication, leading to the accumulation of deoxyadenosine triphosphate and ultimately causing cell death, particularly in rapidly dividing cancer cells118. Afatinib is an oral medication primarily used for treating non-small cell lung cancer. It functions as a tyrosine kinase inhibitor, targeting and blocking the EGFR protein as well as other members of the ErbB family, including HER2 and ErbB4119. Alitretinoin, a derivative of vitamin A, is used in cancer treatment primarily for Kaposi sarcoma. It binds to and activates retinoid receptors (RAR and RXR), which regulate gene expression involved in cell differentiation and proliferation, helping to inhibit the growth of Kaposi sarcoma cells120. Talazoparib works by inhibiting PARP enzymes, which play a crucial role in DNA repair. By blocking these enzymes, talazoparib prevents cancer cells from repairing their DNA, leading to cell death, especially in cells with BRCA1/2 mutations that already have compromised DNA repair mechanisms43,121. Alpelisib is an oral medication used in combination with fulvestrant to treat hormone receptor-positive, HER2-negative advanced or metastatic breast cancer with PIK3CA mutations. It works as a PI3K inhibitor, specifically targeting the alpha isoform of the enzyme, which is crucial in the PI3K/AKT signaling pathway involved in cancer cell growth and survival122. Ulipristal acetate modulates the progesterone receptor, which is implicated in the proliferation and growth of certain cancer cells. It competes with progesterone, thereby inhibiting the progesterone-induced proliferation of breast cancer cells, making it a candidate for reducing breast cancer risk, especially in individuals with BRCA1/2 mutations123. Lorlatinib inhibits ALK and ROS1 kinases, which are involved in cancer cell growth and survival. It is effective against multiple ALK mutations that confer resistance to first- and second-generation ALK inhibitors124. Piflufolastat F-18 binds to the prostate-specific membrane antigen (PSMA), a protein overexpressed on the surface of most prostate cancer cells. Once bound, the radioactive tracer emits positrons detected by a PET scanner, revealing the location of PSMA-positive lesions in the body125. Pyrvinium pamoate is an androgen receptor antagonist that targets multiple cellular pathways. It disrupts mitochondrial function by inhibiting electron transport chain complexes I and II, reducing mitochondrial fitness and increasing glycolysis, especially under hypoglycemic conditions often found in tumors. It also reduces signaling through the WNT and Hedgehog pathways, which are crucial for cancer cell proliferation and survival126,127,128,129. Lastly, tepotinib hydrochloride is a tyrosine kinase inhibitor targeting the MET receptor. By inhibiting this receptor, it interferes with cancer cell growth and survival pathways, which are crucial for the proliferation and metastasis of MET-altered cancer cells.

The last screening for interactions was conducted for the HRAS protein (P01112) using 217,776 molecules from the HMDB (see all affinities in Supplementary Table 20 and the GitHub file titled Supplimentary_affinities_hmdb_HRAS-P01112). The best potential interactions between HRAS and metabolites were identified for the following: cyanidin 5-O-beta-d-glucoside (HMDB0304305), chlorophyll (HMDB0303604), delphinidin 3-(3″-p-coumaroylglucoside) (HMDB0030099), cis-neoxanthin (HMDB0302969), verteporfin (HMDB0014603), pinotin A (HMDB0029240), benztropine (HMDB0014390), adapalene (HMDB0014355), inulin (HMDB0014776), and ceftriaxone (HMDB0015343). Future studies involving molecular docking, molecular dynamics, or other AI-based interaction prediction models will be needed to further confirm these interactions.

Conclusions

This study presents an innovative machine learning-based method for predicting druggable proteins that drive cancer, utilizing three sets of protein sequence descriptors: amino acid composition, di-amino acid composition, and tri-amino acid composition. These descriptors, chosen for their ability to capture essential information about protein sequences, have demonstrated robust performance across various machine learning classifiers.

Our results emphasize the effectiveness of these descriptors in balancing model performance, computational efficiency, and biological relevance. Specifically, the use of SVM classifiers with 200 TC descriptors selected by LinearSVC achieved high predictive accuracy. The model's robustness was supported by high AUROC values, with the best performance reaching an AUROC of 0.992 using SVM with 8,000 pure TC descriptors.

The practical utility of this model was demonstrated by predicting the druggability of 2,339 cancer-driving proteins, with 88.9% predicted to have druggable activity. Validation using ChEMBL evidence scores confirmed the model's accuracy in differentiating druggable from hard-to-drug proteins, highlighting its potential in drug discovery and therapeutic development.

Additionally, integrating multi-omics analyses and chemistry-based scores identified 23 key druggable cancer-driving proteins, prioritized based on their involvement in critical cancer-related pathways and ligandability. Analyzing these proteins and their interaction with clinically relevant drugs provides valuable insights for developing targeted cancer therapies.

While our study demonstrates the model's capabilities, it also acknowledges limitations, such as the challenge of validating predictions with external datasets due to limited data on druggable proteins. Nonetheless, the drug repurposing analysis identified high-affinity interactions between the 23 key druggable cancer-driving proteins and 11 clinically relevant FDA-approved drugs. Future research should aim to enhance model validation using artificial intelligence and docking tools to confirm predicted interactions with current drugs or new ligands, facilitating the translation of repurposed drugs into clinical trials.

In summary, this study provides a comprehensive framework for predicting druggable cancer-driving proteins, combining computational efficiency with biological relevance. The integration of machine learning, multi-omics analyses, and chemistry-based assessments paves the way for identifying and prioritizing new therapeutic targets, advancing precision oncology and personalized medicine.