Abstract
The druggable proteome refers to proteins that can bind to small molecules with appropriate chemical affinity, inducing a favorable clinical response. Predicting druggable proteins through screening and in silico modeling is imperative for drug design. To contribute to this field, we developed an accurate predictive classifier for druggable cancer-driving proteins using amino acid composition descriptors of protein sequences and 13 linear and non-linear machine learning classifiers. The optimal classifier was achieved with the support vector machine method, utilizing 200 tri-amino acid composition descriptors. The high performance of the model is evident from an area under the receiver operating characteristic curve (AUROC) of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The machine learning prediction model was enhanced with multi-omics approaches, including the target-disease evidence score, the shortest pathways to cancer hallmarks, structure-based ligandability assessment, unfavorable prognostic protein analysis, and the oncogenic variome. Additionally, we performed a drug repurposing analysis to identify drugs with the highest affinity capable of targeting the best predicted proteins. As a result, we identified 79 key druggable cancer-driving proteins with the highest ligandability, and 23 of them demonstrated unfavorable prognostic significance across 16 TCGA PanCancer types: CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4. Moreover, we prioritized 11 clinically relevant drugs targeting these proteins. This strategy effectively predicts and prioritizes biomarkers, therapeutic targets, and drugs for in-depth studies in clinical trials. Scripts are available at https://github.com/muntisa/machine-learning-for-druggable-proteins.
Introduction
The human genome comprises approximately 19,890 protein-coding genes, yet not all these proteins serve as suitable drug targets1,2,3. The druggable proteome refers to the subset of proteins capable of binding to an antibody or small molecule with the requisite chemical properties and affinity4. Druggability describes the feature of a target molecule, wherein it induces a favorable clinical response upon interaction with a drug-like compound5. It is noteworthy that an estimated 60% of small molecule drug discovery projects falter during the hit-to-lead phase due to the target’s lack of druggability5,6. Predicting a target’s druggability early in drug discovery is thus crucial. Only about 10% of the human genome consists of druggable targets, and merely half of these are disease-relevant7.
According to Gashaw et al., for a drug target to be considered ideal, it should possess specific properties: an unimpeded operation characterized by the absence of competitive binding, the presence of a biomarker that facilitates monitoring its efficacy, differential expression throughout the body to enable precise targeting, minimal interference with physiological conditions, the ability to alter a disease, and suitability for high-throughput screening7,8.
In the context of the human genome, which consists of numerous protein-coding genes, roughly 3,000 are believed to be part of the druggable genome. However, drugs that have received approval from the US Food & Drug Administration (FDA) only target a meager twenty percent of these proteins9. To provide more specifics, the FDA has approved 672 drugs, each classified based on its protein class: enzymes (260; 39%), transporters (149; 22%), G-protein coupled receptors (98; 15%), CD markers (71; 11%), voltage-gated ion channels (49; 7%), and nuclear receptors (24; 4%), to name a few10,11. It is essential to note that drugs rendering the protein target inactive are termed antagonists, while those that stimulate the protein target are labeled agonists. In terms of the cellular locations of the targets for these FDA-approved drugs, various prediction methods for transmembrane and signal proteins suggest that 250 (37%) were integral to the membrane, 201 (30%) were intracellular, 101 (15%) existed as single-pass transmembrane proteins, 83 (12%) were secreted, 28 (4%) appeared as combined membrane-bound and secreted isoforms, and 9 (1%) were simultaneously integral to the membrane and exhibited a single-pass membrane structure4,10,11.
The limited number of drugs approved to date can be attributed to several factors, including the intricacies of experimenting with all proteins and nucleic acid fragments, a lack of information related to ethnicity, and a limited understanding of many diseases at the molecular level12,13. Given these challenges, there is a significant demand for computational models that can accurately predict drug targets on a genome-wide scale, ensuring both high sensitivity and specificity5. Furthermore, leveraging extensive data sources, such as metabolic and gene regulatory networks, protein–protein interactions, multi-omics datasets, and gene expression profiles, in conjunction with data mining tools like machine learning (ML), can aid in constructing predictive models. These models can discern biologically relevant patterns that indicate druggability in potential drug targets14.
Several classification models have been developed for predicting protein activities, including anti-angiogenic15, anti-cancer16, enzyme classes15, epitopes17, signaling18, lectins19, antioxidants20, and druggability14,21,22,23,24,25,26,27. Thus, the main aim of our study was to build an effective ML classifier to forecast the druggability of cancer-driving proteins, validate them through integrated multi-omics approaches, propose potential druggable proteins per cancer type, and propose potential targeted drugs and metabolites.
Methods
Machine learning prediction model
Figure 1 presents the general flow chart of the proposed methodology to obtain a classifier for druggable proteins. First, we constructed a database of druggable proteins and ‘hard-to-drug’ proteins. Second, three families of protein composition descriptors were calculated using Rcpi (R package)28: 20 amino acid composition (AC), 400 di-amino acid composition (DC), and 8,000 tri-amino acid composition (TC) descriptors. In the next step, Jupyter notebooks with Python scikit-learn29 were used to test 13 types of ML classifiers by combining the 3 families of descriptors (AC, DC, TC) with five different feature selection methods and with different parameters. The employed classifiers include Gaussian Naive Bayes (GNB)30, k-nearest neighbors algorithm (KNN)31, linear discriminant analysis (LDA)32, support vector machine (SVM), both linear and non-linear based on radial basis functions (RBF)33, logistic regression (LR)34, multilayer perceptron (MLP) or neural network with 20 neurons in one hidden layer35, decision tree (DT)36, random forest (RF)37, XGBoost (XGB), an optimized distributed gradient boosting library38, Gradient Boosting for classification (GB)39, AdaBoost classifier (AdaB)40, and Bagging classifier41. The feature selection methods utilized were principal component analysis (PCA)42, feature selection based on a percentile of the highest scores with f_classif (ANOVA F-value between label/feature for classification tasks), a feature selector removing features with variance below a threshold, linear support vector classification, and the extra-trees classifier.
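The three descriptor families can be illustrated with a minimal Python sketch. It is not the Rcpi implementation used in the study: the function name `composition`, the toy sequence, and the normalization by the number of sliding windows are assumptions made for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence, k):
    """Return the k-mer composition of a protein sequence as frequencies.

    k = 1, 2, 3 gives 20 (AC), 400 (DC), and 8,000 (TC) descriptors.
    Assumption: counts are divided by the number of sliding windows.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1
    for i in range(max(windows, 0)):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-standard residues
            counts[kmer] += 1
    total = max(windows, 1)
    return {kmer: n / total for kmer, n in counts.items()}


seq = "MKHMESSSHTD"  # hypothetical toy sequence, not a real protein
ac, dc, tc = (composition(seq, k) for k in (1, 2, 3))
print(len(ac), len(dc), len(tc))
```

Each protein therefore becomes a fixed-length numeric vector regardless of its sequence length, which is what allows the same classifiers to be applied to every protein in the dataset.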
GNB is a probabilistic classifier based on Bayes' theorem that assumes all features are conditionally independent given the class30. KNN is a non-parametric classifier that assigns an unclassified sample to the majority class among its k nearest samples in the training set (k = 3)31. LDA is a fundamental linear classifier that fits class-conditional densities to the input features and applies Bayes’ rule32. The linear SVM finds a maximum-margin separating hyperplane in the original feature space33, while for nonlinear problems, SVM employs a Gaussian radial basis function kernel, which implicitly maps the input features into a higher-dimensional space. LR, another linear classifier, estimates binary response probabilities from a weighted combination of the input features34. The MLP is a type of neural network built from artificial neurons, here with a single hidden layer, that can use both linear and nonlinear activation functions35. DT builds decision rules from the input features, with classification rules defined as paths from the root to a leaf36. RF, an ensemble method, combines decision trees trained in parallel; the individual trees have low bias and low mutual correlation, and averaging them reduces variance37. XGB, an ensemble method, uses sequential weak trees to improve classification performance38. GB is a boosting method for classification that adds weak classifiers sequentially39. AdaB is a meta-estimator that first fits a classifier on the original dataset and then fits additional copies of the classifier with adjusted weights for misclassified instances40. The Bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and aggregates their predictions41.
The ML prediction model was constructed using two protein sets. The positive set comprised 666 druggable proteins with FDA-approved drugs, as per the DrugBank database (www.drugbank.ca)43 and the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing)44. In contrast, the negative protein set consisted of 219 ‘hard-to-drug’ protein phosphatases, which were previously referred to as ‘undruggable’ targets45. As noted by Xie et al., kinases are classic examples of druggable targets that play a significant role in modulating cell motility. Conversely, phosphatases act as their counterparts, critically regulating cellular dynamics by removing phosphate from proteins, including serine, threonine, and tyrosine residues46,47,48. Detailed information on the gene symbol and gene ID for all druggable and ‘hard-to-drug’ proteins can be found in Supplementary Tables 1 and 2, while Supplementary Tables 3 and 4 provide the FASTA sequences of all proteins analyzed in this study. Lastly, the final ML prediction model was applied to scan 2,339 cancer-driving proteins sourced from the Network of Cancer Genes49 (Supplementary Table 5).
After computing the amino acid composition descriptors, the datasets comprised 885 proteins. Proteins in the druggable class were labeled as 1, while those in the ‘hard-to-drug’ class were labeled as 0. Due to the imbalance in the datasets, we employed the synthetic minority over-sampling technique (SMOTE) as described by Chawla et al.50. We used a threefold cross-validation (CV) approach to construct the ML classifiers. For each fold, a sequential pipeline was executed: (a) Scaling: the training set was standardized using the StandardScaler, and the test set was transformed to match the same scale. (b) Feature Selection or Dimension Reduction: the dimensionality of the training set was either reduced using a feature selection method, such as LinearSVC, or through a dimension reduction technique like Principal Component Analysis (PCA). (c) Cross-Validation Evaluation: The cross_val_score method was employed to compute the area under the receiver operating characteristic (AUROC) scores across the 13 ML methods for all splits. (d) Mean values and standard deviations (SD) of the AUROC scores for each ML classifier were calculated and displayed for the test subset51.
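Step (a) of the per-fold pipeline can be sketched in plain Python to show why the scaler is fitted on the training split only. This is an illustrative stand-in for scikit-learn's StandardScaler and stratified splitting, not the study's notebook code; the helper names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def stratified_folds(labels, n_folds=3):
    """Assign each sample index to a fold, keeping class proportions similar."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds


def fit_scaler(rows):
    """Column-wise mean and population SD, learned from training rows only."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def transform(rows, means, sds):
    """Standardize rows with statistics obtained from fit_scaler."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical toy data: 6 'druggable' (1) and 3 'hard-to-drug' (0) samples
X = [[float(i), 10.0 * i] for i in range(1, 10)]
y = [1] * 6 + [0] * 3

for test_idx in stratified_folds(y):
    train_idx = [i for i in range(len(y)) if i not in test_idx]
    means, sds = fit_scaler([X[i] for i in train_idx])
    X_train = transform([X[i] for i in train_idx], means, sds)
    X_test = transform([X[i] for i in test_idx], means, sds)
    # Feature selection and the classifier would be fitted on X_train here.
    print(len(X_train), len(X_test))
```

Reusing the training means and standard deviations on the test split keeps information from the held-out proteins out of the fitted model.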
The best model to be used for predictions was chosen using criteria such as mean AUROC, SD of AUROC, the number of features, and the type of model features (original or transformed). All the results obtained can be reproduced by using the scripts available at the GitHub repository: https://github.com/muntisa/machine-learning-for-druggable-proteins.
In addition, the importance of the features for the best model was analyzed using a function that calculates the permutation feature importance. This is done by randomly shuffling each feature and measuring the decrease in the model's performance. This process was repeated 10 times for each feature, and the average importance value was calculated. The result was a list of feature importances, which can be used to identify the most important features in the model. The importance values were normalized (values between 0 and 1), and the top 10% most important features were highlighted. Additionally, an extra analysis of the single amino acid frequencies in all selected features of the best model was conducted, with values also normalized between 0 and 1.
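The permutation procedure described above can be sketched as follows. The sketch uses accuracy as the performance measure, min-max normalization, and a toy stand-in for a fitted classifier; all three are assumptions for illustration, and the study's own function may differ.

```python
import random


def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature is shuffled, scaled to 0-1."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    raw = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        raw.append(sum(drops) / n_repeats)
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]


# Hypothetical toy data: the label depends on feature 0 only.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]
model = lambda row: row[0]  # stand-in for a fitted classifier

importances = permutation_importance(model, X, y)
print(importances)
```

A feature that the model ignores produces no drop in performance when shuffled, so it receives the lowest normalized importance.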
Target-disease evidence score
Open Targets (https://www.targetvalidation.org) is a platform that provides comprehensive data integration, enabling access to and visualization of potential drug targets associated with cancer52. ChEMBL (https://www.ebi.ac.uk/chembl/) is a database that catalogs bioactive molecules with drug-like properties53. The ChEMBL evidence score denotes a target-disease relationship that is supported by an FDA-approved drug or a clinical candidate drug targeting the gene product in question and indicated for cancer treatment54. In this study, to validate the significance of our previously predicted druggable proteins, we compared the ChEMBL evidence scores of druggable cancer-driving proteins and those of ‘hard-to-drug’ proteins. The scores were then statistically analyzed using the Bonferroni correction test, with a significance threshold set at P < 0.001.
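As a brief illustration of the Bonferroni correction, each raw P-value from a pairwise comparison is multiplied by the number of comparisons and capped at 1. The P-values below are hypothetical and are not results from this study.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P-values (raw P times the number of tests, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


raw_p = [0.0001, 0.01, 0.5]  # hypothetical pairwise comparison P-values
adjusted = bonferroni(raw_p)
significant = [p < 0.001 for p in adjusted]
print(adjusted, significant)
```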
In our pursuit to identify and prioritize the most critical druggable cancer-driving proteins, we retrieved target-disease evidence scores provided by ten distinct bioinformatic tools. This analytical effort spanned proteins currently undergoing clinical trials as well as those not yet under examination in clinical trials. Regarding these tools, Open Targets Genetics (https://genetics.opentargets.org) specializes in identifying trait-causal genes from significant loci in genome-wide association studies (GWAS)55. ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) serves as a repository detailing the relationships between human germline or somatic variants and their associated phenotypes. Its evidence score is based on the clinical significance of a genetic variant56,57. The Genomics England PanelApp (https://panelapp.genomicsengland.co.uk) merges crowdsourced expertise with curation to establish target-cancer relationships58. Cancer Gene Census (https://cancer.sanger.ac.uk/census), part of the Wellcome Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC), aims to catalog genes containing mutations causally linked to cancer59. IntOGen (https://www.intogen.org) offers a methodology to pinpoint potential cancer driver genes, using large-scale mutational data from sequenced tumor samples60. The Cancer Biomarkers database, part of the Cancer Genome Interpreter (http://www.cancergenomeinterpreter.org), features biomarkers relevant to drug sensitivity, resistance, and toxicity for drugs targeting specific cancer entities61. SLAPenrich (https://saezlab.github.io/SLAPenrich/) introduces a novel statistical approach to recognize significantly mutated pathways at the population level across large cohorts of cancer patients62. The Reactome Knowledgebase (https://reactome.org) delineates molecular aspects of various cellular functions like signal transduction and metabolism, identifying reaction pathways impacted by specific cancer types63.
Phenotype comparisons for DIsease Genes and Models (PhenoDigm) (http://www.sanger.ac.uk/resources/databases/phenodigm), an algorithm provided by the International Mouse Phenotyping Consortium (IMPC), offers insights into gene–disease associations by analyzing phenotype data64. Lastly, an overall score was calculated integrating the information from all bioinformatic approaches to identify and prioritize essential druggable cancer-driving proteins.
Drugs involved in late-phase clinical trials
The Open Targets platform (https://www.targetvalidation.org), enhanced with ChEMBL annotations, provides an integrated data framework. This enables access and visualization of potential drug targets associated with cancer52,54,65. Furthermore, the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing) is a curated collection of drugs approved by the Food and Drug Administration (FDA). This hub aided us in discerning the mechanisms of action for drugs employed in cancer treatments44. As a result, we analyzed the druggable cancer-driving proteins with a ChEMBL evidence score of > 0.9 to map out the therapeutic landscape for drugs in phase III and IV clinical trials.
Distance score of shortest pathways to cancer hallmark phenotypes
CancerGeneNet (https://signor.uniroma2.it/CancerGeneNet/) is a curated bioinformatics resource provided by the SIGnaling Network Open Resource (SIGNOR 3.0)66,67. This platform uses experimental annotations to bridge two interaction layers crucial to cell physiology, connecting proteins influenced by cancer drivers with proteins that impact on the hallmarks of cancer66,68. To elucidate the dynamics of these interactions, the procedure for calculating the distance score for the shortest pathways is outlined as follows: a) initiate a path query between two nodes; b) within the path string, each step is characterized by a pair of nodes and an edge, representing the nature of the interaction (e.g., activation or inhibition); c) the ‘distance’ parameter calculates the path length, incorporating the reliability of each step. Each step's reliability score, denoted as 'r', is derived from supporting evidence extracted from the STRING database69. This score is converted into a distance using the equation: \(d=1-r\). The final path score, represented as \({D}_{path}={\sum }_{rel=1}^{N} \left(1-{r}_{rel}\right)\), is the sum of each step distance, with 'N' standing for the total number of steps in a path67,70.
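The distance score can be illustrated with a small Dijkstra search in which every edge with reliability r costs d = 1 − r, so the returned value equals D_path for the most reliable route. This is a sketch under stated assumptions, not the CancerGeneNet implementation: the node names and reliability values are hypothetical, and the sign of each interaction (activation or inhibition) is ignored.

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra search where each directed edge (u, v, r) costs d = 1 - r."""
    graph = {}
    for u, v, r in edges:
        graph.setdefault(u, []).append((v, 1.0 - r))
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + step, neighbor, path + [neighbor]))
    return float("inf"), []  # target not reachable from source


# Hypothetical network: (source, target, reliability r)
edges = [
    ("GENE_A", "GENE_B", 0.9),
    ("GENE_B", "PROLIFERATION", 0.8),
    ("GENE_A", "PROLIFERATION", 0.5),
]
distance, path = shortest_path(edges, "GENE_A", "PROLIFERATION")
print(round(distance, 3), path)
```

In this toy network the two-step route through GENE_B is preferred over the direct edge because its summed distance (0.1 + 0.2) is lower than 0.5, which shows how well-supported multi-step paths can outrank a weakly supported direct link.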
Iannuccelli et al. implemented a programmatic approach to calculate the shortest distance scores, or paths, between specific proteins and cancer phenotypes using the ‘shortest path’ function from the igraph R package. Our primary aim was to probe the signaling nexus between druggable cancer-driving proteins (not yet involved in clinical trials and with a ChEMBL evidence score = 0) and the hallmarks of cancer71.
Within this framework, we determined the shortest paths for both positive and negative regulations of druggable cancer-driving proteins linked to angiogenesis, immortality, inflammation, metastasis, proliferation, cell death, differentiation, DNA repair, and glycolysis. We then carried out multiple comparison tests, employing the Bonferroni correction (P < 0.001, 95% confidence interval), to compare the distance scores of druggable cancer-driving proteins across different cancer phenotypes. Lastly, we ranked these druggable proteins with the shortest paths to each cancer hallmark phenotype.
Chemistry-based score
canSAR (http://cansar.icr.ac.uk) is a comprehensive knowledgebase dedicated to drug discovery. It integrates data from genomics, proteomics, pharmacology, drugs, and chemicals with structural proteins and protein networks72. This bioinformatic resource encompasses the complete human proteome (20,375 sequences) sourced from the UniProt Swiss-Prot database73. Additionally, canSAR provides an extensive structure-based ligandability assessment, covering more than 4.5 million cavities72. The chemistry-based score is categorized into four levels: low (0–24%), suggesting the protein is less likely to be a successful drug target; moderate (25–49%), indicating a moderate probability of druggability; high (50–74%), suggesting the protein has a good probability of being druggable; and very high (75–100%), indicating the protein is very likely to be druggable and is often considered a high priority for drug development due to its high probability of successful binding with drugs72. Using these data, we retrieved the chemistry-based score to validate our machine learning prediction method and prioritize key druggable cancer-driving proteins for each cancer type.
A pathology atlas for human cancer
The Human Pathology Atlas, available at (https://www.proteinatlas.org/humanproteome/pathology), is an integral component of the Human Protein Atlas project. This atlas explores the prognostic relevance of druggable cancer-driving genes/proteins across 17 The Cancer Genome Atlas (TCGA) PanCancer types in almost 8,000 patients74,75. Anchored in transcriptomics and antibody profiling, the atlas emerges as an essential tool for tailoring cancer treatments based on precision oncology74. Immunohistochemistry (IHC) stands as the gold standard method for in situ protein expression analysis in tissue samples. The combination of IHC and tissue microarray (TMA) technology allows simultaneous analysis of hundreds of tissue samples with an unprecedented degree of experimental standardization76.
The Atlas provides staining profiles for proteins in human tumor tissues, generated through the synergy of IHC and TMA techniques. This is complemented by Kaplan–Meier analysis, linking mRNA expression levels to patient survival. Patient samples were classified into two expression groups and the correlation between expression level and patient survival was examined. Using Kaplan–Meier survival estimators, the prognosis of different patient cohorts was determined. Log-rank tests were employed to compare these results. Genes/proteins with marked correlations to detrimental outcomes (log-rank P-values < 0.001) in the Kaplan–Meier evaluations were pinpointed as unfavorable prognostic indicators across TCGA PanCancer types77.
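For readers unfamiliar with the Kaplan–Meier estimator, the following sketch computes the survival curve of one expression group. It is a generic textbook implementation with hypothetical follow-up data, not the Atlas pipeline, and it omits the log-rank comparison between groups.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        removed = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times (years) for a high-expression group
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```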
Functional enrichment analysis
Functional enrichment analysis provides researchers with curated insights and a deeper understanding of protein sets derived from omics-scale experiments. For our study, we focused on druggable cancer-driving proteins that have not yet entered clinical trials, as indicated by a ChEMBL evidence score of 0. These proteins also show an unfavorable prognosis across various TCGA PanCancer types. To evaluate enrichment, we employed g:Profiler version e101_eg48_p14_baf17f0, accessible at (https://biit.cs.ut.ee/gprofiler/gost)78. Our objective was to pinpoint significant annotations, following the Benjamini–Hochberg FDR q < 0.001 criteria, related to Gene Ontology (GO) biological processes (http://geneontology.org/)79 and Reactome signaling pathways (https://reactome.org/)63. The results of the functional enrichment analysis were visualized using a Manhattan plot, and significant terms associated with cancer hallmark phenotypes were manually curated80.
The oncogenic variome of key druggable cancer-driving proteins
Identifying the oncogenic variome of druggable cancer-driving proteins encompassed two primary steps. First, we obtained 22,320 single nucleotide and insertion/deletion variants from the 23 key druggable cancer-driving proteins. These data were retrieved from 76,156 genomes belonging to the Genome Aggregation Database (gnomAD v3.1.2) (https://gnomad.broadinstitute.org/), using the GRCh38/hg38 human reference genome1,81,82. Second, we applied the OncodriveMUT and boostDM methods integrated within the Cancer Genome Interpreter (CGI) platform (https://www.cancergenomeinterpreter.org) to evaluate the tumorigenic potential of the acquired genomic variants. This approach enabled us to categorize driver variants into known, predicted, and passenger classifications based on the Catalog of Validated Oncogenic Mutations61,83. OncodriveMUT is a rule-based strategy that analyzes genomic features, such as regions depleted by germline variants, gene mechanisms of action, gene signals of positive selection, and clusters of somatic mutations. Conversely, boostDM is a machine learning strategy that assesses the oncogenic potential of mutations in human tissues by employing in silico saturation mutagenesis of cancer genes61,83.
Deleteriousness of the oncogenic variome
The Combined Annotation-Dependent Depletion (CADD) tool, version 1.4 (https://cadd.gs.washington.edu/), is a bioinformatic resource that assesses the deleterious effects of diverse gene mutations within the human genome. It integrates over 60 genomic features to evaluate the impact of single nucleotide and insertion/deletion variants84. The CADD framework analyzes multiple annotations by comparing variants that have survived natural selection against simulated mutations, using the GRCh38/hg38 human reference genome85. For this study, we applied CADD to determine the deleteriousness of both known and predicted oncogenic variants associated with pivotal druggable cancer-driving genes. Lastly, the CADD deleteriousness scores were categorized as very high (30–50), high (25–30), medium (15–25), low (10–15), and very low (0–10).
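The score categories can be expressed as a small lookup function. Because the published ranges share their boundary values (for example, 25 closes one range and opens the next), the sketch assumes that each lower bound is inclusive; this choice and the function name are illustrative assumptions.

```python
def cadd_category(score):
    """Map a CADD score (0-50) to the deleteriousness category used in the text.

    Assumption: each lower bound is inclusive, so a score of exactly 25 is 'high'.
    """
    if not 0 <= score <= 50:
        raise ValueError("CADD score expected in the range 0-50")
    if score >= 30:
        return "very high"
    if score >= 25:
        return "high"
    if score >= 15:
        return "medium"
    if score >= 10:
        return "low"
    return "very low"


for example_score in (32.0, 25.0, 17.4, 10.0, 3.2):  # hypothetical scores
    print(example_score, cadd_category(example_score))
```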
Artificial intelligence prediction of drugs and metabolites
To investigate the potential interactions of current drugs (for drug repurposing) and metabolites with the best-predicted proteins, an artificial intelligence (AI)-based tool called Protein–Ligand Binding Affinity Prediction Using Pretrained Transformers (PLAPT) was used to predict the interaction affinity (or negative log10 affinity)86. PLAPT (https://github.com/trrt-good/WELP-PLAPT/) predicts the binding affinity of ligand–protein complexes using the SMILES code of ligands and the sequence of proteins. The model employs pre-trained transformers such as ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model.
Using the PLAPT Python package, all ligand–protein affinities were calculated. Two families of ligands were used: 2,466 ChEMBL approved drugs with masses between 100 and 500 (downloaded via a Python script) for potential drug repurposing against multiple predicted druggable proteins, and 217,776 molecules from the Human Metabolome Database (HMDB) (https://hmdb.ca/downloads) against a single important predicted druggable protein (due to computational limits)87.
Results and discussion
Machine learning prediction model
The current study introduces innovative classification models designed to predict new druggable proteins that drive cancer. These predictions are based on three sets of protein sequence descriptors (amino acid composition, di-amino acid composition, and tri-amino acid composition), calculated using Rcpi. These descriptors were chosen for their proven ability to capture essential information about protein sequences that are critical for predicting druggability88.
AC effectively represents a protein's primary structure by highlighting the frequency of each amino acid within the sequence, helping to identify general trends and patterns associated with druggable proteins. DC captures the local interactions between pairs of amino acids, providing insight into the secondary structure and local folding patterns, which are crucial for understanding functional regions and binding sites. TC considers interactions between triplets of amino acids, offering a more detailed view of the amino acid sequence, which is essential for accurately predicting protein interactions with drugs and other ligands30,89.
Focusing on these features ensures computational efficiency and reduces the risk of overfitting, which can occur with an excessive number of features. Our comprehensive benchmarking demonstrated that these descriptors consistently provided robust performance across various machine learning classifiers. While the inclusion of additional features, such as secondary structure elements or solvent accessibility, might offer incremental benefits, the chosen descriptors strike an optimal balance between model performance, computational feasibility, and biological relevance. This balance allows for effective and interpretable predictions while maintaining the practicality of the computational framework. Furthermore, the identified amino acid sequence patterns will inform future studies on protein properties.
Subsequently, we utilized Jupyter notebooks built on Python and scikit-learn to construct 13 types of ML classifiers (GNB, KNN, LDA, SVM linear, SVM, LR, MLP, DT, RF, XGB, GB, AdaB, and Bagging), along with five types of feature selection methods with various parameters (Fig. 1). All scripts used the mean AUROC values from threefold cross-validation to quantify classification performance. We tested models using 20, 100, 200, and 400 features30.
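As an illustration of the performance measure, AUROC can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, and then summarized across folds. The labels and scores below are hypothetical, and the use of the population standard deviation is an assumption about how the SD was obtained.

```python
from statistics import mean, pstdev


def auroc(labels, scores):
    """AUROC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-fold test labels and classifier scores
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]),
    ([1, 0, 1, 0], [0.7, 0.7, 0.9, 0.1]),
]
fold_scores = [auroc(y, s) for y, s in folds]
print(f"{mean(fold_scores):.3f} +/- {pstdev(fold_scores):.3f}")
```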
Figure 2 illustrates the AUROC values for classifiers using only 20 features: AC descriptors without feature selection, DC descriptors with LinearSVC feature selection (DC-LinearSVC20), PCA features from DC (DC-PCAn20), TC descriptors selected by SelectPercentile(f_classif, percentile = 0.25) (TC-Percn20), and TC descriptors selected with LinearSVC (TC-LinearSVC20). Notably, using only 20 AC descriptors with SVM yielded an AUROC of 0.926. The best performance was achieved using SVM (RBF) with 20 PCA components from 400 DC descriptors, resulting in an AUROC of 0.958. Additional results can be found in Supplementary Table 6.
Figure 3 displays AUROC values for classifiers using 100 features: PCA transformation of 400 DC descriptors (DC-PCAn100), TC descriptors selected with SelectPercentile(f_classif, percentile = 1.25) (TC-Perc1.25), TC descriptors with LinearSVC (TC-LinearSVC100), and 100 features selected by LinearSVC from 200 PCA components of 8,000 TC descriptors (TC-PCA200LinearSVC100). Increasing the number of features from 20 to 100 (a fivefold increase) improved the AUROC to 0.976 using the same SVM (RBF) classifier with the TC-PCA200LinearSVC100 feature set30.
Figure 4 shows the AUROC values for classifiers using 200 selected features (double the number from 100): PCA transformation of 400 DC descriptors (DC-PCAn200), DC descriptors selected with SelectPercentile (DC-Perc50), 200 PCA components of 8,000 transformed TC descriptors (TC-PCAn200), TC descriptors selected with SelectPercentile(f_classif, percentile = 2.5) (TC-Perc2.5), and TC descriptors with LinearSVC (TC-LinearSVC200). The combination of PCA and SVM for DC-PCAn200 resulted in the best classifier, achieving an AUROC of 0.981 (Supplementary Table 6). Further, using all 400 DC descriptors with SVM, the mean AUROC reached 0.982 ± 0.0021. Additionally, with all 8,000 untransformed TC descriptors and linear SVM, the mean AUROC was 0.992 ± 0.0028. However, a classification model should avoid having more input features than data instances, and the dataset contains only 885 proteins. We also sought to prioritize original descriptors over PCA transformations. As a compromise, we selected the following as the best model for subsequent predictions on cancer-related proteins: 200 TC descriptors selected with LinearSVC, combined with a non-linear SVM classifier, which achieved an AUROC of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The list of the 200 selected features is available in the Jupyter notebooks.
Selected features analysis
The following is the list of selected features for the best model: NRA, QRA, INA, MCA, YEA, THA, CSA, VYA, KNR, WDR, TER, PQR, YGR, EHR, LIR, VSR, ERN, MDN, SDN, LHN, YIN, FFN, RSN, QSN, FWN, ACD, WCD, MED, CHD, SHD, MLD, SMD, WPD, SSD, HTD, DWD, VYD, KNC, NDC, IHC, VHC, GYC, MCE, NHE, ALE, HME, LPE, AWE, EYE, QYE, GVE, FVE, SAQ, FNQ, MDQ, PCQ, WEQ, RQQ, NGQ, HLQ, RMQ, DFQ, GPQ, DSQ, YSQ, AWQ, RVQ, QRG, HGG, TGG, KLG, NKG, FPG, SSG, RTG, PTG, IVG, CDH, FDH, PDH, TQH, KHH, FHH, IFH, NSH, WSH, FWH, WRI, NDI, EDI, FEI, WEI, WQI, MGI, PMI, AAL, EKL, IKL, FKL, GPL, ESL, DVL, MVL, VVL, GNK, HNK, HDK, HCK, EQK, DHK, QLK, EKK, SMK, FFK, QSK, EWK, AVK, WRM, WNM, REM, WQM, SHM, LLM, SMM, NFM, TSM, RWM, GYM, KYM, VYM, HVM, IVM, LDF, YQF, NGF, HGF, FWF, FAP, FNP, PEP, SQP, QGP, VHP, PLP, HKP, NPP, QPP, STP, TTP, KWP, YWP, SRS, HDS, WDS, HCS, LES, DHS, SHS, PSS, SSS, LWS, LAT, DRT, GRT, IRT, INT, VQT, NLT, CLT, KKT, YTT, QWT, FYT, KCW, QGW, VGW, MIW, IKW, RFW, DFW, HVW, KVW, NRY, CHY, DMY, YPY, YAV, SRV, ENV, HNV, GEV, QGV, HGV, TGV, WHV, LLV, IMV, DSV, TSV, QYV. The normalized importance for the 10% selected features is presented in Table 1. The most important amino acid patterns for this classification are HME, NSH, SSS, HTD, DHK, ERN, NDI, DRT, VYD, FFN, SHM, NDC, RFW, WRI, GYC, MGI, PEP, GVE, DSQ, and LLV. The HME pattern is the most important feature for druggable proteins, while the NSH pattern has only half the importance of HME.
In Table 2, the frequencies of the amino acids in all selected features demonstrate the importance of H (histidine), S (serine), D (aspartic acid), and Q (glutamine) in classifying druggable proteins. Additionally, H and S appear in the first five most important tri-amino acid patterns. The biological significance of the amino acids in these patterns is outlined below: (a) HME (histidine–methionine–glutamic acid): Histidine is essential for protein synthesis and enzyme catalysis, methionine is the initiator amino acid for protein translation, and glutamic acid is involved in neurotransmission and protein folding; (b) NSH (asparagine–serine–histidine): Asparagine is crucial for glycoprotein synthesis and serine is involved in phosphorylation and protein structure; (c) SSS (serine–serine–serine): serine is essential for cell signaling, protein synthesis, and metabolism; (d) HTD (histidine–threonine–aspartic acid): Threonine is important for protein stability and immune function, and aspartic acid contributes to protein structure and function; and (e) DHK (aspartic acid–histidine–lysine): Lysine is essential for protein synthesis and collagen formation5,90,91.
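To make the descriptors concrete, the sketch below computes a tri-amino acid composition as the relative frequency of each pattern among the overlapping three-residue windows of a sequence, and then counts residues across a set of patterns, which is the idea behind Table 2. The normalization used here (count divided by the number of windows) is an assumption, and the toy sequence is invented purely to exercise the function.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8,000 possible tri-amino acid patterns.
ALL_TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_composition(sequence, patterns):
    """Relative frequency of each pattern among overlapping 3-residue windows."""
    windows = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    counts = Counter(windows)
    total = len(windows)
    return {p: (counts[p] / total if total else 0.0) for p in patterns}


# The five most important patterns named in the text.
top_patterns = ["HME", "NSH", "SSS", "HTD", "DHK"]

# Made-up toy sequence (not a real protein), used only for illustration.
toy_sequence = "MHMESSSSNSHTD"
tc = tripeptide_composition(toy_sequence, top_patterns)

# Residue frequencies across the chosen patterns.
residue_counts = Counter("".join(top_patterns))
print(tc)
print(residue_counts.most_common(3))
```

In this small example H and S each occur four times across the five patterns, in line with the prominence of histidine and serine noted above.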
Cancer-driving proteins
We transformed 2,339 cancer-driving proteins into molecular descriptors using the best model to predict their druggability. Consequently, these protein sequences were converted into 200 selected TC descriptors. As a result, 2,080 (88.9%) of these cancer-driving proteins were predicted to have druggable activity (Fig. 5A and Supplementary Table 5). For validation, we compared the ChEMBL evidence scores of proteins involved in clinical trials54, distinguishing among the positive set of druggable proteins (mean score = 0.712), druggable cancer-driving proteins (class 1, mean score = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean score = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean score = 0.414). As expected, the Bonferroni correction revealed no significant difference between the positive set and druggable cancer-driving proteins, nor between the negative set and ‘hard-to-drug’ proteins. Interestingly, it did reveal a significant difference between druggable cancer-driving proteins (class 1) and ‘hard-to-drug’ proteins (class 0) (P < 0.001) (Fig. 5B). This indicates that druggable cancer-driving proteins are distinctively more validated as potential targets compared to ‘hard-to-drug’ proteins, underscoring the relevance and accuracy of the classification method used. These findings validate the effectiveness of the prediction model in distinguishing between truly druggable targets and those that are more challenging to target therapeutically, highlighting its potential utility in the drug discovery process.
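For readers unfamiliar with the Bonferroni correction applied to the group comparisons above, the following sketch shows the adjustment itself: each raw p-value is multiplied by the number of tests and capped at 1. The comparison names, the number of comparisons, and the raw p-values are hypothetical placeholders; they are not results from this study.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Hypothetical raw p-values for pairwise group comparisons (placeholders only).
raw = {
    "positive_vs_class1": 0.40,
    "class1_vs_class0": 0.00001,
    "class0_vs_negative": 0.02,
}

adjusted = bonferroni(raw)
# Comparisons that remain significant at the P < 0.001 level after adjustment.
significant = {name for name, p in adjusted.items() if p < 0.001}
print(adjusted)
print(significant)
```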
Following the prediction and validation of the 2,080 druggable cancer-driving proteins, we extracted the target-disease evidence scores from the Open Targets platform. This was done to prioritize the most relevant druggable cancer-driving proteins already involved in late-stage clinical trials (ChEMBL score > 0.9) and those not yet involved in clinical trials (ChEMBL score = 0)52,53,54. The target-disease evidence score encompassed data from various bioinformatic tools including Open Targets Genetics55, ClinVar (covering germline and somatic variants)56,57, Genomics England PanelApp58, Cancer Gene Census59, IntOGen60, the Cancer Biomarkers database61, SLAPenrich62, the Reactome Knowledgebase63, and PhenoDigm64. This overall score, derived from an integration of these bioinformatic approaches, enabled us to identify proteins strongly associated with cancer traits. Of these, 52 were druggable cancer-driving proteins involved in late-phase clinical trials (Fig. 5C and Supplementary Tables 7 and 8), and 296 were druggable cancer-driving proteins not yet involved in clinical trials (Fig. 5D and Supplementary Tables 7 and 9). Furthermore, the five bioinformatic approaches yielding the highest target-disease evidence scores for the 296 druggable proteins not yet in clinical trials were Cancer Gene Census (mean = 0.90), SLAPenrich (0.88), Reactome (0.84), Genomics England PanelApp (0.79), and Cancer Biomarkers (0.77) (Fig. 5E).
Drugs involved in late-phase clinical trials
Figure 6 presents an update on phase III and IV clinical trials involving drugs that target cancer-driving proteins, as cataloged by the Open Targets Platform52. The Sankey plot in the figure reveals a total of 257 clinical trial events, involving 94 drugs with 38 different mechanisms of action, which target 52 key cancer-driving proteins across 26 types of cancer (Supplementary Table 10). The most frequently involved drugs in these late-phase clinical trials were regorafenib, binimetinib, pazopanib, and sorafenib. The mechanisms of action most common in these trials included FGFR inhibitors, FLT3 inhibitors, MEK inhibitors, and EGFR inhibitors. The cancer-driving proteins most frequently targeted in the trials were GABRB2, MAP2K1, and MAP2K2. Additionally, the cancer types most commonly evaluated in these late-phase clinical trial events were liver cancer, lung cancer, breast cancer, leukemia, and colorectal cancer. This comprehensive therapeutic landscape has enabled us to identify key patterns and trends in cancer treatment research.
Shortest pathways to cancer hallmark phenotypes
After identifying 296 druggable proteins not yet involved in clinical trials, we conducted multi-omics analyses to prioritize the most relevant cancer-driving proteins as potential therapeutic targets across various cancer types70,80,92,93,94. In this context, we employed the CancerGeneNet software and found that 184 (62%) of these proteins showed distance scores indicative of their involvement in the shortest pathways leading to cancer hallmark phenotypes66,67, as detailed in Supplementary Table 11. Figures 7A and B illustrate these druggable proteins and their shortest paths to cancer hallmarks. The top three hallmarks are cell proliferation (with a mean distance score of 1.27 and 154 proteins involved), cell differentiation (1.51; 160), and resistance to cell death (1.55; 157) (Supplementary Table 12). Utilizing the Bonferroni correction test, we observed that these druggable proteins had significantly shorter paths to these cancer hallmark phenotypes (P < 0.001). These findings are highly relevant because the prioritized druggable proteins in this analysis could be crucial targets for focusing new therapeutic strategies on processes such as cell proliferation or resistance to cell death.
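The notion of a distance score along a shortest path can be illustrated with a generic weighted-graph search. The sketch below uses Dijkstra's algorithm on a toy network; it is not the CancerGeneNet implementation, and the node names and edge weights are hypothetical.

```python
import heapq


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm: smallest summed edge weight from source to target."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # target not reachable from source


# Toy directed network with hypothetical nodes and weights.
toy_network = {
    "protein_A": {"protein_B": 0.5, "protein_C": 1.0},
    "protein_B": {"hallmark_X": 1.0},
    "protein_C": {"hallmark_X": 0.25},
}

d = shortest_distance(toy_network, "protein_A", "hallmark_X")
print(d)
```

In this toy case the route through protein_C (1.0 + 0.25) is shorter than the route through protein_B (0.5 + 1.0), so the smaller total is reported as the distance.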
Chemistry-based score
canSAR is a comprehensive knowledgebase dedicated to drug discovery and offers an extensive structure-based ligandability assessment72. Consequently, we retrieved the chemistry-based scores for the previously prioritized 184 proteins. The mean chemistry-based score of these 184 proteins was 69.9%. In our analysis, we considered all proteins with a ligandability score higher than the mean (cutoff > 69.9%), encompassing all proteins with the very high scores and the best proteins with high scores. This analysis enabled us to identify 79 (43%) druggable cancer-driving proteins with the highest ligandability, as shown in Fig. 7C and Supplementary Table 13. Ligandability analysis refers to a protein’s ability to bind efficiently to a drug. High ligandability helps identify and prioritize proteins that can be effective targets for new drugs, thereby increasing the specificity of the drug’s action and reducing the time and cost associated with pharmaceutical development95.
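The mean-based cutoff can be expressed in a few lines. In the sketch below the protein names and scores are hypothetical placeholders rather than canSAR values; only the final line uses numbers reported in the text (79 of 184 proteins).

```python
from statistics import mean

# Hypothetical ligandability scores in percent (placeholders, not canSAR values).
scores = {
    "protein_A": 90.0,
    "protein_B": 75.0,
    "protein_C": 60.0,
    "protein_D": 55.0,
}

# Keep only proteins scoring strictly above the mean of the set.
cutoff = mean(scores.values())
selected = sorted(name for name, s in scores.items() if s > cutoff)
print(cutoff, selected)

# Fraction retained in the study: 79 of the 184 prioritized proteins.
retained_percent = round(100 * 79 / 184)
print(retained_percent)
```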
A pathology atlas for human cancer
We explored the Human Pathology Atlas, developed by the Human Protein Atlas program, and subsequently conducted a Kaplan–Meier analysis to examine the correlation between mRNA and protein expression and patient survival74,75,76,77. This analysis aimed to determine the prognostic significance of 79 highly ligandable, druggable cancer-driving genes/proteins (Supplementary Table 14). Our findings underscore the effectiveness of large-scale systems biology projects that utilize publicly available resources. In this study, we identified the 23 key druggable cancer-driving genes/proteins that demonstrated unfavorable prognostic significance (significant log rank P-value < 0.001) across 16 TCGA PanCancer types. These genes/proteins were CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4 (Fig. 7D and Supplementary Table 15).
Functional enrichment analysis
We conducted a functional enrichment analysis of the 23 key druggable cancer-driving proteins using g:Profiler software78. The Manhattan plot enabled us to identify 64 GO biological processes79 and 2 Reactome signaling pathways63 (Fig. 7E and Supplementary Table 16). The most significant annotations, adjusted with the Benjamini–Hochberg correction and an FDR q-value < 0.001, included cell cycle, cell communication, phosphorylation, immune system process, programmed cell death, cell differentiation, cellular senescence, endocrine resistance, G1 phase, and cyclin D events in G1. Notably, these 23 key druggable cancer-driving proteins are involved in biological processes associated with various therapeutic strategies. These strategies include the inhibition of cellular proliferation96, the inhibition of phosphorylation97, cancer immunotherapy98, activation of programmed cell death99, regulation of senescence100, and evasion of endocrine resistance101.
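As a reminder of how the Benjamini–Hochberg adjustment turns p-values into FDR q-values, a compact sketch follows. It implements the standard step-up procedure and is not g:Profiler's own code; the input p-values are arbitrary example numbers.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Arbitrary example p-values (not enrichment results from this study).
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
print(q)
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the q-values from decreasing as the p-values grow.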
The oncogenic variome of key druggable cancer-driving proteins and their deleterious effects
Figure 8A presents the analysis of 22,320 variants using OncodriveMUT and boostDM to determine the oncogenic variome in the 23 key druggable cancer-driving genes. This analysis identified 1,598 oncogenic variants, with 11 (1%) being previously known and 1,578 (99%) newly predicted. The analysis of deleteriousness scores revealed that 252 (16%) of these oncogenic variants had very high CADD scores, 788 (49%) had high CADD scores, and 506 (32%) had medium CADD scores. The most common types of genetic alterations were missense variants (81%), followed by frameshift (6%), and stop-gained variants (5%). Figure 8B displays box plots that illustrate the deleteriousness scores of the oncogenic variants according to their consequence types. Stop-gained variants exhibited the highest mean CADD score (37.2), followed by splice donor (31.2), splice acceptor (30.9), missense (25.9), frameshift (25.8), start lost (21.2), stop lost (17.7), inframe deletion (17.4), splice region (16.8), and inframe insertion variants (16.7). Lastly, Fig. 8C presents bean plots that rank the key druggable cancer-driving genes based on the highest number of oncogenic variants and their deleteriousness scores (Supplementary Table 17).
Identifying oncogenic variants in cancer-driving genes is crucial for developing targeted therapies102,103,104,105. These therapies are specifically designed to inhibit or modify the function of proteins produced by mutated genes, offering more effective treatment options with potentially fewer side effects compared with traditional chemotherapy106. Moreover, this approach enables personalized precision medicine. By understanding specific genetic and epigenetic alterations in a patient’s tumor, treatments can be tailored to target these changes107,108,109,110. In this context, the identification of oncogenic variants in druggable cancer-driving genes is a fundamental aspect of modern oncology, influencing everything from individual patient treatment to broader aspects of cancer research, ethnicity, and public health initiatives106,111,112.
This integrative approach has identified 23 key druggable cancer-driving proteins (CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4), setting the stage for improved therapeutic targets that could significantly boost the efficacy of clinical trials.
Testing the model’s limitations
Like any model, ours has limitations when used for prediction. Due to the limited data on druggable proteins, all 666 druggable proteins were used as class 1 to train the model. This makes it impossible to obtain an external dataset with druggable proteins to confirm the predictive power of the best model. One way to test the model's limitations is to plot the best protein predictions within the space of the selected features, alongside the druggable proteins and hard-to-drug proteins. Since plotting in 200 dimensions (the number of selected features in the best model) is impractical, we approximate by transforming these 200 dimensions into just 2 PCA components for visualization. Class 0 descriptors (hard-to-drug proteins), class 1 descriptors (druggable proteins), and the descriptors corresponding to the 23 key druggable cancer-driving proteins (predicted proteins) have been converted into standard units as in the original dataset for TC descriptors and transformed into 2 PCA components for visualization in Fig. 9. In the figure, druggable proteins are shown in blue, hard-to-drug proteins in red, and the best predicted proteins in green. The plot indicates that even though the negative class (class 0) contains phosphatase proteins, there is no clear separation between the training classes 1 and 0 within the space of the selected TC descriptors, indicating a complex descriptor space.
Prediction points that fall within regions containing mixed points (both class 1 and class 0 points) may be the most trustworthy. In these regions, the model has been exposed to a more diverse dataset, enabling it to learn to better distinguish the patterns and characteristics that differentiate the two classes. Consequently, predictions in these regions are more likely to be accurate and reliable, as the model has learned more robust and generalizable features for data classification. The majority of the predicted proteins are located in these mixed regions, suggesting they have a higher potential to be future drug targets. In the supplementary material, a researcher can choose another model with a mean AUROC value greater than a specific cutoff (e.g., 0.9), with fewer features and possibly better PCA representation of the predictions. Future studies should use artificial intelligence and docking tools to predict a list of potential current drugs or new ligands.
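The two-component projection used for Fig. 9 can be sketched as below: descriptors are standardized with the scaling learned on the training set and then projected onto two principal components. The matrices are random placeholders, and fitting the PCA on the training descriptors only is an assumption of this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random placeholders for 200-descriptor matrices (not the study's data).
X_train = rng.random((120, 200))  # class 0 and class 1 proteins together
X_pred = rng.random((23, 200))    # predicted proteins

# Standardize with the training-set statistics, then learn two components.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Project both sets into the same two-dimensional space for plotting.
train_2d = pca.transform(scaler.transform(X_train))
pred_2d = pca.transform(scaler.transform(X_pred))

# Share of the total variance retained by the two plotted components.
explained = float(pca.explained_variance_ratio_.sum())
print(train_2d.shape, pred_2d.shape, round(explained, 3))
```

Inspecting the retained variance is worthwhile when reading such a plot, because two components may capture only a small part of a 200-dimensional space, and apparent overlap between classes can partly reflect that loss.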
Repurposing drugs and metabolites
An additional step to confirm the 23 key druggable cancer-driving proteins involves predicting interactions with ChEMBL-approved drugs (2,466 molecules with masses between 100 and 500) through drug repurposing113,114. Using pairs of drug SMILES codes and protein sequences as inputs, a deep learning model called PLAPT evaluated the binding affinity (or negative log10 affinity)86. The model employs pre-trained transformers like ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model. Supplementary Tables 18 and 19, along with the GitHub file titled Supplementary_interactions_gene-drug(byPLAPT), present the affinity values for each drug-protein pair. The mean affinity values (minimum affinities or maximum negative log10 affinities) for all 23 proteins indicate that the top drugs clinically relevant to cancer treatment that can interact with these proteins include: mifepristone (targeting CASP8), pentostatin (BCL10, CASP8, CCNE1, and CDKN2A), afatinib (ACVR1, CDKN2C, and HRAS), alitretinoin (ACVR1, CDKN2C, HRAS, and PREX2), talazoparib (ACVR1, CDKN2C, and HRAS), alpelisib (ACVR1, CDKN2C, HRAS, NBN, PREX2, and SMARCA4), ulipristal acetate (ACVR1, ASXL1, CDKN2C, HRAS, NBN, PREX2, RB1, and SMARCA4), lorlatinib (ACVR1, ASXL1, ATG7, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), piflufolastat (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), pyrvinium pamoate (ASXL1, ATG7, BUB1B, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), and tepotinib hydrochloride (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1) (Fig. 10).
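The relationship between an affinity and its negative log10 value, and the ranking of drugs by their mean value across proteins, can be sketched as follows. This example does not call PLAPT; the drug and protein names, the affinity values, and the assumption that affinities are expressed in molar units are all hypothetical.

```python
import math
from collections import defaultdict


def neg_log10_affinity(affinity_molar):
    """Negative log10 of an affinity given in molar units (assumed unit)."""
    return -math.log10(affinity_molar)


# Hypothetical (drug, protein, affinity in M) triples; placeholders only.
predictions = [
    ("drug_A", "protein_1", 1e-8),
    ("drug_A", "protein_2", 1e-6),
    ("drug_B", "protein_1", 1e-5),
    ("drug_B", "protein_2", 1e-7),
]

# Collect the negative log10 affinities per drug.
per_drug = defaultdict(list)
for drug, _protein, affinity in predictions:
    per_drug[drug].append(neg_log10_affinity(affinity))

# A smaller affinity value gives a larger negative log10, so sort descending.
mean_scores = {drug: sum(v) / len(v) for drug, v in per_drug.items()}
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)
print(mean_scores, ranking)
```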
Mifepristone, a progesterone receptor antagonist, has been explored for its potential in treating glioblastoma, breast cancer, and uveal melanoma due to its ability to act on multiple receptor types, including glucocorticoid and androgen receptors115,116,117. Pentostatin is a chemotherapy drug primarily used for treating hairy cell leukemia and T-cell prolymphocytic leukemia. It is a purine analog that works by inhibiting the enzyme adenosine deaminase, crucial for DNA synthesis and cell replication, leading to the accumulation of deoxyadenosine triphosphate and ultimately causing cell death, particularly in rapidly dividing cancer cells118. Afatinib is an oral medication primarily used for treating non-small cell lung cancer. It functions as a tyrosine kinase inhibitor, targeting and blocking the EGFR protein as well as other members of the ErbB family, including HER2 and ErbB4119. Alitretinoin, a derivative of vitamin A, is used in cancer treatment primarily for Kaposi sarcoma. It binds to and activates retinoid receptors (RAR and RXR), which regulate gene expression involved in cell differentiation and proliferation, helping to inhibit the growth of Kaposi sarcoma cells120. Talazoparib works by inhibiting PARP enzymes, which play a crucial role in DNA repair. By blocking these enzymes, talazoparib prevents cancer cells from repairing their DNA, leading to cell death, especially in cells with BRCA1/2 mutations that already have compromised DNA repair mechanisms43,121. Alpelisib is an oral medication used in combination with fulvestrant to treat hormone receptor-positive, HER2-negative advanced or metastatic breast cancer with PIK3CA mutations. It works as a PI3K inhibitor, specifically targeting the alpha isoform of the enzyme, which is crucial in the PI3K/AKT signaling pathway involved in cancer cell growth and survival122. Ulipristal acetate is a progesterone receptor modulator implicated in the proliferation and growth of certain cancer cells. It competes with progesterone, thereby inhibiting the progesterone-induced proliferation of breast cancer cells, making it a candidate for reducing breast cancer risk, especially in individuals with BRCA1/2 mutations123. Lorlatinib inhibits ALK and ROS1 kinases, which are involved in cancer cell growth and survival. It is effective against multiple ALK mutations that confer resistance to first- and second-generation ALK inhibitors124. Piflufolastat F-18 binds to the prostate-specific membrane antigen (PSMA), a protein overexpressed on the surface of most prostate cancer cells. Once bound, the radioactive tracer emits positrons detected by a PET scanner, revealing the location of PSMA-positive lesions in the body125. Pyrvinium pamoate is an androgen receptor antagonist that targets multiple cellular pathways. It disrupts mitochondrial function by inhibiting electron transport chain complexes I and II, reducing mitochondrial fitness and increasing glycolysis, especially under hypoglycemic conditions often found in tumors. It also reduces WNT and Hedgehog signaling pathways, crucial for cancer cell proliferation and survival126,127,128,129. Lastly, tepotinib hydrochloride is a tyrosine kinase inhibitor targeting the MET receptor. By inhibiting this receptor, it interferes with cancer cell growth and survival pathways, which are crucial for the proliferation and metastasis of MET-altered cancer cells.
The last screening for interactions was conducted for the HRAS protein (P01112) using 217,776 molecules from the HMDB (see all affinities in Supplementary Table 20 and the GitHub file titled Supplimentary_affinities_hmdb_HRAS-P01112). Among the best potential interactions between HRAS and metabolites, the following were identified: cyanidin 5-O-beta-d-glucoside (HMDB0304305), chlorophyll (HMDB0303604), delphinidin 3-(3″-p-coumaroylglucoside) (HMDB0030099), cis-neoxanthin (HMDB0302969), verteporfin (HMDB0014603), pinotin A (HMDB0029240), benztropine (HMDB0014390), adapalene (HMDB0014355), inulin (HMDB0014776), and ceftriaxone (HMDB0015343). Future studies involving molecular docking, molecular dynamics, or other AI-based interaction prediction models will be needed to further confirm these interactions.
Conclusions
This study presents an innovative machine learning-based method for predicting druggable proteins that drive cancer, utilizing three sets of protein sequence descriptors: amino acid composition, di-amino acid composition, and tri-amino acid composition. These descriptors, chosen for their ability to capture essential information about protein sequences, have demonstrated robust performance across various machine learning classifiers.
Our results emphasize the effectiveness of these descriptors in balancing model performance, computational efficiency, and biological relevance. Specifically, the use of SVM classifiers with 200 TC descriptors selected by LinearSVC achieved high predictive accuracy. The model's robustness was validated by achieving high AUROC values, with the best performance reaching an AUROC of 0.992 using SVM with 8,000 pure TC descriptors.
The practical utility of this model was demonstrated by predicting the druggability of 2,339 cancer-driving proteins, with 88.9% predicted to have druggable activity. Validation using ChEMBL evidence scores confirmed the model's accuracy in differentiating druggable from hard-to-drug proteins, highlighting its potential in drug discovery and therapeutic development.
Additionally, integrating multi-omics analyses and chemistry-based scores identified 23 key druggable cancer-driving proteins, prioritized based on their involvement in critical cancer-related pathways and ligandability. Analyzing these proteins and their interaction with clinically relevant drugs provides valuable insights for developing targeted cancer therapies.
While our study demonstrates the model's capabilities, it also acknowledges limitations, such as the challenge of validating predictions with external datasets due to limited data on druggable proteins. Nonetheless, the drug repurposing analysis identified high-affinity interactions between the 23 key druggable cancer-driving proteins and 11 clinically relevant FDA-approved drugs. Future research should aim to enhance model validation using artificial intelligence and docking tools to confirm predicted interactions with current drugs or new ligands, facilitating the translation of repurposed drugs into clinical trials.
In summary, this study provides a comprehensive framework for predicting druggable cancer-driving proteins, combining computational efficiency with biological relevance. The integration of machine learning, multi-omics analyses, and chemistry-based assessments paves the way for identifying and prioritizing new therapeutic targets, advancing precision oncology and personalized medicine.
Data availability
All data generated or analyzed during this study are included in this published article (and its Supplementary Information files), and the scripts are available as a free repository at https://github.com/muntisa/machine-learning-for-druggable-proteins.
References
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Barbarino, J. M., Whirl-Carrillo, M., Altman, R. B. & Klein, T. E. PharmGKB: A worldwide resource for pharmacogenomic information. Wiley Interdiscip. Rev. Syst. Biol. Med. 10, e1417 (2018).
Venter, J. C., Smith, H. O. & Adams, M. D. The sequence of the human genome. Clin. Chem. 61, 1207–1208 (2015).
Uhlén, M. et al. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Kandoi, G., Acencio, M. L. & Lemke, N. Prediction of druggable proteins using machine learning and systems biology: A mini-review. Front. Physiol. 6, 366 (2015).
Brown, D. & Superti-Furga, G. Rediscovering the sweet spot in drug discovery. Drug Discov. Today 8, 1067–1077 (2003).
Cheng, A. C. et al. Structure-based maximal affinity model predicts small-molecule druggability. Nat. Biotechnol. 25, 71–75 (2007).
Gashaw, I., Ellinghaus, P., Sommer, A. & Asadullah, K. What makes a good drug target?. Drug Discov. Today 17(Suppl), S24-30 (2012).
Oprea, T. I. et al. Unexplored therapeutic opportunities in the human genome. Nat. Rev. Drug Discov. 17, 317–332 (2018).
Wishart, D. S. et al. DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006).
Wishart, D. S. et al. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906 (2008).
Guerrero, S. et al. Analysis of racial/ethnic representation in select basic and applied cancer research studies. Sci. Rep. 8, 13978 (2018).
García-Cárdenas, J. M. et al. Toward equitable precision oncology: Monitoring racial and ethnic inclusion in genomics and clinical trials. JCO Precis. Oncol. 8, e2300398 (2024).
Costa, P. R., Acencio, M. L. & Lemke, N. A machine learning approach for genome-wide prediction of morbid and druggable human genes based on systems-level data. BMC Genomics 11(Suppl 5), S9 (2010).
Blanco, J. L., Porto-Pazos, A. B., Pazos, A. & Fernandez-Lozano, C. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci. Rep. 8, 15688 (2018).
Wei, L., Zhou, C., Chen, H., Song, J. & Su, R. ACPred-FL: A sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34, 4007–4016 (2018).
Martínez-Arzate, S. G. et al. PTML model for proteome mining of B-cell epitopes and theoretical-experimental study of Bm86 Protein sequences from Colima, Mexico. J. Proteome Res. 16, 4093–4103 (2017).
Fernandez-Lozano, C. et al. Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models. J. Theor. Biol. 384, 50–58 (2015).
Munteanu, C. R. et al. LECTINPred: Web server that uses complex networks of protein structure for prediction of lectins with potential use as cancer biomarkers or in parasite vaccine design. Mol. Inform. 33, 276–285 (2014).
Fernández-Blanco, E., Aguiar-Pulido, V., Munteanu, C. R. & Dorado, J. Random Forest classification based on star graph topological indices for antioxidant proteins. J. Theor. Biol. 317, 331–337 (2013).
Zhu, M. et al. The analysis of the drug-targets based on the topological properties in the human protein-protein interaction network. J. Drug Target. 17, 524–532 (2009).
Jeon, J. et al. A systematic approach to identify novel cancer drug targets using machine learning, inhibitor design and high-throughput screening. Genome Med. 6, 57 (2014).
Li, Z.-C. et al. Large-scale identification of potential drug targets based on the topological features of human protein-protein interaction network. Anal. Chim. Acta 871, 18–27 (2015).
Laenen, G., Thorrez, L., Börnigen, D. & Moreau, Y. Finding the targets of a drug by integration of gene expression data with a protein interaction network. Mol. Biosyst. 9, 1676–1685 (2013).
Emig, D. et al. Drug target prediction and repositioning using an integrated network-based approach. PLoS ONE 8, e60618 (2013).
Yao, L. & Rzhetsky, A. Quantitative systems-level determinants of human genes targeted by successful drugs. Genome Res. 18, 206–213 (2008).
Yildirim, M. A., Goh, K.-I., Cusick, M. E., Barabási, A.-L. & Vidal, M. Drug-target network. Nat. Biotechnol. 25, 1119–1126 (2007).
Cao, D.-S., Xiao, N., Xu, Q.-S. & Chen, A. F. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics 31, 279–281 (2015).
Hao, J. & Ho, T. K. Machine learning made easy: A review of Scikit-learn package in python programming language. J. Educ. Behav. Stat. 44, 348–361 (2019).
López-Cortés, A. et al. Prediction of breast cancer proteins involved in immunotherapy, metastasis, and RNA-binding using molecular descriptors and artificial neural networks. Sci. Rep. 10, 8515 (2020).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, 21–27 (1967).
Mika, S., Ratsch, G., Weston, J., Scholkopf, B. & Mullers, K. R. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468) 41–48 https://doi.org/10.1109/NNSP.1999.788121 (IEEE, 1999).
Patle, A. & Chouhan, D. S. SVM kernel functions for classification. In 2013 International Conference on Advances in Technology and Engineering (ICATE) 1–9 https://doi.org/10.1109/ICAdTE.2013.6524743 (IEEE, 2013).
Peduzzi, P., Concato, J., Kemper, E., Holford, T. R. & Feinstein, A. R. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49, 1373–1379 (1996).
White, B. W. & Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Am. J. Psychol. 76, 705 (1963).
Swain, P. H. & Hauska, H. The decision tree classifier: Design and potential. IEEE Trans. Geosci. Electron. 15, 142–147 (1977).
Breiman, L. Random Forests (Springer Science and Business Media LLC, 2001). https://doi.org/10.1023/a:1010933404324.
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16 785–794 https://doi.org/10.1145/2939672.2939785 (ACM Press, 2016).
Friedman, J. H. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378 (2002).
Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform. Theory 14, 55–63 (1968).
Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
Jolliffe, I. Principal component analysis. In Encyclopedia of Statistics in Behavioral Science (eds Everitt, B. S. & Howell, D. C.) (Wiley, 2005). https://doi.org/10.1002/0470013192.bsa501.
Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).
Corsello, S. M. et al. The Drug Repurposing Hub: A next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Tonks, N. K. Protein tyrosine phosphatases: From genes, to function, to disease. Nat. Rev. Mol. Cell Biol. 7, 833–846 (2006).
Brautigan, D. L. Protein Ser/Thr phosphatases—The ugly ducklings of cell signalling. FEBS J. 280, 324–345 (2013).
Fahs, S., Lujan, P. & Köhn, M. Approaches to study phosphatases. ACS Chem. Biol. 11, 2944–2961 (2016).
Xie, X. et al. Recent advances in targeting the “undruggable” proteins: From drug discovery to clinical trials. Signal Transduct. Target. Ther. 8, 335 (2023).
Repana, D. et al. The Network of Cancer Genes (NCG): A comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol. 20, 1 (2019).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. JAIR 16, 321–357 (2002).
Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30, 1145–1159 (1997).
Ochoa, D. et al. Open Targets Platform: Supporting systematic drug-target identification and prioritisation. Nucleic Acids Res. 49, D1302–D1310 (2021).
Davies, M. et al. ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Res. 43, W612–W620 (2015).
Mendez, D. et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Ghoussaini, M. et al. Open Targets Genetics: Systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 49, D1311–D1320 (2021).
Cook, C. E. et al. The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Res. 44, D20–D26 (2016).
Landrum, M. J. et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Martin, A. R. et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat. Genet. 51, 1560–1565 (2019).
Sondka, Z. et al. The COSMIC Cancer Gene Census: Describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).
Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer 20, 555–572 (2020).
Tamborero, D. et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 10, 25 (2018).
Iorio, F. et al. Pathway-based dissection of the genomic heterogeneity of cancer hallmarks’ acquisition with SLAPenrich. Sci. Rep. 8, 6713 (2018).
Fabregat, A. et al. The Reactome pathway Knowledgebase. Nucleic Acids Res. 44, D481–D487 (2016).
Smedley, D. et al. PhenoDigm: Analyzing curated annotations to associate animal models with human diseases. Database 2013, bat025 (2013).
Carvalho-Silva, D. et al. Open Targets Platform: New developments and updates two years on. Nucleic Acids Res. 47, D1056–D1065 (2019).
Iannuccelli, M. et al. CancerGeneNet: Linking driver genes to cancer hallmarks. Nucleic Acids Res. 48, D416–D421 (2020).
Lo Surdo, P. et al. SIGNOR 3.0, the SIGnaling network open resource 3.0: 2022 update. Nucleic Acids Res. 51, D631–D637 (2023).
Ryan, D. P. & Matthews, J. M. Protein-protein interactions in human disease. Curr. Opin. Struct. Biol. 15, 441–446 (2005).
Szklarczyk, D. et al. The STRING database in 2021: Customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
Ramos-Medina, M. J. et al. CardiOmics signatures reveal therapeutically actionable targets and drugs for cardiovascular diseases. Heliyon 10, e23682 (2024).
Hanahan, D. Hallmarks of cancer: New dimensions. Cancer Discov. 12, 31–46 (2022).
Mitsopoulos, C. et al. canSAR: Update to the cancer translational research and drug discovery knowledgebase. Nucleic Acids Res. 49, D1074–D1082 (2021).
UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Uhlen, M. et al. A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).
Thul, P. J. & Lindskog, C. The human protein atlas: A spatial map of the human proteome. Protein Sci. 27, 233–244 (2018).
Simon, R., Mirlacher, M. & Sauter, G. Immunohistochemical analysis of tissue microarrays. Methods Mol. Biol. 664, 113–126 (2010).
Zhang, Q. et al. Identification of potential diagnostic and prognostic biomarkers for prostate cancer. Oncol. Lett. 18, 4237–4245 (2019).
Raudvere, U. et al. g:Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 47, W191–W198 (2019).
The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021).
López-Cortés, A. et al. Identification of key proteins in the signaling crossroads between wound healing and cancer hallmark phenotypes. Sci. Rep. 11, 17245 (2021).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Muiños, F., Martínez-Jiménez, F., Pich, O., Gonzalez-Perez, A. & Lopez-Bigas, N. In silico saturation mutagenesis of cancer genes. Nature 596, 428–432 (2021).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).
Rose, T., Monti, N., Anand, N. & Shen, T. PLAPT: Protein-ligand binding affinity prediction using pretrained transformers. bioRxiv https://doi.org/10.1101/2024.02.08.575577 (2024).
Wishart, D. S. et al. HMDB 5.0: The human metabolome database for 2022. Nucleic Acids Res. 50, D622–D631 (2022).
Cunningham, M. et al. PINNED: Identifying characteristics of druggable human proteins using an interpretable neural network. J. Cheminform. 15, 64 (2023).
Wang, C. et al. Predicting drug-target interactions with electrotopological state fingerprints and amphiphilic pseudo amino acid composition. Int. J. Mol. Sci. 21, 5694 (2020).
Chu, H. & Liu, T. Comprehensive research on druggable proteins: From PSSM to pre-trained language models. Int. J. Mol. Sci. 25, 4507 (2024).
Vernone, A., Berchialla, P. & Pescarmona, G. Human protein cluster analysis using amino acid frequencies. PLoS ONE 8, e60220 (2013).
Pérez-Villa, A. et al. Integrated multi-omics analysis reveals the molecular interplay between circadian clocks and cancer pathogenesis. Sci. Rep. 13, 14198 (2023).
López-Cortés, A. et al. The close interaction between hypoxia-related proteins and metastasis in pancarcinomas. Sci. Rep. 12, 11100 (2022).
López-Cortés, A. et al. Gene prioritization, communality analysis, networking and metabolic integrated pathway to better understand breast cancer pathogenesis. Sci. Rep. 8, 16679 (2018).
Wang, Y. et al. Expedited mapping of the ligandable proteome using fully functionalized enantiomeric probe pairs. Nat. Chem. 11, 1113–1123 (2019).
Gross, S. M. et al. Analysis and modeling of cancer drug responses using cell cycle phase-specific rate effects. Nat. Commun. 14, 3450 (2023).
Cohen, P., Cross, D. & Jänne, P. A. Kinase drug discovery 20 years after imatinib: Progress and future directions. Nat. Rev. Drug Discov. 20, 551–569 (2021).
Waldman, A. D., Fritz, J. M. & Lenardo, M. J. A guide to cancer immunotherapy: From T cell basic science to clinical practice. Nat. Rev. Immunol. 20, 651–668 (2020).
Eastman, A. Activation of programmed cell death by anticancer agents: Cisplatin as a model system. Cancer Cells 2, 275–280 (1990).
Wang, L., Lankhorst, L. & Bernards, R. Exploiting senescence for the treatment of cancer. Nat. Rev. Cancer 22, 340–355 (2022).
Hanker, A. B., Sudhan, D. R. & Arteaga, C. L. Overcoming endocrine resistance in breast cancer. Cancer Cell 37, 496–513 (2020).
Varela, N. M. et al. A new insight for the identification of oncogenic variants in breast and prostate cancers in diverse human populations, with a focus on latinos. Front. Pharmacol. 12, 630658 (2021).
Yumiceba, V. et al. Oncology and pharmacogenomics insights in polycystic ovary syndrome: An integrative analysis. Front. Endocrinol. 11, 585130 (2020).
Paz-Y-Miño, C. et al. Positive association of the androgen receptor CAG repeat length polymorphism with the risk of prostate cancer. Mol. Med. Report. 14, 1791–1798 (2016).
Echeverría-Garcés, G. et al. Gastric cancer actionable genomic alterations across diverse populations worldwide and pharmacogenomics strategies based on precision oncology. Front. Pharmacol. 15, 1373007 (2024).
López-Cortés, A. et al. Pharmacogenomics, biomarker network, and allele frequencies in colorectal cancer. Pharmacogenomics J. 20, 136–158 (2020).
Salas-Hernández, A. et al. An updated examination of the perception of barriers for pharmacogenomics implementation and the usefulness of drug/gene pairs in Latin America and the Caribbean. Front. Pharmacol. 14, 1175737 (2023).
Quinones, L. A. et al. Perception of the usefulness of drug/gene pairs and barriers for pharmacogenomics in Latin America. Curr. Drug Metab. 15, 202–208 (2014).
López-Cortés, A. et al. OncoOmics approaches to reveal essential genes in breast cancer: A panoramic view from pathogenesis to precision medicine. Sci. Rep. 10, 5285 (2020).
Ocaña-Paredes, B. et al. The pharmacoepigenetic paradigm in cancer treatment. Front. Pharmacol. 15, 1381168 (2024).
Pirmohamed, M. Pharmacogenomics: Current status and future perspectives. Nat. Rev. Genet. 24, 350–362 (2023).
López-Cortés, A., Guerrero, S., Redal, M. A., Alvarado, A. T. & Quiñones, L. A. State of art of cancer pharmacogenomics in Latin American populations. Int. J. Mol. Sci. 18, 639 (2017).
Zdrazil, B. et al. The ChEMBL Database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).
López-Cortés, A. et al. In silico analyses of immune system protein interactome network, single-cell RNA sequencing of human tissues, and artificial neural networks reveal potential therapeutic targets for drug repurposing against COVID-19. Front. Pharmacol. 12, 598925 (2021).
Llaguno-Munive, M., Vazquez-Lopez, M. I., Jurado, R. & Garcia-Lopez, P. Mifepristone repurposing in treatment of high-grade gliomas. Front. Oncol. 11, 606907 (2021).
Alvarez, P. B. et al. Anticancer effects of mifepristone on human uveal melanoma cells. Cancer Cell Int. 21, 607 (2021).
Elía, A. et al. Beneficial effects of mifepristone treatment in patients with breast cancer selected by the progesterone receptor isoform ratio: Results from the MIPRA trial. Clin. Cancer Res. 29, 866–877 (2023).
Cassileth, P. A. et al. Pentostatin induces durable remissions in hairy cell leukemia. J. Clin. Oncol. 9, 243–246 (1991).
Harada, Y. et al. Anti-cancer effect of afatinib, dual inhibitor of HER2 and EGFR, on novel mutation HER2 E401G in models of patient-derived cancer. BMC Cancer 23, 77 (2023).
Htet, K. Z., Waul, M. A. & Leslie, K. S. Topical treatments for Kaposi sarcoma: A systematic review. Skin Health Dis. 2, e107 (2022).
Litton, J. K. et al. Talazoparib in patients with advanced breast cancer and a germline BRCA mutation. N. Engl. J. Med. 379, 753–763 (2018).
André, F. et al. Alpelisib for PIK3CA-mutated, hormone receptor-positive advanced breast cancer. N. Engl. J. Med. 380, 1929–1940 (2019).
Bartlett, T. E. et al. Antiprogestins reduce epigenetic field cancerization in breast tissue of young healthy women. Genome Med. 14, 64 (2022).
Kumar, A. et al. Lorlatinib in the second line and beyond for ALK positive lung cancer: Real-world data from resource-constrained settings. BJC Rep. 2, 35 (2024).
Arafa, A. T. et al. Impact of piflufolastat F-18 PSMA PET imaging on clinical decision-making in prostate cancer across disease states: A retrospective review. Prostate 83, 863–870 (2023).
Ponzini, F. M. et al. Repurposing the FDA-approved anthelmintic pyrvinium pamoate for pancreatic cancer treatment: Study protocol for a phase I clinical trial in early-stage pancreatic ductal adenocarcinoma. BMJ Open 13, e073839 (2023).
Tomitsuka, E., Kita, K. & Esumi, H. An anticancer agent, pyrvinium pamoate inhibits the NADH-fumarate reductase system—a unique mitochondrial energy metabolism in tumour microenvironments. J. Biochem. 152, 171–183 (2012).
Ishii, I., Harada, Y. & Kasahara, T. Reprofiling a classical anthelmintic, pyrvinium pamoate, as an anti-cancer drug targeting mitochondrial respiration. Front. Oncol. 2, 137 (2012).
Schultz, C. W. & Nevler, A. Pyrvinium pamoate: Past, present, and future as an anti-cancer drug. Biomedicines 10, 3249 (2022).
Acknowledgements
This work was supported by Universidad de Las Américas, Ecuador; by grant ED431C 2022/46 (Competitive Reference Groups, GRC), funded by the EU and Xunta de Galicia, Spain; and by the Latin American Society of Pharmacogenomics and Personalized Medicine (SOLFAGEM).
Author information
Contributions
A.L.C. and C.R.M. conceived and conceptualized the study and wrote the manuscript. A.L.C., A.C.A., G.E.G., P.E.E., M.P.A., N.E., J.B.M., C.M.C.S., and C.R.M. curated the data and prepared the supplementary data. C.R.M. and J.D. built the machine learning models. G.E.G., J.D., A.P., H.G.D., Y.P.C., and E.T. gave conceptual advice, provided scientific input, and edited the final version of the manuscript. A.L.C. and C.R.M. supervised the project. A.L.C. acquired the funding. All authors reviewed and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
López-Cortés, A., Cabrera-Andrade, A., Echeverría-Garcés, G. et al. Unraveling druggable cancer-driving proteins and targeted drugs using artificial intelligence and multi-omics analyses. Sci Rep 14, 19359 (2024). https://doi.org/10.1038/s41598-024-68565-7