Unraveling druggable cancer-driving proteins and targeted drugs using artificial intelligence and multi-omics analyses

López-Cortés, Andrés; Cabrera-Andrade, Alejandro; Echeverría-Garcés, Gabriela; Echeverría-Espinoza, Paulina; Pineda-Albán, Micaela; Elsitdie, Nicole; Bueno-Miño, José; Cruz-Segundo, Carlos M.; Dorado, Julian; Pazos, Alejandro; Gonzáles-Díaz, Humberto; Pérez-Castillo, Yunierkis; Tejera, Eduardo; Munteanu, Cristian R.

doi:10.1038/s41598-024-68565-7

Article
Open access
Published: 21 August 2024

Volume 14, article number 19359, (2024)

Scientific Reports

Andrés López-Cortés¹,
Alejandro Cabrera-Andrade^2,3,
Gabriela Echeverría-Garcés^4,5,
Paulina Echeverría-Espinoza¹,
Micaela Pineda-Albán¹,
Nicole Elsitdie¹,
José Bueno-Miño¹,
Carlos M. Cruz-Segundo^6,7,
Julian Dorado^6,8,
Alejandro Pazos^6,8,9,
Humberto Gonzáles-Díaz^10,11,
Yunierkis Pérez-Castillo²,
Eduardo Tejera² &
…
Cristian R. Munteanu^6,8,9

Abstract

The druggable proteome refers to proteins that can bind to small molecules with appropriate chemical affinity, inducing a favorable clinical response. Predicting druggable proteins through screening and in silico modeling is imperative for drug design. To contribute to this field, we developed an accurate predictive classifier for druggable cancer-driving proteins using amino acid composition descriptors of protein sequences and 13 machine learning linear and non-linear classifiers. The optimal classifier was achieved with the support vector machine method, utilizing 200 tri-amino acid composition descriptors. The high performance of the model is evident from an area under the receiver operating characteristics (AUROC) of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The machine learning prediction model was enhanced with multi-omics approaches, including the target-disease evidence score, the shortest pathways to cancer hallmarks, structure-based ligandability assessment, unfavorable prognostic protein analysis, and the oncogenic variome. Additionally, we performed a drug repurposing analysis to identify drugs with the highest affinity capable of targeting the best predicted proteins. As a result, we identified 79 key druggable cancer-driving proteins with the highest ligandability, and 23 of them demonstrated unfavorable prognostic significance across 16 TCGA PanCancer types: CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4. Moreover, we prioritized 11 clinically relevant drugs targeting these proteins. This strategy effectively predicts and prioritizes biomarkers, therapeutic targets, and drugs for in-depth studies in clinical trials. Scripts are available at https://github.com/muntisa/machine-learning-for-druggable-proteins.

Introduction

The human genome comprises approximately 19,890 protein-coding genes, yet not all these proteins serve as suitable drug targets^1,2,3. The druggable proteome refers to the subset of proteins capable of binding to an antibody or small molecule with the requisite chemical properties and affinity⁴. Druggability describes the feature of a target molecule, wherein it induces a favorable clinical response upon interaction with a drug-like compound⁵. It is noteworthy that an estimated 60% of small molecule drug discovery projects falter during the hit-to-lead phase due to the target’s lack of druggability^5,6. Predicting a target’s druggability early in drug discovery is thus crucial. Only about 10% of the human genome consists of druggable targets, and merely half of these are disease-relevant⁷.

According to Gashaw et al., for a drug target to be considered ideal, it should possess specific properties: an unimpeded operation characterized by the absence of competitive binding, the presence of a biomarker that facilitates monitoring its efficacy, differential expression throughout the body to enable precise targeting, minimal interference with physiological conditions, the ability to alter a disease, and suitability for high-throughput screening^7,8.

In the context of the human genome, which consists of numerous protein-coding genes, roughly 3,000 are believed to be part of the druggable genome. However, drugs that have received approval from the US Food & Drug Administration (FDA) only target a meager twenty percent of these proteins⁹. To provide more specifics, the FDA has approved 672 drugs, each classified based on its protein class: enzymes (260; 39%), transporters (149; 22%), G-protein coupled receptors (98; 15%), CD markers (71; 11%), voltage-gated ion channels (49; 7%), and nuclear receptors (24; 4%), to name a few^10,11. It is essential to note that drugs rendering the protein target inactive are termed antagonists, while those that stimulate the protein target are labeled agonists. In terms of the cellular locations of the targets for these FDA-approved drugs, various prediction methods for transmembrane and signal proteins suggest that 250 (37%) were integral to the membrane, 201 (30%) were intracellular, 101 (15%) existed as single-pass transmembrane proteins, 83 (12%) were secreted, 28 (4%) appeared as combined membrane-bound and secreted isoforms, and 9 (1%) were simultaneously integral to the membrane and exhibited a single-pass membrane structure^4,10,11.

The limited number of drugs approved to date can be attributed to several factors, including the intricacies of experimenting with all proteins and nucleic acid fragments, a lack of information related to ethnicity, and a limited understanding of many diseases at the molecular level^12,13. Given these challenges, there is a significant demand for computational models that can accurately predict drug targets on a genome-wide scale, ensuring both high sensitivity and specificity⁵. Furthermore, leveraging extensive data sources, such as metabolic and gene regulatory networks, protein–protein interactions, multi-omics datasets, and gene expression profiles, in conjunction with data mining tools like machine learning (ML), can aid in constructing predictive models. These models can discern biologically relevant patterns that indicate druggability in potential drug targets¹⁴.

Several classification models have been developed for predicting protein activities, including anti-angiogenic¹⁵, anti-cancer¹⁶, enzyme classes¹⁵, epitopes¹⁷, signaling¹⁸, lectins¹⁹, antioxidants²⁰, and druggability^14,21,22,23,24,25,26,27. Thus, the main aim of our study was to build an effective ML classifier to forecast the druggability of cancer-driving proteins, validate the predictions through integrated multi-omics approaches, propose potential druggable proteins per cancer type, and propose potential targeted drugs and metabolites.

Methods

Machine learning prediction model

Figure 1 presents the general flow chart of the proposed methodology to obtain a classifier for druggable proteins. Firstly, we constructed a database with druggable proteins and ‘hard-to-drug’ proteins. Secondly, three families of protein composition descriptors were calculated using Rcpi (R package)²⁸: 20 amino acid composition (AC), 400 di-amino acid composition (DC), and 8,000 tri-amino acid composition (TC). In the next step, Jupyter notebooks with Python scikit-learn²⁹ were used to test 13 types of ML classifiers by combining the 3 families of descriptors (AC, DC, TC) with five different feature selection methods and with different parameters. The employed classifiers include Gaussian Naive Bayes (GNB)³⁰, k-nearest neighbors algorithm (KNN)³¹, linear discriminant analysis (LDA)³², support vector machine (SVM) both linear and non-linear based on radial basis functions (RBF)³³, logistic regression (LR)³⁴, multilayer perceptron (MLP) or neural network with 20 neurons in one hidden layer³⁵, decision tree (DT)³⁶, random forest (RF)³⁷, XGBoost (XGB), an optimized distributed gradient boosting library³⁸, Gradient Boosting for classification (GB)³⁹, AdaBoost classifier (AdaB)⁴⁰, and Bagging classifier⁴¹. The feature selection methods utilized were principal component analysis (PCA)⁴², feature selection based on a percentile of the highest scores with f_classif (ANOVA F-value between label/feature for classification tasks), a feature selector removing features with variance below a threshold, linear support vector classification, and the extra-trees classifier.
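
The three descriptor families can be illustrated with a short sketch. The study computed them with Rcpi in R; the Python function below is only an assumed equivalent that counts overlapping k-mers and divides by the number of windows, and Rcpi may handle normalization or non-standard residues differently. The toy sequence is hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence, k):
    """Relative frequency of every possible k-mer (k=1: AC, k=2: DC, k=3: TC)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Overlapping windows of length k along the sequence.
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


toy = "MKHMEHMESSS"  # hypothetical toy sequence, not a real protein
ac = composition(toy, 1)
dc = composition(toy, 2)
tc = composition(toy, 3)
print(len(ac), len(dc), len(tc))  # 20 400 8000
```

Each protein therefore becomes a fixed-length numeric vector regardless of its sequence length, which is what allows the same classifiers to be applied to all proteins.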

GNB is a probabilistic classifier based on Bayes' theorem that assumes all features are conditionally independent given the class³⁰. KNN is a non-parametric classifier that assigns an unclassified sample to the majority class among its k nearest samples in the training set (k = 3)³¹. LDA is a fundamental linear classifier that fits class-conditional densities to the input features and applies Bayes’ rule³². The linear SVM separates the classes with a maximum-margin hyperplane in the original feature space³³, while for nonlinear problems the SVM uses a Gaussian radial basis function kernel, which implicitly maps the input features into a higher-dimensional space. LR, another linear classifier, estimates binary response probabilities from a weighted combination of the features³⁴. The MLP is a feed-forward neural network of artificial neurons, here with a single hidden layer, that can use linear or nonlinear activation functions³⁵. DT builds decision rules from the input features, with each classification rule defined as a path from the root to a leaf³⁶. RF, an ensemble method, combines decision trees trained in parallel; the individual trees have low bias and high variance, and averaging many weakly correlated trees reduces the variance of the ensemble³⁷. XGB, an ensemble method, uses sequential weak trees to improve classification performance³⁸. GB is a boosting method that fits weak classifiers sequentially³⁹. AdaB is a meta-estimator that first fits a classifier on the original dataset and then fits additional copies of the classifier with increased weights on misclassified instances⁴⁰. The Bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and aggregates their predictions⁴¹.

The ML prediction model was constructed using two protein sets. The positive set comprised 666 druggable proteins with FDA-approved drugs, as per the DrugBank database (www.drugbank.ca)⁴³ and the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing)⁴⁴. In contrast, the negative protein set consisted of 219 ‘hard-to-drug’ protein phosphatases, which were previously referred to as ‘undruggable’ targets⁴⁵. As noted by Xie et al., kinases are classic examples of druggable targets that play a significant role in modulating cell motility. Conversely, phosphatases act as their counterparts, critically regulating cellular dynamics by removing phosphate from proteins, including serine, threonine, and tyrosine residues^46,47,48. Detailed information on the gene symbol and gene ID for all druggable and ‘hard-to-drug’ proteins can be found in Supplementary Tables 1 and 2, while Supplementary Tables 3 and 4 provide the FASTA sequences of all proteins analyzed in this study. Lastly, the final ML prediction model was applied to scan 2,339 cancer-driving proteins sourced from the Network of Cancer Genes⁴⁹ (Supplementary Table 5).

After computing the amino acid composition descriptors, the datasets comprised 885 proteins. Proteins in the druggable class were labeled as 1, while those in the ‘hard-to-drug’ class were labeled as 0. Due to the imbalance in the datasets, we employed the synthetic minority over-sampling technique (SMOTE) as described by Chawla et al.⁵⁰. We used a threefold cross-validation (CV) approach to construct the ML classifiers. For each fold, a sequential pipeline was executed: (a) Scaling: the training set was standardized using the StandardScaler, and the test set was transformed to match the same scale. (b) Feature Selection or Dimension Reduction: the dimensionality of the training set was either reduced using a feature selection method, such as LinearSVC, or through a dimension reduction technique like Principal Component Analysis (PCA). (c) Cross-Validation Evaluation: The cross_val_score method was employed to compute the area under the receiver operating characteristic (AUROC) scores across the 13 ML methods for all splits. (d) Mean values and standard deviations (SD) of the AUROC scores for each ML classifier were calculated and displayed for the test subset⁵¹.
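
The oversampling idea behind SMOTE can be sketched in a few lines. This is a simplified, standard-library illustration of the interpolation step described by Chawla et al., not the implementation used in the study; the function name, the choice of `k`, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical two-feature minority class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new=6, k=2)
print(len(synthetic))  # 6
```

With 666 druggable and 219 'hard-to-drug' proteins, fully balancing the classes would require 447 synthetic minority samples, assuming the classes were oversampled to equal size.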

The best model to be used for predictions was chosen using criteria such as mean AUROC, SD of AUROC, the number of features, and the type of model features (original or transformed). All the results obtained can be reproduced by using the scripts available at the GitHub repository: https://github.com/muntisa/machine-learning-for-druggable-proteins.

In addition, the importance of the features for the best model was analyzed using a function that calculates the permutation feature importance. This is done by randomly shuffling each feature and measuring the decrease in the model's performance. This process was repeated 10 times for each feature, and the average importance value was calculated. The result was a list of feature importances, which can be used to identify the most important features in the model. The importance values were normalized (values between 0 and 1), and the top 10% most important features were highlighted. Additionally, an extra analysis of the single amino acid frequencies in all selected features of the best model was conducted, with values also normalized between 0 and 1.
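
The permutation procedure described above can be sketched as follows. This is a minimal standard-library illustration, not the function from the study's notebooks: accuracy is assumed as the performance measure, min-max scaling is assumed for the normalization to [0, 1], and the toy model and data are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


def normalize(values):
    """Min-max scale to [0, 1]; an assumed reading of 'normalized'."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Hypothetical toy model: only feature 0 determines the label.
X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]
imp = permutation_importance(lambda r: r[0], X, y)
print(normalize(imp))
```

A feature the model never uses, such as feature 1 here, shows no performance drop when shuffled and so receives an importance of zero.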

Target-disease evidence score

Open Targets (https://www.targetvalidation.org) is a platform that provides comprehensive data integration, enabling access to and visualization of potential drug targets associated with cancer⁵². ChEMBL (https://www.ebi.ac.uk/chembl/) is a database that catalogs bioactive molecules with drug-like properties⁵³. The ChEMBL evidence score denotes a target-disease relationship that is supported by an FDA-approved drug or a clinical candidate drug targeting the gene product in question and indicated for cancer treatment⁵⁴. In this study, to validate the significance of our previously predicted druggable proteins, we compared the ChEMBL evidence scores of druggable cancer-driving proteins and those of ‘hard-to-drug’ proteins. The scores were then statistically analyzed using the Bonferroni correction test, with a significance threshold set at P < 0.001.

In our pursuit to identify and prioritize the most critical druggable cancer-driving proteins, we retrieved target-disease evidence scores provided by ten distinct bioinformatic tools. This analytical effort spanned proteins currently undergoing clinical trials as well as those not yet under examination in clinical trials. Regarding these tools, Open Target Genetics (https://genetics.opentargets.org) specializes in identifying trait-causal genes from significant loci in genome-wide association studies (GWAS)⁵⁵. ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) serves as a repository detailing the relationships between human germline or somatic variants and their associated phenotypes. Its evidence score is based on the clinical significance of a genetic variant^56,57. The Genomics England PanelApp (https://panelapp.genomicsengland.co.uk) merges crowdsourced expertise with curation to establish target-cancer relationships⁵⁸. Cancer Gene Census (https://cancer.sanger.ac.uk/census), part of the Wellcome Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC), aims to catalog genes containing mutations causally linked to cancer⁵⁹. IntOGen (https://www.intogen.org) offers a methodology to pinpoint potential cancer driver genes, using large-scale mutational data from sequenced tumor samples⁶⁰. The Cancer Biomarkers database, part of the Cancer Genome Interpreter (http://www.cancergenomeinterpreter.org), features biomarkers relevant to drug sensitivity, resistance, and toxicity for drugs targeting specific cancer entities⁶¹. SLAPenrich (https://saezlab.github.io/SLAPenrich/) introduces a novel statistical approach to recognize significantly mutated pathways at the population level across large cohorts of cancer patients⁶². The Reactome Knowledgebase (https://reactome.org) delineates molecular aspects of various cellular functions like signal transduction and metabolism, identifying reaction pathways impacted by specific cancer types⁶³. 
Phenotype comparisons for DIsease Genes and Models (PhenoDigm) (http://www.sanger.ac.uk/resources/databases/phenodigm), an algorithm provided by the International Mouse Phenotyping Consortium (IMPC), offers insights into gene–disease associations by analyzing phenotype data⁶⁴. Lastly, an overall score was calculated integrating the information from all bioinformatic approaches to identify and prioritize essential druggable cancer-driving proteins.

Drugs involved in late-phase clinical trials

The Open Targets platform (https://www.targetvalidation.org), enhanced with ChEMBL annotations, provides an integrated data framework. This enables access and visualization of potential drug targets associated with cancer^52,54,65. Furthermore, the Broad Institute’s Drug Repurposing Hub (https://clue.io/repurposing) is a curated collection of drugs approved by the Food and Drug Administration (FDA). This hub aided us in discerning the mechanisms of action for drugs employed in cancer treatments⁴⁴. As a result, we analyzed the druggable cancer-driving proteins with a ChEMBL evidence score of > 0.9 to map out the therapeutic landscape for drugs in phase III and IV clinical trials.

Distance score of shortest pathways to cancer hallmark phenotypes

CancerGeneNet (https://signor.uniroma2.it/CancerGeneNet/) is a curated bioinformatics resource provided by the SIGnaling Network Open Resource (SIGNOR 3.0)^66,67. This platform uses experimental annotations to bridge two interaction layers crucial to cell physiology, connecting proteins influenced by cancer drivers with proteins that impact the hallmarks of cancer^66,68. To elucidate the dynamics of these interactions, the procedure for calculating the distance score for the shortest pathways is outlined as follows: a) initiate a path query between two nodes; b) within the path string, each step is characterized by a pair of nodes and an edge, representing the nature of the interaction (e.g., activation or inhibition); c) the ‘distance’ parameter calculates the path length, incorporating the reliability of each step. Each step's reliability score, denoted as 'r', is derived from supporting evidence extracted from the STRING database⁶⁹. This score is converted into a distance using the equation: \(d = 1 - r\). The final path score, represented as \(D_{path} = \sum_{rel=1}^{N} (1 - r_{rel})\), is the sum of each step distance, with 'N' standing for the total number of steps in a path^67,70.

Iannuccelli et al. implemented a programmatic approach to calculate the shortest distance scores, or paths, between specific proteins and cancer phenotypes using the ‘shortest path’ function from the igraph R package. Our primary aim was to probe the signaling nexus between druggable cancer-driving proteins (not yet involved in clinical trials and with a ChEMBL evidence score = 0) and the hallmarks of cancer⁷¹.
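
The combination of the step distance d = 1 − r with a shortest-path search can be sketched with Dijkstra's algorithm. The original analysis used the igraph R package; the Python function below is only an assumed analogue, and the node names and reliability values are hypothetical placeholders rather than SIGNOR or STRING records.

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra over directed edges {(u, v): r}, using d = 1 - r as step length."""
    graph = {}
    for (u, v), r in edges.items():
        graph.setdefault(u, []).append((v, 1.0 - r))
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path  # dist is the summed path score D_path
        if node in visited:
            continue
        visited.add(node)
        for nxt, step in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (dist + step, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical reliabilities between a protein, an intermediate, and a hallmark.
edges = {("A", "B"): 0.9, ("B", "HALLMARK"): 0.8, ("A", "HALLMARK"): 0.5}
score, path = shortest_path(edges, "A", "HALLMARK")
print(path, round(score, 3))
```

In this toy network the two-step route through B scores about 0.3 and is preferred over the direct but less reliable edge, which scores 0.5. High-reliability steps thus contribute little distance, so a longer path can still be the shortest one.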

Within this framework, we determined the shortest paths for both positive and negative regulations of druggable cancer-driving proteins linked to angiogenesis, immortality, inflammation, metastasis, proliferation, cell death, differentiation, DNA repair, and glycolysis. We then carried out multiple comparison tests, employing the Bonferroni correction (P < 0.001, 95% confidence interval), to compare the distance scores of druggable cancer-driving proteins across different cancer phenotypes. Lastly, we ranked these druggable proteins with the shortest paths to each cancer hallmark phenotype.
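
As a brief illustration of the Bonferroni correction used in these comparisons, the sketch below multiplies each raw P-value by the number of tests and compares the result with the 0.001 threshold. The raw P-values are hypothetical, and the exact statistical routine used in the study is not specified here.

```python
def bonferroni(p_values, alpha=0.001):
    """Return (adjusted p-value, significant?) for each raw p-value."""
    m = len(p_values)  # number of comparisons performed
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


raw = [1e-5, 4e-4, 0.02]  # hypothetical raw P-values from three comparisons
results = bonferroni(raw)
print([significant for _, significant in results])  # [True, False, False]
```

The second comparison shows the effect of the correction: a raw P-value of 0.0004 is below 0.001 on its own but no longer significant once three comparisons are accounted for.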

Chemistry-based score

canSAR (http://cansar.icr.ac.uk) is a comprehensive knowledgebase dedicated to drug discovery. It integrates data from genomics, proteomics, pharmacology, drugs, and chemicals with structural proteins and protein networks⁷². This bioinformatic resource encompasses the complete human proteome (20,375 sequences) sourced from the UniProt Swiss-Prot database⁷³. Additionally, canSAR provides an extensive structure-based ligandability assessment, covering more than 4.5 million cavities⁷². The chemistry-based score is categorized into four levels: low (0–24%), suggesting the protein is less likely to be a successful drug target; moderate (25–49%), indicating a moderate probability of druggability; high (50–74%), suggesting the protein has a good probability of being druggable; and very high (75–100%), indicating the protein is very likely to be druggable and is often considered a high priority for drug development due to its high probability of successful binding with drugs⁷². Using this data, we retrieved the chemistry-based score to validate our machine learning prediction method and prioritize key druggable cancer-driving proteins for each cancer type.
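
The four score levels can be expressed as a simple lookup. The function below is a hypothetical helper, not part of canSAR; it assumes the score is given as a percentage and that fractional values between the listed bounds (for example 24.5) fall into the lower level.

```python
def ligandability_level(score_percent):
    """Map a chemistry-based score (0-100%) to the four levels described above."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent < 25:
        return "low"
    if score_percent < 50:
        return "moderate"
    if score_percent < 75:
        return "high"
    return "very high"


print([ligandability_level(s) for s in (10, 25, 74, 100)])
# ['low', 'moderate', 'high', 'very high']
```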

A pathology atlas for human cancer

The Human Pathology Atlas, available at (https://www.proteinatlas.org/humanproteome/pathology), is an integral component of the Human Protein Atlas project. This atlas explores the prognostic relevance of druggable cancer-driving genes/proteins across 17 The Cancer Genome Atlas (TCGA) PanCancer types in almost 8,000 patients^74,75. Anchored in transcriptomics and antibody profiling, the atlas emerges as an essential tool for tailoring cancer treatments based on precision oncology⁷⁴. Immunohistochemistry (IHC) stands as the gold standard method for in situ protein expression analysis in tissue samples. The combination of IHC and tissue microarray (TMA) technology allows simultaneous analysis of hundreds of tissue samples with an unprecedented degree of experimental standardization⁷⁶.

The Atlas provides staining profiles for proteins in human tumor tissues, generated through the synergy of IHC and TMA techniques. This is complemented by Kaplan–Meier analysis, linking mRNA expression levels to patient survival. Patient samples were classified into two expression groups and the correlation between expression level and patient survival was examined. Using Kaplan–Meier survival estimators, the prognosis of different patient cohorts was determined. Log-rank tests were employed to compare these results. Genes/proteins with marked correlations to detrimental outcomes (log-rank P-values < 0.001) in the Kaplan–Meier evaluations were pinpointed as unfavorable prognostic indicators across TCGA PanCancer types⁷⁷.
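
For readers unfamiliar with the Kaplan–Meier estimator, the following sketch computes a survival curve for one expression group. It is a generic textbook implementation, not the Human Pathology Atlas pipeline, and the follow-up times and event flags are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if death was observed at times[i] and 0 if the patient
    was censored (still alive at last follow-up).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical follow-up (years) for a high-expression patient group.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print([(t, round(s, 2)) for t, s in curve])  # [(2, 0.8), (3, 0.6), (5, 0.3)]
```

Computing one such curve per expression group and comparing them with a log-rank test gives the P-values used to flag unfavorable prognostic genes.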

Functional enrichment analysis

Functional enrichment analysis provides researchers with curated insights and a deeper understanding of protein sets derived from omics-scale experiments. For our study, we focused on druggable cancer-driving proteins that have not yet entered clinical trials, as indicated by a ChEMBL evidence score of 0. These proteins also show an unfavorable prognosis across various TCGA PanCancer types. To evaluate enrichment, we employed g:Profiler version e101_eg48_p14_baf17f0, accessible at (https://biit.cs.ut.ee/gprofiler/gost)⁷⁸. Our objective was to pinpoint significant annotations, following the Benjamini–Hochberg FDR q < 0.001 criteria, related to Gene Ontology (GO) biological processes (http://geneontology.org/)⁷⁹ and Reactome signaling pathways (https://reactome.org/)⁶³. The results of the functional enrichment analysis were visualized using a Manhattan plot, and significant terms associated with cancer hallmark phenotypes were manually curated⁸⁰.
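
The Benjamini–Hochberg adjustment behind the q < 0.001 criterion can be sketched as below. g:Profiler performs this correction internally; the function is a generic implementation of the procedure, and the P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that q-values
    # stay monotone in the p-value ranking.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


p = [0.0001, 0.0004, 0.03, 0.5]  # hypothetical enrichment P-values
q = benjamini_hochberg(p)
significant = [qi < 0.001 for qi in q]
print(significant)  # [True, True, False, False]
```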

The oncogenic variome of key druggable cancer-driving proteins

Identifying the oncogenic variome of druggable cancer-driving proteins encompassed two primary steps. First, we obtained 22,320 single nucleotide and insertion/deletion variants from the 23 key druggable cancer-driving proteins. This data was retrieved from 76,156 genomes belonging to the Genome Aggregation Database (gnomAD v3.2.1) (https://gnomad.broadinstitute.org/), using the GRCh38/hg38 human reference genome^1,81,82. Second, we applied the OncodriveMUT and boostDM methods integrated within the Cancer Genome Interpreter (CGI) platform (https://www.cancergenomeinterpreter.org) to evaluate the tumorigenic potential of the acquired genomic variants. This approach enabled us to categorize variants into known driver, predicted driver, and passenger classifications based on the Catalog of Validated Oncogenic Mutations^61,83. OncodriveMUT is a rule-based strategy that analyzes genomic features, such as regions depleted by germline variants, gene mechanisms of action, gene signals of positive selection, and clusters of somatic mutations. Conversely, boostDM is a machine learning strategy that assesses the oncogenic potential of mutations in human tissues by employing in silico saturation mutagenesis of cancer genes^61,83.

Deleteriousness of the oncogenic variome

The Combined Annotation-Dependent Depletion (CADD) tool, version 1.4 (https://cadd.gs.washington.edu/), is a bioinformatic resource that assesses the deleterious effects of diverse gene mutations within the human genome. It integrates over 60 genomic features to evaluate the impact of single nucleotide and insertion/deletion variants⁸⁴. The CADD framework analyzes multiple annotations by comparing variants that have survived natural selection against simulated mutations, using the GRCh38/hg38 human reference genome⁸⁵. For this study, we used CADD to determine the deleteriousness of both known and predicted oncogenic variants associated with pivotal druggable cancer-driving genes. Lastly, the CADD deleteriousness scores were categorized as very high (30–50), high (25–30), medium (15–25), low (10–15), and very low (0–10).

Artificial intelligence prediction of drugs and metabolites

To investigate the potential interactions of current drugs (for drug repurposing) and metabolites with the best-predicted proteins, an artificial intelligence (AI)-based tool called Protein–Ligand Binding Affinity Prediction Using Pretrained Transformers (PLAPT) was used to predict the interaction affinity (or negative log10 affinity)⁸⁶. PLAPT (https://github.com/trrt-good/WELP-PLAPT/) predicts the binding affinity of ligand–protein complexes using the SMILES code of ligands and the sequence of proteins. The model employs pre-trained transformers such as ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model.
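
The relationship between an affinity value and its negative log10 form can be made concrete with a small conversion sketch. The helper names are hypothetical and are not part of the PLAPT package; the code assumes the affinity is expressed in molar units.

```python
import math


def neg_log10_affinity(affinity_molar):
    """Convert an affinity in mol/L (assumed unit) to its negative log10 value."""
    return -math.log10(affinity_molar)


def affinity_from_neg_log10(value):
    """Inverse conversion: recover the affinity in mol/L."""
    return 10 ** (-value)


print(round(neg_log10_affinity(1e-9), 6))  # 9.0 for a 1 nM binder
print(affinity_from_neg_log10(6.0))        # about 1e-06 mol/L, i.e. 1 µM
```

On this scale a larger value means tighter predicted binding, and each unit corresponds to a tenfold change in affinity.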

Using the PLAPT Python package, all ligand–protein affinities were calculated. Two families of ligands were used: 2,466 ChEMBL approved drugs with masses between 100 and 500 (downloaded via a Python script) for potential drug repurposing against multiple predicted druggable proteins, and 217,776 molecules from the Human Metabolome Database (HMDB) (https://hmdb.ca/downloads) against a single important predicted druggable protein (due to computational limits)⁸⁷.

Results and discussion

Machine learning prediction model

The current study introduces innovative classification models designed to predict new druggable proteins that drive cancer. These predictions are based on three sets of protein sequence descriptors (amino acid composition, di-amino acid composition, and tri-amino acid composition), calculated using Rcpi. These descriptors were chosen for their proven ability to capture essential information about protein sequences that are critical for predicting druggability⁸⁸.

AC effectively represents a protein's primary structure by highlighting the frequency of each amino acid within the sequence, helping to identify general trends and patterns associated with druggable proteins. DC captures the local interactions between pairs of amino acids, providing insight into the secondary structure and local folding patterns, which are crucial for understanding functional regions and binding sites. TC considers interactions between triplets of amino acids, offering a more detailed view of the amino acid sequence, which is essential for accurately predicting protein interactions with drugs and other ligands^30,89.

Focusing on these features ensures computational efficiency and reduces the risk of overfitting, which can occur with an excessive number of features. Our comprehensive benchmarking demonstrated that these descriptors consistently provided robust performance across various machine learning classifiers. While the inclusion of additional features, such as secondary structure elements or solvent accessibility, might offer incremental benefits, the chosen descriptors strike an optimal balance between model performance, computational feasibility, and biological relevance. This balance allows for effective and interpretable predictions while maintaining the practicality of the computational framework. Furthermore, the identified amino acid sequence patterns will inform future studies on protein properties.

Subsequently, we utilized Jupyter notebooks built on Python and scikit-learn to construct 13 types of ML classifiers (GNB, KNN, LDA, SVM linear, SVM, LR, MLP, DT, RF, XGB, GB, AdaB, and Bagging), along with five types of feature selection methods with various parameters (Fig. 1). All scripts used the mean AUROC values from threefold cross-validation to quantify classification performance. We tested models using 20, 100, 200, and 400 features³⁰.
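
Because AUROC is the metric used throughout, a small sketch of what it measures may help. The function below uses the pairwise-ranking definition of AUROC rather than the scikit-learn routine used in the notebooks, and the labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0]            # 1 = druggable, 0 = 'hard-to-drug'
scores = [0.9, 0.8, 0.4, 0.5, 0.2]  # hypothetical classifier scores
print(round(auroc(labels, scores), 3))  # 0.833
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.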

Figure 2 illustrates the AUROC values for a classifier using only 20 features: AC descriptors without feature selection, DC descriptors with LinearSVC feature selection (DC-LinearSVC20), PCA features from DC (DC-PCAn20), TC descriptors selected by SelectPercentile(f_classif, percentile = 0.25) (TC-Percn20), and TC descriptors selected with LinearSVC (TC-LinearSVC20). Notably, using only 20 AC descriptors with SVM yielded an AUROC of 0.926. The best performance was achieved using SVM (RBF) with 20 PCA components from 400 DC descriptors, resulting in an AUROC of 0.958. Additional results can be found in Supplementary Table 6.

Figure 3 displays AUROC values for a classifier using 100 features: PCA transformation of 400 DC descriptors (DC-PCAn100), TC descriptors selected with SelectPercentile(f_classif, percentile = 1.25) (TC-Perc1.25), TC descriptors with LinearSVC (TC-LinearSVC100), and 100 features selected by LinearSVC from 200 PCA components of 8,000 TC descriptors (TC-PCA200LinearSVC100). Increasing the number of features to 100 (five times more than 20) improved the AUROC to 0.976 using the same SVM (RBF) with TC-PCA200LinearSVC100³⁰.

Figure 4 shows the AUROC values for classifiers using 200 selected features (double the number from 100): PCA transformation of 400 DC descriptors (DC-PCAn200), DC descriptors selected with SelectPercentile (DC-Perc50), 200 PCA components of 8,000 transformed TC descriptors (TC-PCAn200), TC descriptors selected with SelectPercentile(f_classif, percentile = 2.5) (TC-Perc2.5), and TC descriptors with LinearSVC (TC-LinearSVC200). The combination of PCA and SVM for DC-PCAn200 resulted in the best classifier, achieving an AUROC of 0.981 (Supplementary Table 6). Further, using all 400 DC descriptors with SVM, the mean AUROC reached 0.982 ± 0.0021. Additionally, with 8,000 pure TC descriptors and SVM linear, the mean AUROC was 0.992 ± 0.0028. However, a classification model should avoid having more input features than data instances, and the 8,000 TC descriptors far exceed the 885 proteins in the original dataset. We also sought to prioritize pure descriptors over PCA transformations. As a compromise, we selected the following as the best model for subsequent protein-related cancer predictions: 200 TC descriptors selected with LinearSVC, used with a non-linear SVM (RBF) classifier, which achieved an AUROC of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The list of the 200 selected features is available in the Jupyter notebooks.
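
The percentile settings quoted for the SelectPercentile runs follow from simple arithmetic on the descriptor counts. The helper below is a hypothetical illustration of that arithmetic only; scikit-learn's SelectPercentile applies a score threshold internally, so the number of retained features can differ slightly when scores are tied.

```python
def n_selected(n_features, percentile):
    """Approximate number of features kept when retaining the top `percentile` percent."""
    return round(n_features * percentile / 100)


print(n_selected(8000, 0.25))  # 20  (TC-Percn20)
print(n_selected(8000, 1.25))  # 100 (TC-Perc1.25)
print(n_selected(8000, 2.5))   # 200 (TC-Perc2.5)
print(n_selected(400, 50))     # 200 (DC-Perc50)
```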

Selected features analysis

The following is the list of selected features for the best model: NRA, QRA, INA, MCA, YEA, THA, CSA, VYA, KNR, WDR, TER, PQR, YGR, EHR, LIR, VSR, ERN, MDN, SDN, LHN, YIN, FFN, RSN, QSN, FWN, ACD, WCD, MED, CHD, SHD, MLD, SMD, WPD, SSD, HTD, DWD, VYD, KNC, NDC, IHC, VHC, GYC, MCE, NHE, ALE, HME, LPE, AWE, EYE, QYE, GVE, FVE, SAQ, FNQ, MDQ, PCQ, WEQ, RQQ, NGQ, HLQ, RMQ, DFQ, GPQ, DSQ, YSQ, AWQ, RVQ, QRG, HGG, TGG, KLG, NKG, FPG, SSG, RTG, PTG, IVG, CDH, FDH, PDH, TQH, KHH, FHH, IFH, NSH, WSH, FWH, WRI, NDI, EDI, FEI, WEI, WQI, MGI, PMI, AAL, EKL, IKL, FKL, GPL, ESL, DVL, MVL, VVL, GNK, HNK, HDK, HCK, EQK, DHK, QLK, EKK, SMK, FFK, QSK, EWK, AVK, WRM, WNM, REM, WQM, SHM, LLM, SMM, NFM, TSM, RWM, GYM, KYM, VYM, HVM, IVM, LDF, YQF, NGF, HGF, FWF, FAP, FNP, PEP, SQP, QGP, VHP, PLP, HKP, NPP, QPP, STP, TTP, KWP, YWP, SRS, HDS, WDS, HCS, LES, DHS, SHS, PSS, SSS, LWS, LAT, DRT, GRT, IRT, INT, VQT, NLT, CLT, KKT, YTT, QWT, FYT, KCW, QGW, VGW, MIW, IKW, RFW, DFW, HVW, KVW, NRY, CHY, DMY, YPY, YAV, SRV, ENV, HNV, GEV, QGV, HGV, TGV, WHV, LLV, IMV, DSV, TSV, QYV. The normalized importance for the 10% selected features is presented in Table 1. The most important amino acid patterns for this classification are HME, NSH, SSS, HTD, DHK, ERN, NDI, DRT, VYD, FFN, SHM, NDC, RFW, WRI, GYC, MGI, PEP, GVE, DSQ, and LLV. The HME pattern is the most important feature for druggable proteins, while the NSH pattern has only half the importance of HME.

Table 1 Feature importance for 10% of the selected features of the best classification model.


In Table 2, the frequencies of the amino acids in all selected features demonstrate the importance of H (histidine), S (serine), D (aspartic acid), and Q (glutamine) in classifying druggable proteins. Additionally, H and S appear in the first five most important tri-amino acid patterns. The biological significance of the amino acids in these patterns is outlined below: (a) HME (histidine–methionine–glutamic acid): Histidine is essential for protein synthesis and enzyme catalysis, methionine is the initiator amino acid for protein translation, and glutamic acid is involved in neurotransmission and protein folding; (b) NSH (asparagine–serine–histidine): Asparagine is crucial for glycoprotein synthesis and serine is involved in phosphorylation and protein structure; (c) SSS (serine–serine–serine): serine is essential for cell signaling, protein synthesis, and metabolism; (d) HTD (histidine–threonine–aspartic acid): Threonine is important for protein stability and immune function, and aspartic acid contributes to protein structure and function; and (e) DHK (aspartic acid–histidine–lysine): Lysine is essential for protein synthesis and collagen formation^5,90,91.
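The residue tally behind this kind of table is a simple count over the letters of the selected tri-amino acid patterns. The sketch below uses only the 20 most important patterns quoted in the text, so its counts are for that subset and are not a reproduction of Table 2; the function name `residue_counts` is hypothetical.

```python
from collections import Counter

# The 20 most important patterns listed in the text (a subset of the 200).
TOP_PATTERNS = [
    "HME", "NSH", "SSS", "HTD", "DHK", "ERN", "NDI", "DRT", "VYD", "FFN",
    "SHM", "NDC", "RFW", "WRI", "GYC", "MGI", "PEP", "GVE", "DSQ", "LLV",
]

def residue_counts(patterns):
    """Count how often each one-letter amino acid code occurs across patterns."""
    counts = Counter()
    for pattern in patterns:
        counts.update(pattern)  # a str is iterated letter by letter
    return counts

counts = residue_counts(TOP_PATTERNS)
for residue, n in counts.most_common(5):
    print(residue, n)
```

Passing the full list of 200 selected features instead of `TOP_PATTERNS` would give the frequencies over all selected patterns.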

Table 2 Frequencies of the amino acids in the selected tri-amino acids groups for the best classification model.


Cancer-driving proteins

We transformed 2,339 cancer-driving proteins into molecular descriptors using the best model to predict their druggability. Consequently, these protein sequences were converted into 200 selected TC descriptors. As a result, 2,080 (88.9%) of these cancer-driving proteins were predicted to have druggable activity (Fig. 5A and Supplementary Table 5). For validation, we compared the ChEMBL evidence scores of proteins involved in clinical trials⁵⁴, distinguishing among the positive set of druggable proteins (mean score = 0.712), druggable cancer-driving proteins (class 1, mean score = 0.706), ‘hard-to-drug’ cancer-driving proteins (class 0, mean score = 0.596), and the negative set of ‘hard-to-drug’ proteins (mean score = 0.414). As expected, the Bonferroni correction revealed no significant difference between the positive set and druggable cancer-driving proteins, nor between the negative set and ‘hard-to-drug’ proteins. Interestingly, it did reveal a significant difference between druggable cancer-driving proteins (class 1) and ‘hard-to-drug’ proteins (class 0) (P < 0.001) (Fig. 5B). This indicates that druggable cancer-driving proteins are distinctively more validated as potential targets compared to ‘hard-to-drug’ proteins, underscoring the relevance and accuracy of the classification method used. These findings validate the effectiveness of the prediction model in distinguishing between truly druggable targets and those that are more challenging to target therapeutically, highlighting its potential utility in the drug discovery process.
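To make the descriptor step concrete, the following sketch computes tri-amino acid composition (TC) values for a sequence. It assumes the common definition, the count of each overlapping tripeptide divided by the number of tripeptide windows; the exact normalization used by the descriptor software in the study may differ, and `tc_descriptors` is a hypothetical helper name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8,000 possible tripeptides, in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]

def tc_descriptors(sequence, patterns=TRIPEPTIDES):
    """Return the relative frequency of each tripeptide in `patterns`.

    Frequencies are counts over overlapping windows of length 3,
    divided by the number of windows (assumed normalization).
    """
    seq = sequence.upper()
    n_windows = len(seq) - 2
    if n_windows < 1:
        raise ValueError("sequence must have at least 3 residues")
    counts = {}
    for i in range(n_windows):
        tri = seq[i:i + 3]
        counts[tri] = counts.get(tri, 0) + 1
    return [counts.get(p, 0) / n_windows for p in patterns]

# Restricting `patterns` to selected features yields a reduced vector.
example = tc_descriptors("HMESSS", patterns=["HME", "SSS", "NSH"])
print(example)
```

With `patterns` set to the 200 selected features, each protein sequence maps to the 200-dimensional input vector that the classifier expects.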

Following the prediction and validation of the 2,080 druggable cancer-driving proteins, we extracted the target-disease evidence scores from the Open Targets platform. This was done to prioritize the most relevant druggable cancer-driving proteins already involved in late-stage clinical trials (ChEMBL score > 0.9) and those not yet involved in clinical trials (ChEMBL score = 0)^52,53,54. The target-disease evidence score encompassed data from various bioinformatic tools including Open Targets Genetics⁵⁵, ClinVar (covering germline and somatic variants)^56,57, Genomics England PanelApp⁵⁸, Cancer Gene Census⁵⁹, IntOGen⁶⁰, the Cancer Biomarkers database⁶¹, SLAPenrich⁶², the Reactome Knowledgebase⁶³, and PhenoDigm⁶⁴. This overall score, derived from an integration of these bioinformatic approaches, enabled us to identify proteins strongly associated with cancer traits. Of these, 52 were druggable cancer-driving proteins involved in late-phase clinical trials (Fig. 5C and Supplementary Tables 7 and 8), and 296 were druggable cancer-driving proteins not yet involved in clinical trials (Fig. 5D and Supplementary Tables 7 and 9). Furthermore, the five bioinformatic approaches yielding the highest target-disease evidence scores for the 296 druggable proteins not yet in clinical trials were Cancer Gene Census (mean = 0.90), SLAPenrich (0.88), Reactome (0.84), Genomics England PanelApp (0.79), and Cancer Biomarkers (0.77) (Fig. 5E).

Drugs involved in late-phase clinical trials

Figure 6 presents an update on phase III and IV clinical trials involving drugs that target cancer-driving proteins, as cataloged by the Open Targets Platform⁵². The Sankey plot in the figure reveals a total of 257 clinical trial events, involving 94 drugs with 38 different mechanisms of action, which target 52 key cancer-driving proteins across 26 types of cancer (Supplementary Table 10). The most frequently involved drugs in these late-phase clinical trials were regorafenib, binimetinib, pazopanib, and sorafenib. The mechanisms of action most common in these trials included FGFR inhibitors, FLT3 inhibitors, MEK inhibitors, and EGFR inhibitors. The cancer-driving proteins most frequently targeted in the trials were GABRB2, MAP2K1, and MAP2K2. Additionally, the cancer types most commonly evaluated in these late-phase clinical trial events were liver cancer, lung cancer, breast cancer, leukemia, and colorectal cancer. This comprehensive therapeutic landscape has enabled us to identify key patterns and trends in cancer treatment research.

Shortest pathways to cancer hallmark phenotypes

After identifying 296 druggable proteins not yet involved in clinical trials, we conducted multi-omics analyses to prioritize the most relevant cancer-driving proteins as potential therapeutic targets across various cancer types^{70,80,92,93,94}. In this context, we employed the CancerGeneNet software and found that 184 (62%) of these proteins showed distance scores indicative of their involvement in the shortest pathways leading to cancer hallmark phenotypes^66,67, as detailed in Supplementary Table 11. Figures 7A and B illustrate these druggable proteins and their shortest paths to cancer hallmarks. The top three hallmarks are cell proliferation (with a mean distance score of 1.27 and 154 proteins involved), cell differentiation (1.51; 160), and resistance to cell death (1.55; 157) (Supplementary Table 12). Utilizing the Bonferroni correction test, we observed that these druggable proteins had significantly shorter paths to these cancer hallmark phenotypes (P < 0.001). These findings are highly relevant because the prioritized druggable proteins in this analysis could be crucial targets for focusing new therapeutic strategies on processes such as cell proliferation or resistance to cell death.
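The idea of a distance score along the shortest path from a protein to a hallmark phenotype can be illustrated with Dijkstra's algorithm on a small weighted graph. This is only a generic sketch: the node names and edge weights are invented, and it does not reproduce how CancerGeneNet builds its network or computes its scores.

```python
import heapq

def shortest_distance(graph, source, target):
    """Dijkstra shortest-path distance on a directed graph with
    non-negative edge weights; returns inf if target is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")

# Hypothetical toy network; names and weights are illustrative only.
toy_graph = {
    "PROTEIN_A": {"PROTEIN_B": 0.4, "PROTEIN_C": 1.0},
    "PROTEIN_B": {"PROLIFERATION": 0.5},
    "PROTEIN_C": {"PROLIFERATION": 0.2},
}

d = shortest_distance(toy_graph, "PROTEIN_A", "PROLIFERATION")
print(round(d, 2))
```

Here the route through `PROTEIN_B` (0.4 + 0.5) beats the route through `PROTEIN_C` (1.0 + 0.2), so a lower score indicates a closer functional link to the phenotype.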

Chemistry-based score

canSAR is a comprehensive knowledgebase dedicated to drug discovery and offers an extensive structure-based ligandability assessment⁷². Consequently, we retrieved the chemistry-based scores for the previously prioritized 184 proteins. The mean chemistry-based score of these 184 proteins was 69.9%. In our analysis, we considered all proteins with a ligandability score higher than the mean (cutoff > 69.9%), encompassing all proteins with the very high scores and the best proteins with high scores. This analysis enabled us to identify 79 (43%) druggable cancer-driving proteins with the highest ligandability, as shown in Fig. 7C and Supplementary Table 13. Ligandability analysis refers to a protein’s ability to bind efficiently to a drug. High ligandability helps identify and prioritize proteins that can be effective targets for new drugs, thereby increasing the specificity of the drug’s action and reducing the time and cost associated with pharmaceutical development⁹⁵.

A pathology atlas for human cancer

We explored the Human Pathology Atlas, developed by the Human Protein Atlas program, and subsequently conducted a Kaplan–Meier analysis to examine the correlation between mRNA and protein expression and patient survival^74,75,76,77. This analysis aimed at determining the prognostic significance of 79 highly ligandable, druggable cancer-driving genes/proteins (Supplementary Table 14). Our findings underscore the effectiveness of large-scale system biology projects that utilize publicly available resources. In this study, we identified the 23 key druggable cancer-driving genes/proteins that demonstrated unfavorable prognostic significance (significant log rank P-value < 0.001) across 16 TCGA PanCancer types. These genes/proteins were CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4 (Fig. 7D and Supplementary Table 15).
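For readers unfamiliar with the Kaplan–Meier estimator, the sketch below computes the survival curve for one patient group from follow-up times and event indicators. It covers only the estimator itself; the log-rank comparison between high- and low-expression groups reported above is not implemented here, and the input values are made up.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored both leave the risk set
    return curve

# Made-up follow-up data for five patients.
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients never lower the curve directly; they only shrink the risk set used at later event times.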

Functional enrichment analysis

We conducted a functional enrichment analysis of the 23 key druggable cancer-driving proteins using g:Profiler software⁷⁸. The Manhattan plot enabled us to identify 64 GO biological processes⁷⁹ and 2 Reactome signaling pathways⁶³ (Fig. 7E and Supplementary Table 16). The most significant annotations, adjusted with the Benjamini–Hochberg correction and an FDR q-value < 0.001, included cell cycle, cell communication, phosphorylation, immune system process, programmed cell death, cell differentiation, cellular senescence, endocrine resistance, G1 phase, and cyclin D events in G1. Notably, these 23 key druggable cancer-driving proteins are involved in biological processes associated with various therapeutic strategies. These strategies include the inhibition of cellular proliferation⁹⁶, the inhibition of phosphorylation⁹⁷, cancer immunotherapy⁹⁸, activation of programmed cell death⁹⁹, regulation of senescence¹⁰⁰, and evasion of endocrine resistance¹⁰¹.
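The Benjamini–Hochberg adjustment mentioned above converts raw p-values into FDR q-values. A minimal standalone sketch of the standard step-up procedure follows; it is a generic implementation with made-up p-values, not the code that g:Profiler runs internally.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values),
    in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Made-up p-values for four annotation terms.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(v, 4) for v in q])
```

Terms would then be retained when their q-value falls below the chosen FDR threshold (0.001 in the analysis above).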

The oncogenic variome of key druggable cancer-driving proteins and their deleterious effects

Figure 8A presents the analysis of 22,320 variants using OncodriveMUT and boostDM to determine the oncogenic variome in the 23 key druggable cancer-driving genes. This analysis identified 1,598 oncogenic variants, with 11 (1%) being previously known and 1,578 (99%) newly predicted. The analysis of deleteriousness scores revealed that 252 (16%) of these oncogenic variants had very high CADD scores, 788 (49%) had high CADD scores, and 506 (32%) had medium CADD scores. The most common types of genetic alterations were missense variants (81%), followed by frameshift (6%), and stop-gained variants (5%). Figure 8B displays box plots that illustrate the deleteriousness scores of the oncogenic variants according to their consequence types. Stop-gained variants exhibited the highest mean CADD score (37.2), followed by splice donor (31.2), splice acceptor (30.9), missense (25.9), frameshift (25.8), start lost (21.2), stop lost (17.7), inframe deletion (17.4), splice region (16.8), and inframe insertion variants (16.7). Lastly, Fig. 8C presents bean plots that rank the key druggable cancer-driving genes based on the highest number of oncogenic variants and their deleteriousness scores (Supplementary Table 17).

Identifying oncogenic variants in cancer-driving genes is crucial for developing targeted therapies^{102,103,104,105}. These therapies are specifically designed to inhibit or modify the function of proteins produced by mutated genes, offering more effective treatment options with potentially fewer side effects compared with traditional chemotherapy¹⁰⁶. Moreover, this approach enables personalized precision medicine. By understanding specific genetic and epigenetic alterations in a patient’s tumor, treatments can be tailored to target these changes^{107,108,109,110}. In this context, the identification of oncogenic variants in druggable cancer-driving genes is a fundamental aspect of modern oncology, influencing everything from individual patient treatment to broader aspects of cancer research, ethnicity, and public health initiatives^106,111,112.

This integrative approach has identified 23 key druggable cancer-driving proteins (CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4), setting the stage for improved therapeutic targets that could significantly boost the efficacy of clinical trials.

Testing the model’s limitations

Like any model, this one has limitations when used for prediction. Due to the limited data on druggable proteins, all 666 druggable proteins were used as class 1 to train the model. This makes it impossible to obtain an external dataset with druggable proteins to confirm the predictive power of the best model. One way to test the model's limitations is to plot the best protein predictions within the space of the selected features, alongside the druggable proteins and hard-to-drug proteins. Since plotting in 200 dimensions (the number of selected features in the best model) is impractical, we approximate by transforming these 200 dimensions into just 2 PCA components for visualization. Class 0 descriptors (hard-to-drug proteins), class 1 descriptors (druggable proteins), and the descriptors corresponding to the 23 key druggable cancer-driving proteins (predicted proteins) were converted into standard units, as in the original dataset for TC descriptors, and transformed into 2 PCA components for visualization in Fig. 9. In the figure, druggable proteins are shown in blue, hard-to-drug proteins in red, and the best predicted proteins in green. The plot indicates that even though the negative class (class 0) contains phosphatase proteins, there is no clear separation between the training classes 1 and 0 within the space of the selected TC descriptors, indicating a complex descriptor space.
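The projection described here (standardize the 200 descriptors, then reduce to 2 PCA components) can be sketched with scikit-learn as below. The arrays are random stand-ins, and fitting the scaler and PCA on the training descriptors only, then applying them to the predicted proteins, is an assumption of this sketch; the text does not state which points the PCA was fitted on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-ins (NOT study data): rows are proteins, columns are
# the 200 selected TC descriptors.
X_train = rng.random((60, 200))   # class 0 and class 1 training proteins
X_pred = rng.random((23, 200))    # the predicted proteins

# Fit scaling and PCA on the training descriptors only (assumption).
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2, random_state=0).fit(scaler.transform(X_train))

train_2d = pca.transform(scaler.transform(X_train))
pred_2d = pca.transform(scaler.transform(X_pred))

# Share of total variance that survives the 200 -> 2 reduction.
explained = float(pca.explained_variance_ratio_.sum())
print(train_2d.shape, pred_2d.shape, round(explained, 3))
```

Checking `explained` is worthwhile before reading too much into the plot: when two components capture only a small share of the variance, overlap between classes in 2D does not imply overlap in the full 200-dimensional space.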

Prediction points that fall within regions containing mixed points (both class 1 and class 0 points) may be the most trustworthy. In these regions, the model has been exposed to a more diverse dataset, enabling it to learn to better distinguish the patterns and characteristics that differentiate the two classes. Consequently, predictions in these regions are more likely to be accurate and reliable, as the model has learned more robust and generalizable features for data classification. The majority of the predicted proteins are located in these mixed regions, suggesting they have a higher potential to be future drug targets. In the supplementary material, a researcher can choose another model with a mean AUROC value greater than a specific cutoff (e.g., 0.9), with fewer features and possibly better PCA representation of the predictions. Future studies should use artificial intelligence and docking tools to predict a list of potential current drugs or new ligands.

Repurposing drugs and metabolites

An additional step to confirm the 23 key druggable cancer-driving proteins involves predicting interactions with ChEMBL-approved drugs (2,466 molecules with masses between 100 and 500) through drug repurposing^113,114. Using pairs of drug SMILES codes and protein sequences as inputs, a deep learning model called PLAPT evaluated the binding affinity (or negative log10 affinity)⁸⁶. The model employs pre-trained transformers like ProtBERT and ChemBERTa to convert the protein sequence and SMILES structure into embeddings for the model. Supplementary Tables 18 and 19, along with the GitHub file titled Supplementary_interactions_gene-drug(byPLAPT), present the affinity values for each drug-protein pair. The mean affinity values (minimum affinities or maximum negative log10 affinities) for all 23 proteins indicate that the top drugs clinically relevant to cancer treatment that can interact with these proteins include: mifepristone (targeting CASP8), pentostatin (BCL10, CASP8, CCNE1, and CDKN2A), afatinib (ACVR1, CDKN2C, and HRAS), alitretinoin (ACVR1, CDKN2C, HRAS, and PREX2), talazoparib (ACVR1, CDKN2C, and HRAS), alpelisib (ACVR1, CDKN2C, HRAS, NBN, PREX2, and SMARCA4), ulipristal acetate (ACVR1, ASXL1, CDKN2C, HRAS, NBN, PREX2, RB1, and SMARCA4), lorlatinib (ACVR1, ASXL1, ATG7, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), piflufolastat (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), pyrvinium pamoate (ASXL1, ATG7, BUB1B, DNM2, HRAS, JAG1, MARK3, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1), and tepotinib hydrochloride (ASXL1, ATG7, BUB1B, DNM2, JAG1, MARK3, MUTYH, NBN, PPP2R1A, PREX2, RB1, SETD2, SMARCA4, TPR, TSC1, and VAV1) (Fig. 10).
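The relationship between a negative log10 affinity and the affinity itself, and the ranking of drugs by their mean score across proteins, can be sketched as follows. The scores and names are hypothetical placeholders rather than PLAPT outputs, and the sketch assumes the affinity is expressed in molar units; no PLAPT API is called.

```python
from statistics import mean

def neg_log10_to_molar(neg_log10_affinity):
    """Convert a negative log10 affinity to an affinity in mol/L
    (assumes molar units). Higher input means tighter binding."""
    return 10.0 ** (-neg_log10_affinity)

def rank_drugs_by_mean(scores):
    """Rank drugs by mean negative log10 affinity over all proteins,
    best (highest mean) first. `scores` maps (drug, protein) -> value."""
    per_drug = {}
    for (drug, _protein), value in scores.items():
        per_drug.setdefault(drug, []).append(value)
    return sorted(((mean(v), d) for d, v in per_drug.items()), reverse=True)

# Hypothetical scores, NOT values from the study.
scores = {
    ("drug_a", "PROT1"): 7.0,
    ("drug_a", "PROT2"): 6.0,
    ("drug_b", "PROT1"): 5.0,
    ("drug_b", "PROT2"): 5.5,
}

ranking = rank_drugs_by_mean(scores)
for mean_score, drug in ranking:
    print(drug, mean_score, f"{neg_log10_to_molar(mean_score):.2e} M")
```

Because the scale is logarithmic, a difference of one unit in the score corresponds to a tenfold difference in affinity, which is why a maximum negative log10 affinity corresponds to a minimum affinity value.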

Mifepristone, a progesterone receptor antagonist, has been explored for its potential in treating glioblastoma, breast cancer, and uveal melanoma due to its ability to act on multiple receptor types, including glucocorticoid and androgen receptors^115,116,117. Pentostatin is a chemotherapy drug primarily used for treating hairy cell leukemia and T-cell prolymphocytic leukemia. It is a purine analog that works by inhibiting the enzyme adenosine deaminase, crucial for DNA synthesis and cell replication, leading to the accumulation of deoxyadenosine triphosphate and ultimately causing cell death, particularly in rapidly dividing cancer cells¹¹⁸. Afatinib is an oral medication primarily used for treating non-small cell lung cancer. It functions as a tyrosine kinase inhibitor, targeting and blocking the EGFR protein as well as other members of the ErbB family, including HER2 and ErbB4¹¹⁹. Alitretinoin, a derivative of vitamin A, is used in cancer treatment primarily for Kaposi sarcoma. It binds to and activates retinoid receptors (RAR and RXR), which regulate gene expression involved in cell differentiation and proliferation, helping to inhibit the growth of Kaposi sarcoma cells¹²⁰. Talazoparib works by inhibiting PARP enzymes, which play a crucial role in DNA repair. By blocking these enzymes, talazoparib prevents cancer cells from repairing their DNA, leading to cell death, especially in cells with BRCA1/2 mutations that already have compromised DNA repair mechanisms^43,121. Alpelisib is an oral medication used in combination with fulvestrant to treat hormone receptor-positive, HER2-negative advanced or metastatic breast cancer with PIK3CA mutations. It works as a PI3K inhibitor, specifically targeting the alpha isoform of the enzyme, which is crucial in the PI3K/AKT signaling pathway involved in cancer cell growth and survival¹²². Ulipristal acetate modulates the progesterone receptor, which is implicated in the proliferation and growth of certain cancer cells. It competes with progesterone, thereby inhibiting the progesterone-induced proliferation of breast cancer cells, making it a candidate for reducing breast cancer risk, especially in individuals with BRCA1/2 mutations¹²³. Lorlatinib inhibits ALK and ROS1 kinases, which are involved in cancer cell growth and survival. It is effective against multiple ALK mutations that confer resistance to first- and second-generation ALK inhibitors¹²⁴. Piflufolastat F-18 binds to the prostate-specific membrane antigen (PSMA), a protein overexpressed on the surface of most prostate cancer cells. Once bound, the radioactive tracer emits positrons detected by a PET scanner, revealing the location of PSMA-positive lesions in the body¹²⁵. Pyrvinium pamoate is an androgen receptor antagonist that targets multiple cellular pathways. It disrupts mitochondrial function by inhibiting electron transport chain complexes I and II, reducing mitochondrial fitness and increasing glycolysis, especially under hypoglycemic conditions often found in tumors. It also reduces WNT and Hedgehog signaling pathways, crucial for cancer cell proliferation and survival^{126,127,128,129}. Lastly, tepotinib hydrochloride is a tyrosine kinase inhibitor targeting the MET receptor. By inhibiting this receptor, it interferes with cancer cell growth and survival pathways, which are crucial for the proliferation and metastasis of MET-altered cancer cells.

The last screening for interactions was conducted for the HRAS protein (P01112) using 217,776 molecules from the HMDB (see all affinities in the Supplementary Table 20 and the GitHub file titled Supplimentary_affinities_hmdb_HRAS-P01112). Among the best potential interactions between HRAS and metabolites, the following were identified: cyanidin 5-O-beta-d-glucoside (HMDB0304305), chlorophyll (HMDB0303604), delphinidin 3-(3″-p-coumaroylglucoside) (HMDB0030099), cis-neoxanthin (HMDB0302969), verteporfin (HMDB0014603), pinotin A (HMDB0029240), benztropine (HMDB0014390), adapalene (HMDB0014355), inulin (HMDB0014776), and ceftriaxone (HMDB0015343). Future studies involving molecular docking, molecular dynamics, or other AI-based interaction prediction models will be needed to further confirm these interactions.

Conclusions

This study presents an innovative machine learning-based method for predicting druggable proteins that drive cancer, utilizing three sets of protein sequence descriptors: amino acid composition, di-amino acid composition, and tri-amino acid composition. These descriptors, chosen for their ability to capture essential information about protein sequences, have demonstrated robust performance across various machine learning classifiers.

Our results emphasize the effectiveness of these descriptors in balancing model performance, computational efficiency, and biological relevance. Specifically, the use of SVM classifiers with 200 TC descriptors selected by LinearSVC achieved high predictive accuracy. The model's robustness was supported by high AUROC values, with the best performance reaching an AUROC of 0.992 using a linear SVM with 8,000 pure TC descriptors.

The practical utility of this model was demonstrated by predicting the druggability of 2,339 cancer-driving proteins, with 88.9% predicted to have druggable activity. Validation using ChEMBL evidence scores confirmed the model's accuracy in differentiating druggable from hard-to-drug proteins, highlighting its potential in drug discovery and therapeutic development.

Additionally, integrating multi-omics analyses and chemistry-based scores identified 23 key druggable cancer-driving proteins, prioritized based on their involvement in critical cancer-related pathways and ligandability. Analyzing these proteins and their interaction with clinically relevant drugs provides valuable insights for developing targeted cancer therapies.

While our study demonstrates the model's capabilities, it also acknowledges limitations, such as the challenge of validating predictions with external datasets due to limited data on druggable proteins. Nonetheless, the drug repurposing analysis identified high-affinity interactions between the 23 key druggable cancer-driving proteins and 11 clinically relevant FDA-approved drugs. Future research should aim to enhance model validation using artificial intelligence and docking tools to confirm predicted interactions with current drugs or new ligands, facilitating the translation of repurposed drugs into clinical trials.

In summary, this study provides a comprehensive framework for predicting druggable cancer-driving proteins, combining computational efficiency with biological relevance. The integration of machine learning, multi-omics analyses, and chemistry-based assessments paves the way for identifying and prioritizing new therapeutic targets, advancing precision oncology and personalized medicine.

Data availability

All data generated or analyzed during this study are included in this published article (and its Supplementary Information files), and the scripts are available as a free repository at https://github.com/muntisa/machine-learning-for-druggable-proteins.

References

Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Barbarino, J. M., Whirl-Carrillo, M., Altman, R. B. & Klein, T. E. PharmGKB: A worldwide resource for pharmacogenomic information. Wiley Interdiscip. Rev. Syst. Biol. Med. 10, e1417 (2018).
Venter, J. C., Smith, H. O. & Adams, M. D. The sequence of the human genome. Clin. Chem. 61, 1207–1208 (2015).
Uhlén, M. et al. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
Kandoi, G., Acencio, M. L. & Lemke, N. Prediction of druggable proteins using machine learning and systems biology: A mini-review. Front. Physiol. 6, 366 (2015).
Brown, D. & Superti-Furga, G. Rediscovering the sweet spot in drug discovery. Drug Discov. Today 8, 1067–1077 (2003).
Cheng, A. C. et al. Structure-based maximal affinity model predicts small-molecule druggability. Nat. Biotechnol. 25, 71–75 (2007).
Gashaw, I., Ellinghaus, P., Sommer, A. & Asadullah, K. What makes a good drug target?. Drug Discov. Today 17(Suppl), S24-30 (2012).
Oprea, T. I. et al. Unexplored therapeutic opportunities in the human genome. Nat. Rev. Drug Discov. 17, 317–332 (2018).
Wishart, D. S. et al. DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006).
Wishart, D. S. et al. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906 (2008).
Guerrero, S. et al. Analysis of racial/ethnic representation in select basic and applied cancer research studies. Sci. Rep. 8, 13978 (2018).
García-Cárdenas, J. M. et al. Toward equitable precision oncology: Monitoring racial and ethnic inclusion in genomics and clinical trials. JCO Precis. Oncol. 8, e2300398 (2024).
Costa, P. R., Acencio, M. L. & Lemke, N. A machine learning approach for genome-wide prediction of morbid and druggable human genes based on systems-level data. BMC Genomics 11(Suppl 5), S9 (2010).
Blanco, J. L., Porto-Pazos, A. B., Pazos, A. & Fernandez-Lozano, C. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci. Rep. 8, 15688 (2018).
Wei, L., Zhou, C., Chen, H., Song, J. & Su, R. ACPred-FL: A sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34, 4007–4016 (2018).
Martínez-Arzate, S. G. et al. PTML model for proteome mining of B-cell epitopes and theoretical-experimental study of Bm86 Protein sequences from Colima, Mexico. J. Proteome Res. 16, 4093–4103 (2017).
Fernandez-Lozano, C. et al. Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models. J. Theor. Biol. 384, 50–58 (2015).
Munteanu, C. R. et al. LECTINPred: Web server that uses complex networks of protein structure for prediction of lectins with potential use as cancer biomarkers or in parasite vaccine design. Mol. Inform. 33, 276–285 (2014).
Fernández-Blanco, E., Aguiar-Pulido, V., Munteanu, C. R. & Dorado, J. Random Forest classification based on star graph topological indices for antioxidant proteins. J. Theor. Biol. 317, 331–337 (2013).
Zhu, M. et al. The analysis of the drug-targets based on the topological properties in the human protein-protein interaction network. J. Drug Target. 17, 524–532 (2009).
Jeon, J. et al. A systematic approach to identify novel cancer drug targets using machine learning, inhibitor design and high-throughput screening. Genome Med. 6, 57 (2014).
Li, Z.-C. et al. Large-scale identification of potential drug targets based on the topological features of human protein-protein interaction network. Anal. Chim. Acta 871, 18–27 (2015).
Laenen, G., Thorrez, L., Börnigen, D. & Moreau, Y. Finding the targets of a drug by integration of gene expression data with a protein interaction network. Mol. Biosyst. 9, 1676–1685 (2013).
Emig, D. et al. Drug target prediction and repositioning using an integrated network-based approach. PLoS ONE 8, e60618 (2013).
Yao, L. & Rzhetsky, A. Quantitative systems-level determinants of human genes targeted by successful drugs. Genome Res. 18, 206–213 (2008).
Yildirim, M. A., Goh, K.-I., Cusick, M. E., Barabási, A.-L. & Vidal, M. Drug-target network. Nat. Biotechnol. 25, 1119–1126 (2007).
Cao, D.-S., Xiao, N., Xu, Q.-S. & Chen, A. F. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics 31, 279–281 (2015).
Hao, J. & Ho, T. K. Machine learning made easy: A review of Scikit-learn package in python programming language. J. Educ. Behav. Stat. 44, 348–361 (2019).
López-Cortés, A. et al. Prediction of breast cancer proteins involved in immunotherapy, metastasis, and RNA-binding using molecular descriptors and artificial neural networks. Sci. Rep. 10, 8515 (2020).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, 21–27 (1967).
Mika, S., Ratsch, G., Weston, J., Scholkopf, B. & Mullers, K. R. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468) 41–48 https://doi.org/10.1109/NNSP.1999.788121 (IEEE, 1999).
Patle, A. & Chouhan, D. S. SVM kernel functions for classification. In 2013 International Conference on Advances in Technology and Engineering (ICATE) 1–9 https://doi.org/10.1109/ICAdTE.2013.6524743 (IEEE, 2013).
Peduzzi, P., Concato, J., Kemper, E., Holford, T. R. & Feinstein, A. R. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49, 1373–1379 (1996).
White, B. W. & Rosenblatt, F. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Am. J. Psychol. 76, 705 (1963).
Swain, P. H. & Hauska, H. The decision tree classifier: Design and potential. IEEE Trans. Geosci. Electron. 15, 142–147 (1977).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/a:1010933404324.
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16 785–794 https://doi.org/10.1145/2939672.2939785 (ACM Press, 2016).
Friedman, J. H. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378 (2002).
Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform. Theory 14, 55–63 (1968).
Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
Jolliffe, I. Principal component analysis. In Encyclopedia of Statistics in Behavioral Science (eds Everitt, B. S. & Howell, D. C.) (Wiley, 2005). https://doi.org/10.1002/0470013192.bsa501.
Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).
Corsello, S. M. et al. The Drug Repurposing Hub: A next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Tonks, N. K. Protein tyrosine phosphatases: From genes, to function, to disease. Nat. Rev. Mol. Cell Biol. 7, 833–846 (2006).
Brautigan, D. L. Protein Ser/Thr phosphatases—The ugly ducklings of cell signalling. FEBS J. 280, 324–345 (2013).
Fahs, S., Lujan, P. & Köhn, M. Approaches to study phosphatases. ACS Chem. Biol. 11, 2944–2961 (2016).
Xie, X. et al. Recent advances in targeting the “undruggable” proteins: From drug discovery to clinical trials. Signal Transduct. Target. Ther. 8, 335 (2023).
Repana, D. et al. The Network of Cancer Genes (NCG): A comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol. 20, 1 (2019).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. JAIR 16, 321–357 (2002).
Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30, 1145–1159 (1997).
Ochoa, D. et al. Open Targets Platform: Supporting systematic drug-target identification and prioritisation. Nucleic Acids Res. 49, D1302–D1310 (2021).
Davies, M. et al. ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Res. 43, W612–W620 (2015).
Mendez, D. et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Ghoussaini, M. et al. Open Targets Genetics: Systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 49, D1311–D1320 (2021).
Cook, C. E. et al. The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Res. 44, D20–D26 (2016).
Landrum, M. J. et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Martin, A. R. et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat. Genet. 51, 1560–1565 (2019).
Sondka, Z. et al. The COSMIC Cancer Gene Census: Describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).
Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer 20, 555–572 (2020).
Tamborero, D. et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 10, 25 (2018).
Iorio, F. et al. Pathway-based dissection of the genomic heterogeneity of cancer hallmarks’ acquisition with SLAPenrich. Sci. Rep. 8, 6713 (2018).
Fabregat, A. et al. The Reactome pathway Knowledgebase. Nucleic Acids Res. 44, D481–D487 (2016).
Smedley, D. et al. PhenoDigm: Analyzing curated annotations to associate animal models with human diseases. Database 2013, bat025 (2013).
Carvalho-Silva, D. et al. Open Targets Platform: New developments and updates two years on. Nucleic Acids Res. 47, D1056–D1065 (2019).
Iannuccelli, M. et al. CancerGeneNet: Linking driver genes to cancer hallmarks. Nucleic Acids Res. 48, D416–D421 (2020).
Lo Surdo, P. et al. SIGNOR 3.0, the SIGnaling network open resource 3.0: 2022 update. Nucleic Acids Res. 51, D631–D637 (2023).
Ryan, D. P. & Matthews, J. M. Protein-protein interactions in human disease. Curr. Opin. Struct. Biol. 15, 441–446 (2005).
Szklarczyk, D. et al. The STRING database in 2021: Customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
Ramos-Medina, M. J. et al. CardiOmics signatures reveal therapeutically actionable targets and drugs for cardiovascular diseases. Heliyon 10, e23682 (2024).
Hanahan, D. Hallmarks of cancer: New dimensions. Cancer Discov. 12, 31–46 (2022).
Mitsopoulos, C. et al. canSAR: Update to the cancer translational research and drug discovery knowledgebase. Nucleic Acids Res. 49, D1074–D1082 (2021).
UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Uhlen, M. et al. A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).
Thul, P. J. & Lindskog, C. The human protein atlas: A spatial map of the human proteome. Protein Sci. 27, 233–244 (2018).
Simon, R., Mirlacher, M. & Sauter, G. Immunohistochemical analysis of tissue microarrays. Methods Mol. Biol. 664, 113–126 (2010).
Zhang, Q. et al. Identification of potential diagnostic and prognostic biomarkers for prostate cancer. Oncol. Lett. 18, 4237–4245 (2019).
Raudvere, U. et al. g:Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 47, W191–W198 (2019).
The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021).
López-Cortés, A. et al. Identification of key proteins in the signaling crossroads between wound healing and cancer hallmark phenotypes. Sci. Rep. 11, 17245 (2021).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Muiños, F., Martínez-Jiménez, F., Pich, O., Gonzalez-Perez, A. & Lopez-Bigas, N. In silico saturation mutagenesis of cancer genes. Nature 596, 428–432 (2021).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).
Rose, T., Monti, N., Anand, N. & Shen, T. PLAPT: Protein-ligand binding affinity prediction using pretrained transformers. BioRxiv https://doi.org/10.1101/2024.02.08.575577 (2024).
Wishart, D. S. et al. HMDB 5.0: The human metabolome database for 2022. Nucleic Acids Res. 50, D622–D631 (2022).
Cunningham, M. et al. PINNED: Identifying characteristics of druggable human proteins using an interpretable neural network. J. Cheminform. 15, 64 (2023).
Wang, C. et al. Predicting drug-target interactions with electrotopological state fingerprints and amphiphilic pseudo amino acid composition. Int. J. Mol. Sci. 21, 5694 (2020).
Chu, H. & Liu, T. Comprehensive research on druggable proteins: From PSSM to pre-trained language models. Int. J. Mol. Sci. 25, 4507 (2024).
Vernone, A., Berchialla, P. & Pescarmona, G. Human protein cluster analysis using amino acid frequencies. PLoS ONE 8, e60220 (2013).
Pérez-Villa, A. et al. Integrated multi-omics analysis reveals the molecular interplay between circadian clocks and cancer pathogenesis. Sci. Rep. 13, 14198 (2023).
López-Cortés, A. et al. The close interaction between hypoxia-related proteins and metastasis in pancarcinomas. Sci. Rep. 12, 11100 (2022).
López-Cortés, A. et al. Gene prioritization, communality analysis, networking and metabolic integrated pathway to better understand breast cancer pathogenesis. Sci. Rep. 8, 16679 (2018).
Wang, Y. et al. Expedited mapping of the ligandable proteome using fully functionalized enantiomeric probe pairs. Nat. Chem. 11, 1113–1123 (2019).
Gross, S. M. et al. Analysis and modeling of cancer drug responses using cell cycle phase-specific rate effects. Nat. Commun. 14, 3450 (2023).
Cohen, P., Cross, D. & Jänne, P. A. Kinase drug discovery 20 years after imatinib: Progress and future directions. Nat. Rev. Drug Discov. 20, 551–569 (2021).
Waldman, A. D., Fritz, J. M. & Lenardo, M. J. A guide to cancer immunotherapy: From T cell basic science to clinical practice. Nat. Rev. Immunol. 20, 651–668 (2020).
Eastman, A. Activation of programmed cell death by anticancer agents: Cisplatin as a model system. Cancer Cells 2, 275–280 (1990).
Wang, L., Lankhorst, L. & Bernards, R. Exploiting senescence for the treatment of cancer. Nat. Rev. Cancer 22, 340–355 (2022).
Hanker, A. B., Sudhan, D. R. & Arteaga, C. L. Overcoming endocrine resistance in breast cancer. Cancer Cell 37, 496–513 (2020).
Varela, N. M. et al. A new insight for the identification of oncogenic variants in breast and prostate cancers in diverse human populations, with a focus on latinos. Front. Pharmacol. 12, 630658 (2021).
Yumiceba, V. et al. Oncology and pharmacogenomics insights in polycystic ovary syndrome: An integrative analysis. Front. Endocrinol. 11, 585130 (2020).
Paz-Y-Miño, C. et al. Positive association of the androgen receptor CAG repeat length polymorphism with the risk of prostate cancer. Mol. Med. Report. 14, 1791–1798 (2016).
Echeverría-Garcés, G. et al. Gastric cancer actionable genomic alterations across diverse populations worldwide and pharmacogenomics strategies based on precision oncology. Front. Pharmacol. 15, 1373007 (2024).
López-Cortés, A. et al. Pharmacogenomics, biomarker network, and allele frequencies in colorectal cancer. Pharmacogenomics J. 20, 136–158 (2020).
Salas-Hernández, A. et al. An updated examination of the perception of barriers for pharmacogenomics implementation and the usefulness of drug/gene pairs in Latin America and the Caribbean. Front. Pharmacol. 14, 1175737 (2023).
Quinones, L. A. et al. Perception of the usefulness of drug/gene pairs and barriers for pharmacogenomics in Latin America. Curr. Drug Metab. 15, 202–208 (2014).
López-Cortés, A. et al. OncoOmics approaches to reveal essential genes in breast cancer: A panoramic view from pathogenesis to precision medicine. Sci. Rep. 10, 5285 (2020).
Ocaña-Paredes, B. et al. The pharmacoepigenetic paradigm in cancer treatment. Front. Pharmacol. 15, 1381168 (2024).
Pirmohamed, M. Pharmacogenomics: Current status and future perspectives. Nat. Rev. Genet. 24, 350–362 (2023).
López-Cortés, A., Guerrero, S., Redal, M. A., Alvarado, A. T. & Quiñones, L. A. State of art of cancer pharmacogenomics in Latin American populations. Int. J. Mol. Sci. 18, 639 (2017).
Zdrazil, B. et al. The ChEMBL Database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).
López-Cortés, A. et al. In silico analyses of immune system protein interactome network, single-cell RNA sequencing of human tissues, and artificial neural networks reveal potential therapeutic targets for drug repurposing against COVID-19. Front. Pharmacol. 12, 598925 (2021).
Llaguno-Munive, M., Vazquez-Lopez, M. I., Jurado, R. & Garcia-Lopez, P. Mifepristone repurposing in treatment of high-grade gliomas. Front. Oncol. 11, 606907 (2021).
Alvarez, P. B. et al. Anticancer effects of mifepristone on human uveal melanoma cells. Cancer Cell Int. 21, 607 (2021).
Elía, A. et al. Beneficial effects of mifepristone treatment in patients with breast cancer selected by the progesterone receptor isoform ratio: Results from the MIPRA trial. Clin. Cancer Res. 29, 866–877 (2023).
Cassileth, P. A. et al. Pentostatin induces durable remissions in hairy cell leukemia. J. Clin. Oncol. 9, 243–246 (1991).
Harada, Y. et al. Anti-cancer effect of afatinib, dual inhibitor of HER2 and EGFR, on novel mutation HER2 E401G in models of patient-derived cancer. BMC Cancer 23, 77 (2023).
Htet, K. Z., Waul, M. A. & Leslie, K. S. Topical treatments for Kaposi sarcoma: A systematic review. Skin Health Dis. 2, e107 (2022).
Litton, J. K. et al. Talazoparib in patients with advanced breast cancer and a germline BRCA mutation. N. Engl. J. Med. 379, 753–763 (2018).
André, F. et al. Alpelisib for PIK3CA-mutated, hormone receptor-positive advanced breast cancer. N. Engl. J. Med. 380, 1929–1940 (2019).
Bartlett, T. E. et al. Antiprogestins reduce epigenetic field cancerization in breast tissue of young healthy women. Genome Med. 14, 64 (2022).
Kumar, A. et al. Lorlatinib in the second line and beyond for ALK positive lung cancer: Real-world data from resource-constrained settings. BJC Rep. 2, 35 (2024).
Arafa, A. T. et al. Impact of piflufolastat F-18 PSMA PET imaging on clinical decision-making in prostate cancer across disease states: A retrospective review. Prostate 83, 863–870 (2023).
Ponzini, F. M. et al. Repurposing the FDA-approved anthelmintic pyrvinium pamoate for pancreatic cancer treatment: Study protocol for a phase I clinical trial in early-stage pancreatic ductal adenocarcinoma. BMJ Open 13, e073839 (2023).
Tomitsuka, E., Kita, K. & Esumi, H. An anticancer agent, pyrvinium pamoate inhibits the NADH-fumarate reductase system—a unique mitochondrial energy metabolism in tumour microenvironments. J. Biochem. 152, 171–183 (2012).
Ishii, I., Harada, Y. & Kasahara, T. Reprofiling a classical anthelmintic, pyrvinium pamoate, as an anti-cancer drug targeting mitochondrial respiration. Front. Oncol. 2, 137 (2012).
Schultz, C. W. & Nevler, A. Pyrvinium pamoate: Past, present, and future as an anti-cancer drug. Biomedicines 10, 3249 (2022).

Acknowledgements

This work was supported by Universidad de Las Américas, Ecuador; the grant ED431C 2022/46 (Competitive Reference Groups, GRC), funded by the EU and Xunta de Galicia, Spain; and the Latin American Society of Pharmacogenomics and Personalized Medicine (SOLFAGEM).

Author information

Authors and Affiliations

Cancer Research Group (CRG), Faculty of Medicine, Universidad de Las Américas, Quito, Ecuador
Andrés López-Cortés, Paulina Echeverría-Espinoza, Micaela Pineda-Albán, Nicole Elsitdie & José Bueno-Miño
Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito, Ecuador
Alejandro Cabrera-Andrade, Yunierkis Pérez-Castillo & Eduardo Tejera
Escuela de Enfermería, Facultad de Ciencias de la Salud, Universidad de Las Américas, Quito, Ecuador
Alejandro Cabrera-Andrade
Centro de Referencia Nacional de Genómica, Secuenciación y Bioinformática, Instituto Nacional de Investigación en Salud Pública “Leopoldo Izquieta Pérez”, Quito, Ecuador
Gabriela Echeverría-Garcés
Latin American Network for the Implementation and Validation of Clinical Pharmacogenomics Guidelines (RELIVAF-CYTED), Santiago, Chile
Gabriela Echeverría-Garcés
RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, A Coruña, Spain
Carlos M. Cruz-Segundo, Julian Dorado, Alejandro Pazos & Cristian R. Munteanu
Tecnológico de Estudios Superiores de Jocotitlán, Jocotitlán, Mexico
Carlos M. Cruz-Segundo
Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), University of A Coruna, A Coruña, Spain
Julian Dorado, Alejandro Pazos & Cristian R. Munteanu
Biomedical Research Institute of A Coruna (INIBIC), University Hospital Complex of A Coruna (CHUAC), A Coruña, Spain
Alejandro Pazos & Cristian R. Munteanu
Department of Organic Chemistry II, University of the Basque Country UPV/EHU, Biscay, Spain
Humberto Gonzáles-Díaz
IKERBASQUE, Basque Foundation for Science, Biscay, Spain
Humberto Gonzáles-Díaz

Contributions

A.L.C. and C.R.M. conceived and conceptualized the study and wrote the manuscript. A.L.C., A.C.A., G.E.G., P.E.E., M.P.A., N.E., J.B.M., C.M.C.S., and C.R.M. performed data curation and prepared the supplementary data. C.R.M. and J.D. built the machine learning models. G.E.G., J.D., A.P., H.G.D., Y.P.C., and E.T. gave conceptual advice, provided scientific input, and edited the final version of the manuscript. A.L.C. and C.R.M. supervised the project. A.L.C. acquired funding. All authors reviewed and approved the manuscript.

Corresponding author

Correspondence to Andrés López-Cortés.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

López-Cortés, A., Cabrera-Andrade, A., Echeverría-Garcés, G. et al. Unraveling druggable cancer-driving proteins and targeted drugs using artificial intelligence and multi-omics analyses. Sci Rep 14, 19359 (2024). https://doi.org/10.1038/s41598-024-68565-7

Received: 04 March 2024
Accepted: 25 July 2024
Published: 21 August 2024
DOI: https://doi.org/10.1038/s41598-024-68565-7
Springer Nature Limited
