1 Introduction

Drug design is a key aspect of healthcare that relies on the accurate identification of biologically active substances for protein targets (PTs) [1]. Ligands (Ls) are such biologically active substances; they modulate PTs, the functional biomolecules involved in cellular transduction, transformation, and conjugation [2], and hence govern the pharmacokinetic response of the active ligands [3]. PTs can be ion channels, receptors, enzymes, or transporter molecules to which drug-like ligands bind [2]. Detecting successful L-PT combinations, or more specifically high affinity ligands (HALs) paired with prime protein targets (PPTs), remains a challenge: new diseases are continuously emerging that demand fast-acting, highly efficacious new drugs with fewer adverse effects at an affordable cost. The present, extensively interdisciplinary study combines tools drawn from molecular biology, probabilistic mathematics, and computer science to automate the detection of HALs and PPTs from the best ligand–protein combinations, with the aim of identifying next-generation MRSA drug candidates.

MRSA (methicillin-resistant Staphylococcus aureus) is a bacterial infection that is resistant to several antibiotics, making it difficult to treat. The development of drugs powered by artificial intelligence (AI) has offered new hope in the fight against MRSA. AI has been used to identify new compounds that can attack MRSA bacteria, and such compounds have been tested in clinical trials; one such compound, LFF571, has shown promising results. AI-powered drug discovery has the potential to revolutionize the way we treat MRSA and other antibiotic-resistant infections: by using AI to identify new compounds, scientists can develop drugs that are more effective and have fewer side effects, and AI can also help to identify new drug targets, leading to the development of more targeted therapies.

The present study targets three key areas of MRSA drug design: (i) computational extraction or detection of HALs for PTs, (ii) computational extraction of PPTs for HALs, and (iii) probabilistic prediction of interactions of new PTs and Ls [4]. This work primarily focuses on identifying the top PPTs for the corresponding HALs. The novelty lies in stockpiling molecular docking data from 10 different architectures (ADFR, DOCK, Gemdock, Ledock, Plants, Psovina, Quickvina2, Smina, Vina, and Vinaxb) that independently analyze different biochemical pathways, and then combining them using machine learning, first to reduce the dimensionality of the key elements and then to regress towards probabilistic predictive models.

The study combines information from several machine learning (ML) algorithms to identify correct L and PT candidates, and combinations of the two (popularly called structure–activity relationship, SAR, or quantitative structure–activity relationship, QSAR) at the outset of a drug design [5]. In SAR, the biological activities of a compound are predicted from its structural features. SAR can also predict the combinatorial strength of a new composite compound benchmarked against a set of pre-trained compounds whose activities have already been tested. However, it is limited when it comes to L-PT interactions: SAR is unable to predict PTs if the Ls are unknown [4]. Efforts have therefore been made to solve this issue with L-PT 3-D modeling [6]. This approach is not free of its own limitations either. Firstly, L-PT 3-D modeling requires knowledge of the full 3-D protein structure, which is not always available. Secondly, it relies on an extensive chemical library and relatively heavy computation [4]. To address these issues, researchers have used a family of supervised learning algorithms, known as "proteochemometrics," that build classifiers able to predict Ls and PTs individually and jointly in combination [7]. These classifiers, including support vector machines (SVMs), regressions, artificial neural networks (ANNs), and fuzzy classifiers, have proved to be promising predictors for successful identification of drug targets [8, 9]. K-means clustering (KMC) has also been tried in several studies to discover candidate proteins and their corresponding high affinity agents, particularly in functionality mapping of candidate proteins [10]. Given that we have a phenomenological idea of the number of clusters and the cluster centers, K-means is an ideal initial choice for us; we then validate the performance of KMC with two more clustering techniques, the Gaussian mixture model (GMM) and density-based spatial clustering of applications with noise (DBSCAN).

This study automates the extraction of PPTs for a given sample of HALs, initially using data mining and data modeling (DDM), called "forward modeling" (Approach I), and then using a KMC-based "reverse modeling" approach (Approach II) to automate and validate the observations from forward modeling. We later validate the performance of KMC with GMM and DBSCAN, as mentioned. This allows for statistical estimation within the constraints of sparse data, an approach that can substantially reduce the time needed to find PPTs, substituting for rigorous laboratory experiments and thereby optimizing the resources spent on wet-lab work.

The next sections illustrate the methodology adopted, demonstration and explanation of the results, and generic implementations of the methodology in drug development studies.

2 Methodology

In this section, we first explain the composition of the DUD-E data (http://dude.docking.org/), and how approaches I and II detailed below can be used to analyze these data.

  • Approach I: DDM, called "forward modeling." The aim is to mine HALs and their corresponding PTs.

  • Approach II: K-means clustering as a machine learning (ML) technique to automate the prediction of PPTs and to validate the HAL-PT combinatorial models obtained from the experiments of Approach I (called "reverse modeling"). The performance of KMC is further validated by the GMM and DBSCAN clustering methods.

2.1 DUD-E data

Tier 1 involves docking data from the enhanced DUD-E repository (http://dude.docking.org/) using 10 popular and easily accessible (open access) docking programs: ADFR, DOCK6, Gemdock, Ledock, PLANTS, PSOVina, QuickVina2, Smina, AutoDock Vina, and VinaXB. The choice is governed by reported individual success rates, e.g., DOCK6 at 73.3% [11], AutoDock Vina at 80% [12], Gemdock at 79% [13], ADFR at 74% [14], Ledock at 75% [15], PLANTS at 72% [16], PSOVina at 63% [17], QuickVina2 at 63% [18], Smina at more than 90% [19], and VinaXB at 46% [20]. Tier 2 combines data from all 10 scores using statistical (linear and nonlinear) models belonging to four universality classes (detailed later). Tier 3 normalizes the virtual screening (VS) enhancement data from Tier 2 through a novel calibration of the individual best score (Smina in our case) against the respective probability density functions (PDFs); Tier 2 PDF points lying beyond the best individual score define the improved docking performance of the Tier 2 algorithm. As PDF data are non-dimensional, normalization is guaranteed without any information loss. A recent statistical study from our group [21], structured on the ubiquitous consensus scoring (CS) approach, has analyzed the same docking data [11,12,13,14,15,16,17,18,19,20] to outline a substantially less computationally demanding route to identifying top PPT candidates, starting from a statistical mechanics-based universality class approach. Apart from establishing improved ligand–protein docking fidelity, that study also serves as a validity benchmark for the present ML-based approach. As shown later, the ML approach compares favorably with its CS counterpart.

Each DUD-E database (DB) consists of 1040 ligands (Ls) × 29 protein targets (PTs). Of the 1040 ligands, 1000 are decoy ligands (DLs), i.e., inactive, and 40 are active ligands (ALs). "Decoys" are therefore discarded, and "actives" are considered for the study. Each L has an "affinity" towards a corresponding PT. Ligand–protein binding (LPB), or docking, occurs only when the change in the Gibbs free energy of the system is negative as the system reaches thermodynamic equilibrium at constant pressure and temperature. Therefore, negative affinities denote successful LPB/docking. As the extent of LPB/docking is determined by the magnitude of this negative energy, it can safely be suggested that the magnitude of the negative affinity determines the stability of any ligand–protein complex (LPC).

Each ligand in a DB is considered “unique,” that is, the same ligand (similar affinities to corresponding PTs) never recurs in any other DB under consideration.

A representative data matrix is shown in Table 1 below. It shows the affinity strengths (cell values) of the first 4 Ls towards the 29 PTs in ADFR. Note that the affinities are negative numbers, indicating an attractive potential. Similar AL-PT combinations are extracted for the remaining 9 DBs.

Table 1 Sample of a DB
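For concreteness, the following is a minimal sketch of how one such DB can be read and reduced to its actives. The file name, the "ligand_id" index, and the "status" column are illustrative assumptions; the actual DUD-E exports may be laid out differently.

```python
# Sketch only: the file name and column layout below are assumptions.
import pandas as pd

# Assumed layout: 1040 rows (ligands) x 29 PT columns, plus a "status"
# column marking each ligand as "active" or "decoy".
db = pd.read_csv("adfr_affinities.csv", index_col="ligand_id")

# Keep the 40 actives and drop the 1000 decoys, as done in the study.
actives = db[db["status"] == "active"].drop(columns="status")
print(actives.shape)    # expected: (40, 29)

# Negative affinities denote successful docking; the larger the magnitude,
# the more stable the ligand-protein complex (LPC).
print(actives.head(4))  # compare with Table 1
```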

2.2 Approach I: data mining and data modeling (DDM) — ‘forward modeling’

The key objective here is to extract HALs from the unlabeled cluster data and to identify the probabilistically matching PPTs for successful molecular docking, with a view to successful drug design.

2.2.1 Data mining steps (carried out for each DB)

  A) Identification of HALs and extraction of the corresponding PTs based on affinity maxima; essentially, this measures the magnitude of the HALs (a code sketch of steps A–E follows this list).

  B) Grouping of proximal HAL candidates based on Euclidean distance (ED) separation of similar or close-to-maximum affinities within each DB.

  C) Finding the Ls with the highest overall affinities by calculating the maximum of the mean affinity across all PTs and its spread (the maximum of the standard deviation across PTs). Such an L, or group of Ls, shows high affinity towards all PTs and thereby accommodates the maximum number of PTs during docking.

  D) Finding the most receptive PTs, i.e., those that can bind with the maximum number of HALs, by computing column-wise high affinities.

  E) Tabulating the percentages of HALs amongst the total ligands.
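A minimal Python sketch of steps A–E is given below, under stated assumptions: a synthetic 40 × 29 matrix of negative affinities stands in for a real DB, and the grouping in step B is approximated by a simple closeness threshold on the affinity maxima rather than a full ED computation.

```python
# Sketch of the DDM steps; the data and the 0.95 threshold are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
actives = pd.DataFrame(-rng.uniform(2, 12, size=(40, 29)),
                       index=[f"L{1001 + i}" for i in range(40)],
                       columns=[f"PT{j + 1}" for j in range(29)])
aff = actives.abs()                          # work with affinity magnitudes

# (A) HAL identification: per-ligand affinity maximum and the PT attaining it
hal_strength = aff.max(axis=1)
hal_target = aff.idxmax(axis=1)

# (B) group proximal HAL candidates (stand-in for the ED-based grouping)
proximal_hals = hal_strength[hal_strength >= 0.95 * hal_strength.max()].index

# (C) ligand with the highest overall affinity, and its spread across all PTs
mean_aff, std_aff = aff.mean(axis=1), aff.std(axis=1)
print("best overall L:", mean_aff.idxmax(), "max spread:", std_aff.max())

# (D) most receptive PTs: how many HALs point at each PT
print(hal_target.loc[proximal_hals].value_counts())

# (E) percentage of HALs amongst the total ligands
print(f"HALs: {100 * len(proximal_hals) / len(aff):.1f}%")
```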

Details of the observations are mentioned below.

2.2.2 Summary steps of DDM (DB-wise)

Table 2 shows the HALs, their respective high affinities, the corresponding PPTs, the (global) maximum of the mean affinities of the HALs, the affinity standard deviation, and the percentage contributions of HALs for the given DB. The maximum of the mean affinities identifies the L with the overall highest binding capacity, calibrated against the dispersion (i.e., standard deviation) of the affinities around the mean. The highest ranked HALs are marked in blue.

Table 2 ADFR

HAL1013, HAL1014, and HAL1017 (3/40, i.e., 7.5%) have affinities close to each other and are therefore considered effective Ls. While HAL1013 and HAL1014 show their closest affinity towards PT27, HAL1017's affinity maps to PT2. PT27 and PT2 are thus called prime PTs (PPTs). HAL1003 and HAL1006 show the highest overall affinities to bind with all PTs. The HAL–PT combinations (HPCs), means, and standard deviations for the remaining DBs are shown below in Tables 3, 4, 5, 6, 7, 8, 9, 10 and 11.

Table 3 DOCK
Table 4 Gemdock
Table 5 Ledock
Table 6 Plants
Table 7 Psovina
Table 8 Quickvina2
Table 9 Smina
Table 10 Vina
Table 11 Vinaxb

In the Plants DB, HALs show redundancies in the magnitudes of their affinities. Such redundancies are removed in the final ranking.

2.2.3 Summary of DDM

  A) HALs (blue font) with "high" affinities to the corresponding PTs are obtained DB-wise and shown in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11. The highest affinity is seen in Gemdock (HAL1001, affinity − 110.98, binding to PPT22). Owing to its high magnitude, it is an outlier; discarding it, however, would amount to a key information loss. Hence, we use a machine learning (ML) technique (KMC) that can accommodate such extremal values on scalar data.

  B) The DB providing the maximum information on HPCs is Smina (62.50%), followed by Quickvina2 (42.5%); Plants (22%) is the third rank holder.

  C) The AL with the highest overall affinity (mean affinity across its values) is 1002.

  D) The ligand with the highest overall affinity towards all 29 target proteins is 1013 (mean 77.51).

  E) The ligand with the highest accommodation across all 29 target proteins is 1006 (standard deviation 62.88).

  F) Ligand 1029 is the most versatile, as it can bind with target proteins 2 (affinity − 11), 14 (affinity − 11), and 15 (affinity − 11.33).

  G) The total number of ligands with high affinity is 76 out of 400, i.e., 19%. After a redundancy check (i.e., eliminating ligands with similar affinities), the final number of ligands with high affinity is 22 out of 400, i.e., 5%. After the redundancy check, the relative percentages of HALs against the PPTs are as follows: PPT2 (55%), PPT14 (19%), PPT15 (4%), PPT27 (18%), and PPT22 (4%). Overall, out of 29 PTs, only 5 (17%) show high receptiveness towards these ligands. This information is crucial for the DUD-E data mining and data modeling (DDM). After the redundancy check, the DBs that have contributed the most to information extraction are ADFR (rank 1), DOCK (rank 2), Gemdock (rank 3), Ledock (rank 4), Plants (rank 5), Psovina (rank 6), Quickvina2 (rank 7), and Vina (rank 8).

  H) After max–min normalization, the affinities are shown under the "Norm_Affinity" column in Table 12 below (a sketch of this normalization follows Table 12).

Table 12 Final set of PPTs obtained based on HALs
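A minimal sketch of the max–min normalization behind the "Norm_Affinity" column follows, under the assumption (suggested by our numbers) that it is applied to affinity magnitudes so that the strongest binder maps to 1 and the weakest to 0; the input values are illustrative, not Table 12 entries.

```python
# Sketch only; max-min (min-max) normalization of affinity magnitudes.
import numpy as np

def max_min_normalize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

magnitudes = np.array([110.98, 12.4, 11.33, 11.0, 9.8])  # illustrative
print(max_min_normalize(magnitudes))  # 1.0 for the strongest binder
```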

It is evident that the PPTs form three independent clusters: PPT2 (rank 1), PPT14 and PPT27 (rank 2), and PPT15 and PPT22 (rank 3). Ideally, the clustering should yield three big clusters, around 2, 14, and 27, as is attempted in the ML application below. It is important to note that the ranking of the PPTs also needs to be validated. Hence KMC, one of the most popular clustering techniques, is chosen as an efficient ML technique.

2.2.4 Dependency

The target proteins appear to be "dependent" on each other (Pearson's correlation coefficient = 0.970, p-value < 0.05), i.e., mostly linearly correlated (refer to Fig. 1).

Fig. 1 Correlation heatmap

In Fig. 1, most PTs (denoted by Ts in the figure) show positive correlations with values close to 1 (0.970), as seen in the color scale of the heatmap.

Figure 2 shows the distribution profiles of the PT affinities, tested against a normal profile.

Fig. 2 Distribution plot

Distribution: None of the target proteins has a symmetrical Gaussian distribution (Shapiro–Wilk test statistic W = 0.560, p-value < 0.05, CI = 95%).
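Both tests can be reproduced in a few lines. The sketch below uses a synthetic 400 × 29 affinity matrix in place of the real stack, so the printed statistics are illustrative only.

```python
# Sketch of the dependency (Pearson) and normality (Shapiro-Wilk) checks.
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(0)
affinities = pd.DataFrame(-rng.gamma(4.0, 2.0, size=(400, 29)),
                          columns=[f"T{j + 1}" for j in range(29)])

corr = affinities.corr(method="pearson")     # 29 x 29 matrix, as in Fig. 1
off_diag = corr.values[~np.eye(29, dtype=bool)]
print("mean off-diagonal r:", off_diag.mean())

for col in affinities.columns[:3]:           # per-PT normality test
    W, p = shapiro(affinities[col])
    print(f"{col}: W={W:.3f}, p={p:.3g}")    # low W: non-Gaussian profile
```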

Next, three clustering techniques, namely KMC, GMM, and DBSCAN, are used as efficient and popular unsupervised machine learning (ML) techniques to cross-validate the results of the above rigorous data mining exercise on HPCs, especially the number of clusters obtained and the handling of noise (outliers).

2.3 Approach II: machine learning (ML) — for ‘automation’ and ‘reverse modeling’

The objective here is to test the correctness of the manual data mining (DDM) results, accommodate the PPT outliers (PPT15 and PPT22 in Table 12), and then automate the process of predicting possible PPTs for a given set of test HALs. For this purpose, ML has been considered; specifically, KMC has been chosen as one of the most popular clustering techniques [22,23,24]. From KMC, 3 good clusters are targeted (note that, as indicated earlier, we already expected 3 clusters from the max–min normalization), in line with the same number from DDM (refer to Table 12). Good clusters are defined as ones with spherical conformation that do not overlap and have no outliers, i.e., all Ls can be accommodated within the clusters. Moreover, in this framework, KMC (an unsupervised ML method, as the DUD-E data are unlabeled) can automate the PPT prediction process with reasonable accuracy. As mentioned above, two other clustering techniques, GMM and DBSCAN, are used to validate the output of KMC.

2.3.1 KMC: the steps are given below.

Step 1: Data scaling is done for 40 × 10, i.e., 40 ALs from each of the 10 DBs under study. Hence, a 400-AL L-set is taken as the input for data wrangling/preprocessing.

Step 2: Calculating the inertia to find the initial number of clusters. Inertia is essentially the sum of squared errors (SSE) within each cluster; hence, the denser the cluster, the smaller the inertia. Because data points inside a well-formed cluster lie closest to each other, low inertia values are desirable.
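A minimal sketch of this inertia scan using scikit-learn follows, with a synthetic 400 × 29 matrix standing in for the scaled 10-DB stack; the k range and random seed are illustrative.

```python
# Sketch: scale the data, then scan inertia (SSE) over candidate k (Fig. 3).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(-rng.gamma(4.0, 2.0, size=(400, 29)))

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # choose k where the curve flattens
```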

In Fig. 3, the inertia value computed iteratively for 3 clusters comes out as 494.5, which is adequately low. However, across the first 3 cluster counts, the inertia decreases monotonically with little difference between successive values. Therefore, the initial number of clusters is set to 5, and the aim is to iterate further towards a suitable convergence in which all the data points are accommodated inside the final (reduced) number of clusters. The iterative convergence shown below assumes 5 initial clusters. Figure 4 below shows the convergence of the clusters against the number of iterations performed.

Fig. 3 Inertia values to predict correct number of clusters

Fig. 4 Iterations versus number of clusters

Iteration 0: 5 clusters (labels 0–4) and their corresponding numbers of Ls (total 400):

  Label 1: 320
  Label 2: 40
  Label 0: 31
  Label 3: 8
  Label 4: 1

Observation: One plus eight, i.e., a total of 9 data points, are considered outliers because their counts are very low in comparison to the other clusters. Hence, the aim is to accommodate these outliers within their neighboring clusters.

Iteration 96: 4 clusters:

  Label 1: 320
  Label 2: 40
  Label 0: 39
  Label 3: 1

Observation: The 8 earlier outliers have been accommodated inside the first cluster (label 0: 31 → 39). The fourth cluster still has one data point, which is considered an outlier. Our aim is to accommodate this point in a neighboring cluster to obtain compact clusters without any outlier, an accepted quality mark of any good clustering technique.

Iteration 105: 3 clusters:

  Label 1: 321 (80.25%)
  Label 2: 40 (10.00%)
  Label 0: 39 (9.75%)

Observations: The remaining data point of the fourth cluster has been successfully accommodated into cluster 2 (label 1: 320 → 321). After this iteration, no further change in the number of clusters or the corresponding Ls is found.

Summary: The 400 × 29 data matrix of "ligand affinities (LAs)" (rows) by "active protein targets (APTs)" (columns) can be partitioned efficiently into 3 distinct (non-overlapping) spherical clusters without any outlier. Therefore, the quality of the clustering is good.
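The converged 3-cluster solution can be reproduced along the following lines; again, X is a synthetic stand-in for the scaled 400 × 29 matrix, so the exact 321/40/39 split will not be recovered here.

```python
# Sketch of the final k = 3 fit: cluster sizes and centroids (cf. Table 13).
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(-rng.gamma(4.0, 2.0, size=(400, 29)))

km3 = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(X)
print(Counter(km3.labels_))           # ligand count per cluster label
print(km3.cluster_centers_.shape)     # (3, 29): one centroid per cluster
```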

Step 3: Centroid calculation (training data): these are the final centroids, as their values no longer change with further iterations (refer to Table 13):

Table 13 Centroids based on affinity magnitudes of 400 HALs corresponding to 29 PTs

Table 13 shows distinct value ranges, such as (− 2.***), (− 0.***), and (0.4***), for the three clusters (Tables 14 and 15).

Table 14 Euclidean distances of each of the 26 test Ls of test data ‘A’ from centroids of each cluster
Table 15 Mapping HALs to the corresponding PPTs — ‘reverse modeling’

Step 4 (Visualization): Figure 5 shows the 3 distinct clusters with their centroids as black dots. Here, label "0" denotes cluster 1, "1" refers to cluster 2, and "2" signifies cluster 3.

Fig. 5 Three distinct clusters with centroids obtained with KMC

It is interesting to see from the centroids thus obtained that cluster 2 contains most of the ligands (80.25%), followed by cluster 3 (10%) and cluster 1 (9.75%). This observation can be logically mapped to the findings from data mining, which predict "high" affinity ligands with a tendency to bind with APT2 (63%), APT14 (19%), and APT27 (18%). Hence, it can safely be concluded that cluster 2 is probabilistically expected to contain the maximum number of high affinity ligands (HALs) towards APT2. Cluster 3 is the next candidate, with the maximum number of HALs pointing towards APT14, while the cluster 1 data points are HALs targeted at APT27. We can conclude that, of the 29 APTs considered, these three are the prime protein targets (PPTs). This observation is validated below by going back to the original test data set (refer to Tables 16, 17 and 18). It should be borne in mind that for other L-sets, these PPTs may vary.

Table 16 Validation of relationships among HAL test data “A” and PPTs based on clusters
Table 17 Validation of relationships among HAL test data “B” and PPTs based on clusters
Table 18 Validation of relationships among HAL test data “C” and PPTs based on clusters

Step 5: Validation on test L-sets (A, B, C), i.e., the reverse modeling:

Three test L-sets of 26 Ls each have been drawn from the given 400-HAL DB: the first 26 Ls from the "tail" of the HAL test data (L-set A), a second set of 26 Ls from the "middle" portion (L-set B), and a third set of 26 Ls from the "head" (L-set C).

The Euclidean distances (EDs) of each ligand from the centroids have been computed. Each ligand is assigned to a cluster (refer to Table 14) based on its minimum distance (red colored cells) from the centroid. Using maximum affinity as the descriptor, the HALs are mapped against the original data to identify the corresponding PPTs (refer to Table 15) and then evaluated for possible PPTs within the cluster, as shown in Tables 14 through 16. Table 15 shows the corroborating HPCs obtained by clustering (refer to Table 16). A minimal sketch of this assignment step is given below.
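The sketch assumes `centroids` would come from the fitted KMC model and `X_test` from a scaled 26 × 29 test slice; both are replaced here by random stand-ins.

```python
# Sketch of the reverse-modeling assignment: ED to each centroid, then the
# nearest-centroid label per test ligand (cf. Table 14).
import numpy as np

def assign_by_min_distance(X_test, centroids):
    # distances has shape (n_test_ligands, n_clusters)
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return d, d.argmin(axis=1)

rng = np.random.default_rng(1)
centroids = rng.normal(size=(3, 29))   # stand-in for km3.cluster_centers_
X_test = rng.normal(size=(26, 29))     # stand-in for a scaled test L-set
distances, labels = assign_by_min_distance(X_test, centroids)
print(labels)                          # nearest cluster per test ligand
```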

The other HPCs obtained from test data “B” and “C” have been corroborated in a similar way.

From the above experiments, we conclude that PPT2 (average HPC share of 41.1%) is the highest ranked protein target, as most HALs show high affinity towards it. PPT2 is followed by PPT14 (average 25.46%) and then PPT15 (average 23.12%).

Prime HAL information: test set “A” comprises 26 data points (375–400) picked from the tail of the 400 total ligands; test set “B” comprises 26 data points (251–276) picked from the middle portion; and test set “C” comprises 26 data points (1–26) picked from the head.

Test set “A”: Ligand numbers 379, 380, and 381 (11%) have maximum affinity towards PPT14, and ligand 392 (4%) towards PPT15; together these constitute 15% of the set.

Test set “B”: Ligand numbers 259, 260, and 261 (11%) have maximum affinity towards PPT14.

Test set “C”: Ligand numbers 12, 14, and 17 (11%) have maximum affinity towards PPTs 27, 27, and 2, respectively.

Therefore, out of 26 × 3 = 78 test ligands, about 13% (10 ligands) are found to be HALs, which gives a clue to the percentage of HALs that can be obtained from any number of ligands, another outcome of this work. For obvious reasons, though, this percentage may vary with the dataset.

It is important to note that the above analysis cannot serve as a ranking of the HALs or the PPTs. It can only predict the key HPCs numerically; qualitative ranking requires domain expertise and in-vitro/in-vivo experimental analysis of the individual HPCs.

Below, we validate the performance of KMC using two other clustering methods, GMM and DBSCAN.

2.3.2 The GMM clustering method

Working principle: GMM assumes that all data points originate from a finite number of Gaussian distributions with unlabeled/unknown parameters. Hence, it is a probabilistic unsupervised ML model.

Observation: The clusters are plotted in the same manner as for KMC to retain visual uniformity (see Fig. 6 below). The blue data points make up cluster 1 (321 points), while clusters 2 and 3 are shown in yellow and green, containing 40 and 39 points, respectively. The centroid properties are also identical to those of KMC. It is important to note that, as with the KMC method, cluster 2 is probabilistically expected to contain the maximum number of HALs towards APT2. Cluster 3 is the next candidate, with the maximum number of HALs pointing towards APT14, while the cluster 1 data points are HALs targeted at APT27. We can conclude that, of the 29 APTs considered, these three are the prime protein targets (PPTs), as already identified by KMC.

Summary: The 400 × 29 data matrix of LAs (rows) by APTs (columns) can be partitioned efficiently into 3 distinct (non-overlapping) spherical clusters without any outlier using the GMM method (Fig. 6).

Fig. 6 Three distinct clusters with centroids obtained by the GMM method
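A minimal sketch of the GMM cross-check follows, under the same synthetic stand-in for the scaled matrix; `covariance_type="full"` is an assumption, as the study does not state the covariance structure used.

```python
# Sketch: fit a 3-component Gaussian mixture and compare component sizes
# and means with KMC's clusters and centroids.
from collections import Counter

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(-rng.gamma(4.0, 2.0, size=(400, 29)))

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)
print(Counter(gmm.predict(X)))   # component sizes (cf. 321/40/39)
print(gmm.means_.shape)          # (3, 29): means play the role of centroids
```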

2.3.3 The DBSCAN clustering method

Working principle: The algorithm groups data points on the assumption that the points inside a cluster form a contiguous region of high density separated from other clusters by contiguous regions of lower density. The algorithm can cluster high-volume datasets containing errors and noise, and in this respect it is algorithmically superior to the KMC and GMM algorithms. Moreover, DBSCAN does not require initialization of the cluster number, thus avoiding the over- and under-fitting issues of clustering, and is faster than KMC and GMM. Since KMC and GMM may cluster low-density (noise) points, we have implemented DBSCAN, alongside GMM, for cross-validation of the KMC- and GMM-based outputs.

DBSCAN requires two parameters: (i) "epsilon (eps)," the least distance between two neighboring points, and (ii) "MinPoints (Mpt)," the minimum number of data points required to construct a cluster. For our data points, "Mpt" is calculated as 2 × data dimension, i.e., 2 × 29 = 58, while "eps" is read off the distance plot (refer to Fig. 7). From the figure, the maximum curvature (the knee of the curve) occurs at 96 (refer to the y-axis), which is the "eps" used to run the algorithm.

Fig. 7 The ‘eps’ value obtained by the DBSCAN method

Observation: With the above "eps" and "Mpt" values, DBSCAN yields one cluster but leaves 80 outliers, most probably due to varying density among the data points within our high-dimensional data. Hence, we have discarded DBSCAN in this work (see Fig. 8). A minimal sketch of this run is given after Fig. 8.

Fig. 8 Number of clusters and noise for the dataset, printed on the Python 3.11.1 IDLE shell (win32, 64-bit system)
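The sketch below mirrors the DBSCAN run described above, with MinPoints = 2 × 29 = 58 and eps estimated from a sorted k-distance curve; the synthetic data and the 95th-percentile knee stand-in mean the printed cluster/noise counts are illustrative only.

```python
# Sketch: estimate eps from the sorted k-distance curve (cf. Fig. 7), then
# run DBSCAN; label -1 marks noise/outlier points (cf. Fig. 8).
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(-rng.gamma(4.0, 2.0, size=(400, 29)))

min_pts = 2 * X.shape[1]                          # 2 x 29 = 58
knn = NearestNeighbors(n_neighbors=min_pts).fit(X)
k_dist = np.sort(knn.kneighbors(X)[0][:, -1])     # distance to 58th neighbor
eps = float(k_dist[int(0.95 * len(k_dist))])      # knee stand-in for Fig. 7

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print(Counter(labels))                            # -1 entries are outliers
```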

3 Discussions

Ten years is a typical gestation time for a new drug to reach the market, with clinical trials alone taking six to seven years on average. The average cost of each successful drug is estimated at $2.6 billion [23] (http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf). Notwithstanding this expense, the failure rate of a new drug on its way to the market is around 88%; only 12% of projected drug candidates are eventually marketed as genuine drugs (http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf). Failure can stem from various causes, from a wrong choice of PTs, Ls, and their combinations at the experimental stage in the laboratory, to regulatory stringencies, and finally to adoption by healthcare workers and end users. Any successful new drug must have high efficacy, low dosing, rapid action, and few side effects; it should also reduce the morbidity load and the cost of hospitalization, and curb mortality. The key to timely and cost-effective delivery is thus the fast and accurate identification and validation of the PPTs and HALs that can efficiently combine to give a stable molecule for an effective drug. This is where intelligent, machine-learned molecular docking can make the crucial difference between success and failure, and certainly in taming cost.

This work is an attempt to detect PPTs for a given sample of HALs across 10 L-sets, each obtained from a standard docking program. The approach complements a recent benchmark [21] in which a novel statistical combination, popularly called consensus scoring (CS), was used to predict the PPTs for the same dataset. The present independent approach validates the outcomes of that work and is, in turn, itself validated by the comparison. DDM is performed on the ALs (decoys are discarded), and based on their individual receptiveness to the Ls or agents, our probabilistic model predicts PPT2 (55%), PPT14 (19%), PPT27 (18%), PPT15 (4%), and PPT22 (4%) as the most promising PTs out of the 29 choices; i.e., only 17% of the PTs show high receptiveness as prime targets for the agents.

As the DB is unlabeled, KMC has been applied as an ML technique to test the efficiency of the above DDM approach. KMC produces 3 distinct clusters. To validate the observations of DDM, the neighborhood of each ligand in all three test samples is measured from the centroids of each cluster. It is evident that PPT2 (average probability of a stable HPC of 41.1%) is the highest ranked among all, as most HALs show high affinity towards it, followed by PPT14 (rank 2, average 25.46%) and then PPT15 (rank 3, average 23.12%). The result is further validated by GMM and found to be similar to that of KMC. Thus, KMC provides a rigorous complement to the DDM approach that can be automated to generate faster and more accurate drug prediction routines. Importantly, large training samples are not necessary for this approach, as it is a "sparse classifier." The method can be applied to new sets of Ls and PTs from any DB to identify PPTs ahead of successful laboratory testing of real drugs.

3.1 Advantages of the method

  a) The algorithm does not require large-scale training (macro-supervised learning is not needed) of the DB, owing to its efficient redundancy handling around the maximum of the mean affinity.

  b) DDM and KMC-based ML complement each other in terms of accuracy and speed, thus reducing the time taken to discover the right PTs/drug candidates.

  c) The method does not require complex computations.

  d) The method can efficiently handle sparse unlabeled data containing noise.

  e) The method is also cost-efficient, as it only requires moderate computation, not chemical samples.

3.2 Limitations of the work and targeted future research

  a) This work focuses only on ALs; decoys are discarded. In future, similar approaches can be applied to the decoys to validate whether these agents are genuine decoys.

  b) Other clustering techniques, such as the fuzzy C-means (FCM) clustering technique, could also be used to identify PPTs that have overlapping binding features with Ls.