1 Introduction

G-protein coupled receptors (GPCRs) are the largest family of cell surface receptors [1]. These plasma membrane bound receptors have evolved to recognize a variety of extracellular physical and chemical signals and, upon recognition, act as the proximal stimulus in cell signaling pathways. With ~800 members [2], GPCRs are involved in almost every physiological function, from sensation to growth to hormone responses. Owing to their widespread physiological relevance and the presence of druggable sites, GPCRs are one of the major targets of therapeutic drugs. A 2017 study notes that 475 drugs act at 108 unique GPCRs. Approximately 321 agents are currently in clinical trials, of which ~20% target 66 potentially novel GPCR targets. GPCRs also account for ~27% of the global market share of therapeutic drugs, with aggregated sales for 2011–2015 of ~US$890 billion.

Because GPCRs are promising drug targets, assays involving members of the GPCR family are commonly employed in high throughput screening (HTS) campaigns. A plethora of techniques and a wide range of commercial kits are available, many of which are suitable for HTS [3]. In such screens, identifying false positives is a challenge. False positives may be compounds that interfere with the assay detection technology, for example by inhibiting luciferase in luciferase-based systems [4] or by quenching fluorescence where it is the final readout [5]. They may also be compounds that are not specific to the target protein but are promiscuous across a narrow or broad class of proteins [6].

In a previous study, we developed a machine learning method to flag potential frequent hitters for luciferase assays [4]. In this study, we investigated whether that methodology can be extended to identify false positives for GPCR assays.

2 Data

2.1 Data Description

Our initial goal was to explore the available data and find suitable assays for further analysis. On PUBCHEM, we identified 92 GPCR agonist and antagonist assays with more than 500 compounds each. We separated the two groups and, to narrow the scope of the study, focused on the agonists. From the list of available agonist screens, we selected the 20 assays with the highest number of active compounds: because we are looking for false positives, assays with few or no positives are less relevant. To select particular assays from this list, we considered the GPCR subtypes, as described below.
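The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AssaySummary` record, its field names, and the example values are hypothetical and are not taken from the study's scripts.

```python
from typing import NamedTuple


class AssaySummary(NamedTuple):
    """Hypothetical per-assay summary; field names are illustrative."""
    aid: int        # assay identifier (made-up values below)
    mode: str       # "agonist" or "antagonist"
    n_tested: int   # number of compounds tested in the assay
    n_active: int   # number of compounds reported as active


def select_assays(assays, mode="agonist", min_tested=500, top_n=20):
    """Keep assays of one mode with more than `min_tested` compounds,
    then return the `top_n` assays with the most active compounds."""
    eligible = [a for a in assays if a.mode == mode and a.n_tested > min_tested]
    eligible.sort(key=lambda a: a.n_active, reverse=True)
    return eligible[:top_n]


# Made-up example data, not real PUBCHEM assays.
example = [
    AssaySummary(1, "agonist", 1200, 40),
    AssaySummary(2, "antagonist", 5000, 300),
    AssaySummary(3, "agonist", 450, 90),
    AssaySummary(4, "agonist", 8000, 150),
]
selected = select_assays(example, top_n=2)
print([a.aid for a in selected])
```

In this toy example, assay 2 is dropped as an antagonist screen and assay 3 because it tested fewer than 500 compounds; the remaining assays are ranked by their number of actives.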

2.2 Data Collection

GPCRs are commonly classified into five families based on their structural and sequence similarity. The families are then further organized into a family tree [7, 8]. Of these five major families, the Rhodopsin class is the largest. To select assays for our analysis, we mapped the target proteins onto this family tree (Fig. 1) and selected assays whose target proteins form a representative set of proteins distant from each other in the tree. This makes it more likely that compounds that are frequently active are the result of assay artifacts rather than preferential agonists of a particular GPCR subtype.

Fig. 1. GPCR family tree with dots marking the protein targets of the identified assays. The colored part of the tree represents the Rhodopsin class of the GPCR family, and the various subfamilies of the Rhodopsin class are marked with different colors. (Color figure online)

Using these criteria, we chose a set of 12 assays and looked for compounds that are frequently active in these assays (see Section 2.3), i.e., compounds active across the different subtypes and assay technologies and thus frequent hitters of the Rhodopsin class of GPCRs. However, only 59 out of 373,131 compounds matched our definition of being frequently active. Upon closer examination, we found that these compounds had been tested only three times and are therefore more likely to be an artifact of the selection criteria than GPCR frequent hitters or assay artifacts.

To further refine our search, we next focused on the detection technologies used in the assays. Half of the assays (six) used fluorescence, while the other six used bioluminescence. Only 71 compounds were frequently active in the bioluminescence group. In the fluorescence group, although the numbers of datapoints and active compounds were very similar, 502 compounds were frequently active (Table 1). This suggests that fluorescence technology contributes many more artifacts; these 502 compounds were used for further analysis.

Table 1. Statistics of compounds for the datasets used in the study.

All data were harvested from PUBCHEM [9], either manually or via the PUBCHEM REST API with Python. The data were stored locally in CSV format and later analyzed with Python scripts.
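As a rough illustration of this kind of data collection, the sketch below builds a PUG REST URL for an assay data table and parses activity outcomes from CSV text. It is not the script used in the study. The URL pattern follows the public PUG REST documentation as we understand it, and the column names `PUBCHEM_CID` and `PUBCHEM_ACTIVITY_OUTCOME`, as well as the outcome label "Active", are assumptions that should be checked against an actual download; all function names are hypothetical.

```python
import csv
import io
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_csv_url(aid: int) -> str:
    """Build the (assumed) PUG REST URL for an assay data table in CSV."""
    return f"{PUG_REST}/assay/aid/{aid}/CSV"


def download_assay_csv(aid: int, path: str) -> None:
    """Download one assay table and store it locally (requires network)."""
    with urllib.request.urlopen(assay_csv_url(aid), timeout=60) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


def read_outcomes(csv_text,
                  cid_column="PUBCHEM_CID",
                  outcome_column="PUBCHEM_ACTIVITY_OUTCOME"):
    """Return (cid, is_active) pairs; rows without a numeric CID are skipped.

    Column names and the "Active" label are assumptions about the file layout.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cid = (row.get(cid_column) or "").strip()
        outcome = (row.get(outcome_column) or "").strip()
        if not cid.isdigit():
            continue
        rows.append((int(cid), outcome.lower() == "active"))
    return rows


# Tiny made-up table used instead of a live download.
sample = "PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME\n2244,Active\n,Inactive\n3672,Inactive\n"
print(read_outcomes(sample))
```

Keeping the download and the parsing in separate functions means the parsing step can be run on locally stored CSV files without network access.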

2.3 Frequent Hitter Flagging

We defined frequent hitters as compounds that were active, according to our criteria, in more than half of the assays in which they were tested. Additionally, each compound had to be tested in at least three different assays. Compounds satisfying both criteria were identified using a Python script and flagged as frequent hitters.
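The two criteria above can be expressed compactly in Python. The following is a minimal sketch rather than the original script: the input format (tuples of compound ID, assay ID, and an activity flag) and the function name are hypothetical, and treating a compound as active in an assay if any of its records in that assay is active is an assumption about how duplicate measurements are handled.

```python
from collections import defaultdict


def flag_frequent_hitters(records, min_assays=3):
    """Return the set of compound IDs flagged as frequent hitters.

    `records` is an iterable of (compound_id, assay_id, is_active) tuples.
    A compound is flagged if it was tested in at least `min_assays`
    different assays and was active in more than half of them.
    """
    outcomes = defaultdict(dict)
    for cid, aid, is_active in records:
        # Assumption: any active record makes the compound active in that assay.
        outcomes[cid][aid] = outcomes[cid].get(aid, False) or bool(is_active)

    flagged = set()
    for cid, per_assay in outcomes.items():
        n_tested = len(per_assay)
        n_active = sum(per_assay.values())
        if n_tested >= min_assays and n_active > n_tested / 2:
            flagged.add(cid)
    return flagged


# Made-up records for three compounds.
records = [
    (1, 101, True), (1, 102, True), (1, 103, False),    # 2 of 3 active
    (2, 101, True), (2, 102, True),                     # tested only twice
    (3, 101, True), (3, 102, False),
    (3, 103, False), (3, 104, True),                    # exactly half active
]
print(flag_frequent_hitters(records))
```

Only compound 1 is flagged here: compound 2 fails the minimum of three assays, and compound 3 is active in exactly half of its assays, which does not satisfy "more than half".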

3 Methods

The freely accessible On-line Chemical and Modeling Environment (OCHEM) platform [10] was used to build models for our data. Descriptors available in OCHEM include CDK, Dragon 6 and 7, and ISIDA Fragmentor, among others. Their detailed description can be found elsewhere [4]. Associative Neural Networks (ASNN) [11], Deep Neural Network (DNN) [12], Extreme Gradient Boost (XGBOOST) [13], and Least Squares Support Vector Machine (LSSVM) [14] algorithms were analyzed for training the models. The methods were used with the default parameters specified on the OCHEM web site.

4 Results and Discussion

4.1 Machine Learning

The analyzed methods were used in combination with different descriptor sets. On average, LSSVM provided the highest accuracy among the chosen algorithms (Table 2). We selected the LSSVM models with the highest ROC-AUC scores to build a consensus model. The consensus model had a ROC-AUC score of 0.93 and a balanced accuracy of 86%.
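To illustrate the idea of a consensus model, the sketch below combines the outputs of several models by taking an unweighted mean of their per-compound scores. This is an assumption made for illustration: the exact combination rule applied by OCHEM is not described here, and the function name, the 0.5 threshold, and the example scores are hypothetical.

```python
def consensus_scores(model_scores):
    """Average per-compound scores across models (unweighted mean).

    `model_scores` is a list with one score list per model; all lists
    must cover the same compounds in the same order.
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    n_compounds = len(model_scores[0])
    if any(len(scores) != n_compounds for scores in model_scores):
        raise ValueError("all models must score the same compounds")
    n_models = len(model_scores)
    return [
        sum(scores[i] for scores in model_scores) / n_models
        for i in range(n_compounds)
    ]


# Made-up scores from three models for three compounds.
scores = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.0, 0.9],
]
averaged = consensus_scores(scores)
labels = [s >= 0.5 for s in averaged]  # hypothetical decision threshold
print(averaged, labels)
```

Averaging tends to smooth out the errors of individual models, which is one common motivation for building a consensus from models trained on different descriptor sets.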

Table 2. The performance of models built using the GPCR dataset. The ROC-AUC scores are calculated using 5-fold stratified cross-validation. Models marked with an asterisk were used to build the consensus model.

To test our model, we constructed an independent dataset from GPCR agonist assays in PUBCHEM that were not used for the training set. We found five relevant assays with 4323 active compounds. Our frequent hitter analysis identified 157 compounds from these five assays. Our consensus model predicted the molecules from this set with a balanced accuracy of 76% and an AUC score of 0.85. A consensus model based only on a subset of 2D descriptors provided a very similar accuracy of 75% and an AUC score of 0.85, indicating that 2D information alone is largely sufficient for this analysis.
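Because frequent hitters are a small minority of the screened compounds, plain accuracy would be dominated by the inactive class, which is one reason balanced accuracy and ROC-AUC are informative here. The sketch below shows how both metrics can be computed from labels and scores. It is a plain-Python illustration with made-up data, not the OCHEM implementation; the function names and the 0.5 threshold are assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels."""
    n_pos = sum(1 for t in y_true if t)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / n_pos + tn / n_neg)


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive compound receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = frequent hitter) and model scores.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # hypothetical threshold

print(balanced_accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The pairwise ROC-AUC computation is quadratic in the number of compounds and is meant only to make the definition explicit; for datasets of the size discussed here, a rank-based implementation from an established library would be preferable.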

5 Conclusion

In this study, we analyzed GPCR assays from PUBCHEM with the aim of identifying frequent hitters. We found that fluorescence-based assays are more susceptible to false positives than bioluminescence-based assays. Compounds that were frequent hitters in fluorescence-based assays did not appear as frequent hitters in bioluminescence assays. We developed a predictive machine-learning model to identify such compounds for GPCR assays. The provided analysis can help to interpret HTS results obtained with GPCR assays.