2.1 Data Description
Our initial goal was to explore the available data and identify assays suitable for further analysis. On PUBCHEM, we identified 92 assays with more than 500 compounds for GPCR agonists and antagonists. We separated the two groups and focused on the agonist assays, solely to narrow the scope of the study. From the available agonist screens, we selected the 20 assays with the highest number of active compounds: because we are looking for false positives, assays with few or no actives are less relevant to us. To select particular assays from this list, we focused on the GPCR subtypes as described below.
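The selection step described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the record fields `aid`, `n_tested`, and `n_active` and the function name `select_assays` are hypothetical, and the thresholds simply mirror the numbers given in the text (more than 500 compounds, top 20 by active count).

```python
from typing import Dict, List


def select_assays(assays: List[Dict[str, int]],
                  min_compounds: int = 500,
                  top_n: int = 20) -> List[int]:
    """Return the IDs of the top_n assays with the most active compounds.

    Each record is assumed to carry three hypothetical fields:
    'aid' (assay ID), 'n_tested' and 'n_active' (compound counts).
    """
    # Keep only assays that screened more than min_compounds compounds.
    large_enough = [a for a in assays if a["n_tested"] > min_compounds]
    # Rank by number of actives, descending; break ties by assay ID
    # so that the result is deterministic.
    ranked = sorted(large_enough, key=lambda a: (-a["n_active"], a["aid"]))
    return [a["aid"] for a in ranked[:top_n]]
```

Breaking ties by assay ID is an arbitrary choice made here only to keep the ranking reproducible.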
2.2 Data Collection
GPCRs are commonly classified into five families based on their structural and sequence similarity. The families are then further classified into a family tree [7, 8]. Of these five major families, the Rhodopsin class is the largest. To select assays for our analysis, we mapped the target proteins onto this family tree (Fig. 1) and selected assays covering a set of representative proteins that are distant from each other in the tree. This ensures that compounds that are frequently active are not preferential agonists of one GPCR subtype but are more likely the result of an assay artifact.
Using these criteria, we chose a set of 12 assays and looked for compounds that are frequently active in these assays (see Methods section), i.e., compounds active across the different subtypes and assay technologies and thus frequent hitters of the Rhodopsin class of GPCRs. However, only 59 of 373,131 compounds matched our definition of being frequently active. On closer examination, we found that these compounds had been tested only three times and are therefore more likely an artifact of the selection criteria than GPCR frequent hitters or assay artifacts.
To further refine our search, we next focused on the detection technologies used in the assays. Half of the assays (six) used fluorescence, while the other six used bioluminescence. Only 71 compounds were frequently active in the bioluminescence group. In the fluorescence group, although the numbers of data points and active compounds were very similar, 502 compounds were frequently active (Table 1). This suggests that fluorescence technology contributes many more artifacts; these 502 compounds were used for further analysis.
Table 1. Statistics of compounds for the datasets used in the study.
All data were harvested from PUBCHEM [9], either manually or via the PUBCHEM REST API with Python. The data were stored locally in CSV format for later analysis with Python scripts.
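As an illustration of the programmatic route, the sketch below downloads the data table of one assay as CSV through PubChem's PUG REST interface and stores it locally. It is an assumption-laden example rather than the script used in the study: the function names, the output file naming, and the timeout are hypothetical, and because PUG REST restricts how much assay data a single request may return, large assays may need to be fetched in batches.

```python
import urllib.request
from pathlib import Path

# Base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_csv_url(aid: int) -> str:
    """Build the PUG REST URL for the CSV data table of one assay (AID)."""
    return f"{PUG_REST_BASE}/assay/aid/{aid}/CSV"


def download_assay_csv(aid: int, out_dir: str = ".") -> Path:
    """Download one assay data table and save it as AID_<aid>.csv.

    The file naming scheme is a hypothetical choice for this sketch.
    """
    out_path = Path(out_dir) / f"AID_{aid}.csv"
    # Fetch the table and write the raw bytes to disk unchanged.
    with urllib.request.urlopen(assay_csv_url(aid), timeout=60) as response:
        out_path.write_bytes(response.read())
    return out_path
```

Calling `download_assay_csv(aid)` for each selected assay ID would produce one local CSV file per assay, which can then be parsed by later analysis scripts.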
2.3 Frequent Hitter Flagging
We defined frequent hitters as compounds that were active, according to our criteria, in more than half of the assays in which they were tested. Additionally, each compound had to be tested in at least three different assays. Compounds satisfying both criteria were identified using a Python script and flagged as frequent hitters.
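The two criteria can be expressed compactly in code. The following is a minimal sketch of the rule as stated above, not the original script: the input layout (tuples of compound ID, assay ID, and activity flag) and the function name are assumptions, as is the choice to count an assay as active for a compound if any of its records in that assay is active.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def flag_frequent_hitters(results: Iterable[Tuple[int, int, bool]],
                          min_assays: int = 3,
                          min_active_fraction: float = 0.5) -> Set[int]:
    """Return the IDs of compounds flagged as frequent hitters.

    results: (compound_id, assay_id, is_active) records (assumed layout).
    A compound is flagged if it was tested in at least min_assays
    distinct assays and was active in more than min_active_fraction
    of them.
    """
    tested = defaultdict(set)  # compound -> assays it was tested in
    active = defaultdict(set)  # compound -> assays it was active in
    for cid, aid, is_active in results:
        tested[cid].add(aid)
        if is_active:
            active[cid].add(aid)

    flagged = set()
    for cid, aids in tested.items():
        n_tested = len(aids)
        n_active = len(active.get(cid, set()))
        # Strictly more than half, as in the definition above.
        if n_tested >= min_assays and n_active > min_active_fraction * n_tested:
            flagged.add(cid)
    return flagged
```

Note that a compound active in exactly half of its assays (for example, two of four) is not flagged, because the definition requires more than half.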