Analysis and Modelling of False Positives in GPCR Assays

G-Protein Coupled Receptors (GPCR) are involved in all the major signaling pathways. As a result, they often serve as potential targets for therapeutic drugs. In this study, we analyze publicly available assays involving different classes of GPCR to identify false positives. Using recent developments in machine learning, we then build models that can predict such compounds with high confidence. Given the ubiquity of GPCR assays, we believe such models will be helpful in flagging potential false positives for further testing.


Introduction
G-Protein Coupled Receptors (GPCR) are the largest family of cell surface receptors [1]. These plasma membrane bound receptors have evolved to recognize a variety of extracellular physical and chemical signals and, upon recognition, act as the proximal stimulus in cell signaling pathways. With approximately 800 members [2], GPCRs are involved in almost every physiological function, from sensation to growth to hormone responses. Due to their widespread physiological relevance and the presence of druggable sites, GPCRs are one of the major targets of therapeutic drugs. A 2017 study notes that 475 drugs act at 108 unique GPCRs. Approximately 321 agents are currently in clinical trials, of which ~20% target 66 potentially novel GPCR targets. GPCRs also account for ~27% of the global market share of therapeutic drugs, with aggregated sales for 2011-2015 of ~US$890 billion.
Because GPCRs are promising drug targets, assays involving a member of the GPCR family are commonly employed in high throughput screening (HTS) campaigns. A wide range of techniques and commercial kits is available, many of which are suitable for HTS [3]. In such screens, identifying false positives is a challenge. False positives may be compounds that interfere with the assay detection technology in some way, such as inhibiting luciferase in luciferase-based systems [4], or quenching fluorescence where it is the final readout [5]. There may also be compounds that are not specific to the target protein but are promiscuous, binding either a narrow or a broad class of proteins [6].
In a previous study, we developed a machine learning method to flag potential frequent hitters for luciferase assays [4]. In this study, we investigated whether that methodology can be extended to identify false positives in GPCR assays.

Data Description
Our initial goal was to explore the available data and find suitable assays for further analysis. On PUBCHEM, we identified 92 assays with more than 500 compounds for GPCR agonists and antagonists. We separated the two groups and, to narrow the scope of the study, focused on the agonists. From the list of available agonist screens we selected the 20 assays with the highest number of active compounds, because we are looking for false positives and assays with few or no positives are less relevant for that purpose. To select particular assays from this list, we focused on the GPCR subtypes as described below.
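The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Assay` record, its field names, and the toy numbers are hypothetical and are not taken from the study; only the thresholds (more than 500 compounds, agonist assays, top 20 by number of actives) come from the text.

```python
from typing import NamedTuple


class Assay(NamedTuple):
    # Hypothetical summary record for one PubChem assay.
    aid: int        # assay identifier
    mode: str       # "agonist" or "antagonist"
    n_tested: int   # number of compounds tested
    n_active: int   # number of compounds reported active


def select_assays(assays, mode="agonist", min_tested=500, top_n=20):
    """Keep assays of the given mode with more than `min_tested` compounds
    and return the `top_n` assays with the most active compounds."""
    eligible = [a for a in assays if a.mode == mode and a.n_tested > min_tested]
    eligible.sort(key=lambda a: a.n_active, reverse=True)
    return eligible[:top_n]


# Toy data (invented for illustration only).
assays = [
    Assay(1, "agonist", 1000, 50),
    Assay(2, "antagonist", 2000, 300),  # wrong mode, dropped
    Assay(3, "agonist", 400, 90),       # too few compounds, dropped
    Assay(4, "agonist", 5000, 120),
]
selected_aids = [a.aid for a in select_assays(assays, top_n=2)]
print(selected_aids)
```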

Data Collection
The GPCR family is commonly classified into five different families based on structural and sequence similarity. The families are then further organized into a family tree [7,8]. Of these five major families, the Rhodopsin class is the largest. To select assays for our analysis, we mapped the target proteins onto this family tree (Fig. 1) and selected assays whose target proteins are representative and distant from each other in the tree. This ensures that compounds that are frequently active are not preferential agonists of one GPCR subtype, but are more likely the result of an assay artifact.
Using these criteria, we chose a set of 12 assays and looked for compounds that are frequently active in these assays (see Methods section), i.e. active across the different subtypes and assay technologies, and thus frequent hitters of the Rhodopsin class of GPCR. However, only 59 out of 373,131 compounds matched our definition of being frequently active. Upon closer examination, we found that these compounds were tested only three times, and are therefore more likely to be an artifact of the selection criteria than GPCR frequent hitters or assay artifacts.
To further refine our search, we next focused on the detection technologies used in the assays. Six of the assays used fluorescence, while the other six used bioluminescence. Only 71 compounds were frequently active in the bioluminescence group. In the fluorescence group, although the number of datapoints and active compounds was very similar, 502 compounds were frequently active (Table 1). This suggests that fluorescence technology contributes many more artifacts. These 502 compounds were used for further analysis.
All data were harvested from PUBCHEM [9], either manually or by using the PUBCHEM REST API with Python. The data were stored locally in CSV format and analyzed later with Python scripts.
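As a rough illustration of this harvesting step, the sketch below builds a PubChem PUG REST address for an assay data table and parses the CSV text into per-compound outcomes. It is not the authors' script. The URL pattern and the column names `PUBCHEM_CID` and `PUBCHEM_ACTIVITY_OUTCOME` follow PubChem conventions as we understand them and should be checked against the current PubChem documentation; the helper names, the sample rows, and the rule that a compound counts as active if any of its samples is active are assumptions made for this example.

```python
import csv
import io

# Base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_csv_url(aid):
    """Build the (assumed) PUG REST URL of the CSV data table of one assay."""
    return f"{PUG_REST_BASE}/assay/aid/{aid}/CSV"


def parse_outcomes(csv_text):
    """Return {CID: True/False} from assay CSV text.

    Assumptions: rows without a numeric CID are skipped, any outcome other
    than "Active" counts as not active, and a compound is considered active
    if at least one of its samples is active.
    """
    outcomes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        cid = (row.get("PUBCHEM_CID") or "").strip()
        outcome = (row.get("PUBCHEM_ACTIVITY_OUTCOME") or "").strip()
        if not cid.isdigit() or not outcome:
            continue
        is_active = outcome.lower() == "active"
        outcomes[int(cid)] = outcomes.get(int(cid), False) or is_active
    return outcomes


# Invented sample standing in for a downloaded file (no network access here).
sample = (
    "PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME\n"
    "101,2244,Active\n"
    "102,3672,Inactive\n"
    "103,,Active\n"
    "104,2244,Inactive\n"
)
outcomes = parse_outcomes(sample)
print(assay_csv_url(1234), outcomes)
```

In practice the CSV text would be downloaded from the address returned by `assay_csv_url`, for example with `urllib.request.urlopen`, and saved locally before parsing.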

Frequent Hitter Flagging
We defined frequent hitters as compounds that were active according to our criteria in more than half of the assays in which they were tested. Additionally, each compound had to be tested in at least three different assays. Compounds satisfying both criteria were identified using a Python script and flagged as frequent hitters.
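The two criteria above translate directly into code. The following is a minimal sketch rather than the script used in the study: the function name, the `(compound, assay, is_active)` input format, and the toy records are assumptions made for illustration.

```python
from collections import defaultdict


def flag_frequent_hitters(results, min_assays=3, min_active_fraction=0.5):
    """Flag compounds tested in at least `min_assays` distinct assays and
    active in more than `min_active_fraction` of them.

    `results` is an iterable of (compound_id, assay_id, is_active) tuples.
    """
    tested = defaultdict(set)   # compound -> assays in which it was tested
    active = defaultdict(set)   # compound -> assays in which it was active
    for compound, assay, is_active in results:
        tested[compound].add(assay)
        if is_active:
            active[compound].add(assay)
    return {
        compound
        for compound, assays in tested.items()
        if len(assays) >= min_assays
        and len(active[compound]) > min_active_fraction * len(assays)
    }


# Toy records (invented for illustration only).
results = [
    ("A", 1, True), ("A", 2, True), ("A", 3, False),                   # 2 of 3: flagged
    ("B", 1, True), ("B", 2, True),                                    # only 2 assays
    ("C", 1, True), ("C", 2, False), ("C", 3, False), ("C", 4, True),  # exactly half
    ("D", 1, False), ("D", 2, False), ("D", 3, False),                 # never active
]
flagged = flag_frequent_hitters(results)
print(flagged)
```

Note that a compound active in exactly half of its assays is not flagged, because the definition requires more than half.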

Methods
The freely accessible platform On-line Chemical and Modeling Environment (OCHEM) [10] was used to build models for our data. Descriptors available in OCHEM include CDK, Dragon 6 and 7, and ISHIDA fragmentor, among others. Their detailed description can be found elsewhere [4]. Associative Neural Networks (ASNN) [11], Deep Neural Network (DNN) [12], Extreme Gradient Boost (XGBOOST) [13], and Least Squares Support Vector Machine (LSSVM) [14] algorithms were evaluated for training the models. The methods were used with default parameters as specified on the OCHEM web site.

Machine Learning
The analyzed methods were used in combination with different descriptor sets. Among the chosen algorithms, LSSVM provided the highest accuracy on average (Table 2). To build a consensus model, we selected the LSSVM models with the highest ROC-AUC scores. The consensus model had a ROC-AUC score of 0.93 with a balanced accuracy of 86%.
To test our model, we constructed an independent dataset by looking up GPCR agonist assays in PUBCHEM that we did not use for the training set. We found five relevant assays with 4323 active compounds. Our frequent hitter analysis identified 157 compounds from these five assays. Our consensus model predicted the molecules from this set with a balanced accuracy of 76% and an AUC score of 0.85. A consensus model based only on a subset of 2D descriptors provided a very similar accuracy of 75% and an AUC score of 0.85, indicating that 2D information alone is sufficient for this analysis.
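For readers unfamiliar with the reported metrics, the sketch below shows how a consensus score, balanced accuracy, and ROC-AUC can be computed. It does not reproduce OCHEM's implementation: treating the consensus as a simple average of the individual model scores, the 0.5 decision threshold, and all numbers are assumptions made for illustration.

```python
def consensus(model_scores):
    """Average the per-compound scores of several models (assumed scheme)."""
    n_models = len(model_scores)
    return [sum(scores) / n_models for scores in zip(*model_scores)]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = hitter)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores from two hypothetical models.
y_true = [1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.5, 0.8, 0.2, 0.1, 0.3]
model_b = [0.7, 0.7, 0.5, 0.4, 0.1, 0.1]

scores = consensus([model_a, model_b])
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
bac = balanced_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(bac, auc)
```

Balanced accuracy is a natural choice here because frequent hitters are a small minority of the tested compounds, so plain accuracy would be dominated by the inactive class.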

Conclusion
In this study, we analyzed GPCR assays from PUBCHEM with the aim of identifying frequent hitters. We found that fluorescence-based assays are more susceptible to false positives than bioluminescence-based assays. Compounds that were frequent hitters in fluorescence-based assays did not appear as frequent hitters in bioluminescence assays. We developed a predictive machine-learning model to identify such compounds in GPCR assays. The analysis can help in interpreting HTS results obtained with GPCR assays.
Acknowledgement. The project leading to this report has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 676434, "Big Data in Chemistry". The article reflects only the author's view and neither the European Commission nor the Research Executive Agency (REA) are responsible for any use that may be made of the information it contains. We thank Michael Withnall for English correction. The authors thank ChemAxon (http://www.chemaxon.com) for Academic license of software tools (Standardizer, ChemAxon plugins) as well as AlvaScience (http://alvascience.com), Molecular Networks GmbH (http://mn-am.com) and Chemosophia (http://chemosophia.com) for providing descriptors and Corina 2D to 3D conversion program used in this study.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.