Introduction

Structural Interaction Fingerprints (SIFts), as described by Deng et al.[1], comprise a method for encrypting protein–ligand interactions in the form of a bit string. Such a fingerprint contains repeatable portions of a fixed length, with each position encoding a defined type of interaction. Our previous implementation of SIFts [2] employed nine types of interactions: any, with backbone, with side chain, hydrophobic, charged, hydrogen bond donor and acceptor, polar and aromatic, repeated for every residue in the protein. Interaction fingerprints are protein-centric, i.e., all parts of the fingerprint are encoded with respect to the residues alone. The ligand is, in fact, treated as a homogenous object to some extent because it is not possible to retrieve any information regarding its structure from the fingerprint.

SIFt belongs to the family of interaction fingerprints (IFP) [3] comprising a number of approaches to depict the protein–ligand interactions, ranging from ligand- to target-centric approaches. Ligand-based methods encode the interaction data into the ligand descriptor, forming a three-dimensional model resembling a pharmacophore [4], reducing the ligand structure to a set of interacting fragments [5,6,7] or enriching the compound structure with interaction data [8]. Target-oriented descriptors mostly follow the initial Deng concept [1], expanding the spectrum of the detected interactions [9, 10] and providing with novel visualization methods allowing easy visual inspection [11].

Finally, there is a number of IFP utilizing hybrid approach to the description of the protein–ligand complex, e.g. in a form of atom pairs [12, 13] or Extended Connectivity Interaction Features [14].

The utility of IFP as well as availability of numerous crystal structures enabled the assembly of databases containing structures annotated with interaction fingerprints [15,16,17,18], providing means to visualize binding pockets or screen using pre-generated interaction fingerprints Combining IFPs with generative machine learning models allows for prediction of the interaction fingerprint and docking score without the need of computationally expensive calculations[19].

Interaction-based approaches have been shown to be very potent tools for post-docking analysis of Virtual Screening (VS) results [9, 10, 20,21,22,23], performing comparably or better than classical ligand-based methods, for scoring the ligand-receptor complexes [24]. For instance, IFP has shown its potential to analyze the activity cliffs [22, 25], allowing identification of novel, potent ligands. In addition, the form of the binding site description with interaction fingerprints enables its easy visualization. Moreover, the use of descriptors that are independent of the ligand structure allows the identification of structurally new compounds, which is of extreme importance for VS campaigns aimed at the discovery of new potential drugs.

Bearing in mind the advantages of the interaction-based description of a ligand-receptor complex, we wanted to enrich the algorithm of SIFt generation with more detailed information about the ligand while maintaining the notation independent of the explicit ligand structure. This process resulted in the development of the 2D-SIFt descriptor (an interaction matrix), a matrix that depicts interactions per residue, not solely with the ligand treated as an object with uniform properties but with its different pharmacophore features. Analogously to the linear SIFt descriptor, the interaction matrix results from the concatenation of submatrices of a fixed size that encode interactions with individual amino acids. The developed descriptor maintains the potential for the rapid portrayal of the binding site, allowing easy extraction of the interacting residues; however, the use of ligand pharmacophore features allows the construction of heat maps of binding sites, which facilitate the identification of interaction hotspots (residues and dominating interaction types). A collection of individual interaction matrices can be averaged into a profile, providing a generalized overview of the binding mode and, concomitantly, of the common pharmacophore features of the interacting ligands.

This paper presents the python library for generating and manipulation of the 2D-SIFt descriptor along with two case studies, demonstrating key advantages of extending classical interaction fingerprints: rapid binding site description by means of an averaged descriptor, and the identification of key residues for binding of different types of ligands. In addition, we show that the modularity and flexibility of the interaction matrix format is easy to use with generic residue numbers (here: generic residue numbers for G protein-coupled receptors [26]). The targets of choice for the study are the G protein-coupled receptors (GPCRs) [27], being a superfamily of membrane proteins of great pharmacological relevance [28]. GPCRs share a common heptahelical topology of the transmembrane domain, encouraging the application of generic numbering schemes that describe the positions of the residues within the 7TM bundle rather than subsequent sequence numbers and thus permitting comparison between different members of the protein family [26].

Implementation

The 2D-SIFt descriptor is a matrix of 7 × (9∙N) fields (six standard pharmacophore features together with a “wildcard” feature, nine types of interactions with amino acids, N —number of residues, Fig. 1A). The matrix fields can have values greater than 1 because there can be more than one separate pharmacophore feature of a given type within the ligand interacting with one residue (e.g. a number of aromatic rings surrounding a phenylalanine, Fig. 1B). The pharmacophore features used for the construction of the interaction matrix are common to every pharmacophore modeling software: hydrogen bond donor, hydrogen bond acceptor, hydrophobic group, negatively charged group, positively charged group, aromatic ring. In this research, the features were assigned by fitting the SMARTS patterns, as defined in RDKit library (http://www.rdkit.org, https://github.com/rdkit/rdkit/blob/master/Data/BaseFeatures.fdef). The collection of residual contacts consists of the previously described types: any, side chain, backbone, hydrogen bond donor, hydrogen bond acceptor, charged, hydrophobic and aromatic [2].

Fig. 1
figure 1

Scheme of construction of the 2D-SIFt descriptor. A Schematic representation of the 2D-SIFt portion depicting the interactions of one amino acid. Grayed fields show the incrementable bits. B The symbols in the column headers of the table describe the types of interactions: Any, BB with a backbone, SC interaction with a sidechain, P polar, H hydrophobic, A hydrogen bond acceptor, D hydrogen bond donor, C charged interaction, R aromatic. Rows denote standard pharmacophore features of the ligand: A hydrogen bond acceptor, D hydrogen bond donor, H hydrophobic, N negatively charged group, P positively charged group, R aromatic, vdW any atom. Each ligand-residue interaction is encoded as a 7 × 9 matrix of contacts, and the intensity of the gray corresponds to the number of features interacting with a ligand. Subsequent per-residue matrices are concatenated to form the 2D-SIFt descriptor. The structure presented is the crystal structure of a β2-adrenoreceptor receptor solved with inverse agonist, carazolol (2RH1)

In the first step of the algorithm, the pharmacophore features are assigned to the ligand using the appropriate SMARTS patterns. Then, for every amino acid in the analyzed complex, the interactions between those features and the given residue are evaluated and encrypted into the fields of the interaction matrix. The concatenation of per-residue matrices results in a 2D-SIFt interaction matrix (Fig. 1B).

The interactions are evaluated in a similar manner to one-dimensional SIFt [2]: residues are grouped into classes: aromatic, hydrophobic, polar, negatively and positively charged. For charged, hydrophobic and vdW contacts, the distance (here no greater than 3.5 Å) and the complementarity of interacting features is the basis of detection of the interaction. However, for hydrogen bonds, the maximum distance is 2.8 Å and in addition, the donor (Y-H…X, where X and Y are heavy atoms, H—hydrogen) must not exceed 120° and acceptor (X…H-Y) 90°. For π interactions, the distances are 4.4 Å, 5.5 Å and 6.6 Å for face to face, edge to face and π-cation interactions, respectively. Maximum angle between the aromatic rings in face to face interactions was of 30° and the minimum angle in edge to face was 60°. To avoid redundancy of information, a set of rules was introduced to guide the incrementation of bits in the interaction matrix for one residue.

The interaction profiles are calculated by averaging every field of the interaction matrix. However, to avoid noise and obtain a clear dominating binding mode, values lower than the established threshold (0.3 for the case studies described herein) are silenced (set to 0).

The algorithm was implemented in Python using the RDKit library (http://www.rdkit.org). Curated (cleaned and annotated with generic numbers) crystal structures of GPCRs were obtained from GPCRdb [29]. Protein Preparation Wizard was used to prepare the receptor structures (assignment of the bond order, atom types and optimization of the hydrogen bond networks).

Case study: common antagonist binding site for Class A GPCRs

As previously indicated, the generic numbering scheme for the transmembrane residues of the GPCRs allows comparison between different proteins within the family. This approach has been utilized previously [30] to estimate common binding site for the Class A GPCRs based on crystallographic data. In this case study, analysis of the binding modes was conducted based solely on the crystalline ligand-receptor complexes obtained from GPCRdb [29]. A set of 33 unique Class A targets crystalized with small molecules in the presumably inactive conformation (no G protein nor nanobody bound) was taken into the study. For multiple crystals with different ligands, the antagonists were of preference, however, in case of serotonin receptors 1B and 2B, only agonist-bound structures were available. Among crystals of the same target with antagonists, the structure of the best resolution was used. For a comparison, a similar study was performed on the set of 154 crystal structures of the Class A GPCRs co-crystalized small molecule ligand (out of 173 Class A crystals annotated in GPCRdb in June 2017).

Case study: agonist and antagonist binding modes for β2-AR

The comparison of the binding modes for antagonists and agonists was based on the interaction matrices (2D-SIFt) generated for nine complexes of β2-AR (Additional file 1: Table S2) [31,32,33,34,35,36]. To provide the consistency of the structural data, agonists binding mode was analyzed for the active structures of the β2-AR. The conformation was assumed active whenever the receptor was bound to an agonist along with a cofactor stabilizing the active state, either a G protein (3SN6) or a nanobody. In addition, the crystal structure of β2-AR with irreversible agonist was included into the study. In the case of multiple crystals with an identical ligand, a structure with higher resolution was selected for the study. An inactive receptor was defined as a structure with either an antagonist or an inverse agonist bound. Analogously, a higher resolution was preferred for multiple structures that crystalized with the same compound.

Results

Case study: common antagonist binding site for G protein-coupled receptors

Analysis of the binding sites resulted in 67 GPCRdb residue positions [26] interacting with ligands; however, a low interaction frequency was observed for most of them. Setting the contact threshold to more than 30% of the crystals resulted in 21 positions (Table 1, Additional file 1: Fig. S1) that were hotspots for ligand-receptor interactions. The residue positions forming the most frequent contacts, 3 × 32, 3 × 33, 3 × 36, 6 × 48, 6 × 51 and 7 × 38 (corresponding to 7.39 in Ballesteros-Weinstein notation [37]), occurring in over 70% structures, form previously reported [30] consensus framework for interactions with ligands across Class A GPCRs. Here, additional positions, 5 × 43, 5 × 461, 6 × 55 and 7 × 42, also appeared as ligand anchors (scoring at least 70% contact frequency). All those residue positions were also identified as Orthosteric Binding Site (OBS) [38]. Interestingly, the study also revealed the populations of residue positions belonging to the Secondary Binding Pockets (SBPs) [38], spanning TM2 (2 × 60, 2 × 63) and TM7 (7 × 34, 7 × 35 and 7 × 41) and being the extension of the OBS. Secondary binding pockets play a role in subtype selectivity, as the SBP residues are less conserved. Indeed, 2D-SIFt descriptors indicated distinct interaction profiles within secondary pockets for different receptor subtypes (e.g. opioid receptors, angiotensin receptor or muscarinic receptors—Table 1). Tabular visualization of the interacting ligand features also facilitated the selection of outliers—receptors accommodating distinct type of ligands and showing distinct interaction pattern (LPA1 receptor, 4Z35) or having unusually located binding site (e.g. P2Y1 receptor, 4XNV, binding ligand outside of the OBS).

Table 1 The most frequent interactions within the common binding site of the GPCRs crystallized with small molecule ligands (representative structures). Colors correspond to the specific interaction formed by the ligand pharmacophore features: purple, charged; yellow, hydrogen bond acceptor; blue, hydrogen bond donor; green, hydrophobic; orange, aromatic; gray, any

Similar analysis performed for the full set of available crystal structures of GPCRs bound to small molecules resulted in 19 common interaction hotspots (Additional file 1: Table S1), lacking the TM2 residues identified for representative structures. Again, the set of the residue positions the most frequently interacting with ligands matches the one elucidated for representative structures (3 × 32, 3 × 33, 3 × 36, 5 × 43, 5 × 461, 6 × 48, 6 × 51, 6 × 55, 7 × 38 and 7 × 42 found to be interacting in over 70% crystals). On the other hand, the redundancy of the interactions in the full set of crystal structures (e.g. 17 crystals of β2-AR) does not allow detecting all of the secondary binding pockets but only provides the dominating interacting positions.

List of the common interacting residues was extracted from 35 distinct small molecule (antagonist)-bound crystal structures of GPCRs. Positions that interacted with less than 30% of the structures were excluded. Residue positions were encoded using a GPCRdb numbering scheme, and the contact frequency shows the fraction of detected interactions across the investigated dataset. Colors correspond to the specific interaction formed by the ligand pharmacophore features: purple, charged; yellow, hydrogen bond acceptor; blue, hydrogen bond donor; green, hydrophobic; orange, aromatic; gray, any.

Case study: agonist and antagonist binding modes for β2-AR

The 2D-SIFt descriptors, which were constructed for the dataset of five active and five inactive complexes (Additional file 1: Table S2), revealed differences in the binding modes of the ligand types in terms of both the composition of the set of interacting amino acids and the pharmacophore features of the interacting ligands (Fig. 2A). The common set of 24 residues encompasses the antagonist binding site, with two residues that are unique for the antagonist binding site (M822.53x53 and Y1995.38x39) and five for the agonist binding site (G902.61x60, I942.65x64, K3057.30x31, I3097.36x35, W3137.40x39, Fig. 2B). However, the differences consist not only of the spatial placement of the binding sites—the set of common interacting residues, but also the prevalent types of interactions that are distinct for different types of ligand. Antagonists tend to form less specific interactions, with van der Waals and hydrophobic contacts as the dominating mode of interaction. Agonists, in contrast, often interact via hydrogen bonds and polar contacts, as shown in the differential profile (D1133.32x32, S2045.42x43, S2075.46x461 and N3127.39x38, Fig. 2A). These differences can be explained by the pharmacological role of the compounds: agonists induce a significant change in receptor conformation. This observation for the amino acid composition and specific interactions may serve as a simple filter to distinguish the role of the ligand based solely on the interaction matrix.

Fig. 2
figure 2

Differences between agonistic and antagonistic binding site for β2-AR. Depicted by the 2D-SIFt profile for agonists (red) and antagonists (blue) (A), and visualized in the crystal structure of β2-AR, 2RH1 (B). Residues in white are common to both binding modes, blue is unique to antagonist binding and red is unique to agonist binding. For clarity, the 2D-SIFt heatmaps display the per-residue interactions found in more than 30% of the investigated complexes

The analysis performed for the extended set of all available agonist and antagonist bound structures resulted in narrower set of common residue positions, omitting the TM2 contacts (Additional file 2: Fig. S2). The differences in specific interactions for the remaining positions, however, remained nearly identical.

1D representation

Although matrix representation of the 2D-SIFt descriptor improves visual inspection and analysis of the ligand-receptor interactions, converting the descriptor into bit string would allow for use with existing tools for fingerprint analysis. This would facilitate docking postprocessing or filtering the docking poses with minimal effort. For this reason, the implementation of the 2D-SIFt comes with functions to convert a 2D-SIFt matrix onto linear representation either allowing for values greater than 1 for interactions or keeping the descriptor binary. We have also provided simple tools for calculating both, Tanimoto and Euclidean distance between fingerprints, for binary and non-binary representation, respectively. Linear descriptor contains all incrementable bits from 2D-SIFt matrix (Fig. 1A).

Conclusions

The library presented herein for the automated generation and manipulation of the 2D-SIFt descriptor along with convenient visualization in the form of a heat map provides a powerful tool to consolidate the interaction data obtained from multiple ligand-receptor complexes. As shown in the case studies, the toolkit supports a generic numbering scheme for proteins, thus allowing the assessment of structures with differing sequences but similarities within the same numbering space. The intuitive interpretation of the visualized descriptors (heat maps) provides the opportunity to identify interaction hotspots easily, as well as to differentiate between the binding modes associated with compounds with different pharmacological roles. The visualization in form of a table allows for quick assessment of contributions of individual complexes to the interaction profile, and thus quick identification of the ligand-receptor complexes not matching the common interaction pattern. The 2D-SIFt interaction matrix, which combines both structural and pharmacophore data, supplies a hybrid structure- and ligand-based analysis. The automated approach to the interaction analysis provides convergent results with the implementation of a meticulous manual analysis, yet it is convenient to use and customizable.