Lay abstract

Every cell in our body is an immensely powerful computational device capable of integrating vast amounts of data from intrinsic and extrinsic cues and responding with remarkable fidelity. What underlies this computational power is not a set of static wires, but dynamic interactions that leverage the finite number of genes to generate an almost infinite number of combinatorial interactions between protein components. In the post-genomics era, mapping these interactions represents the next frontier. The sum total of all permitted interactions is referred to as the potential interactome. In any given cell, only a subset of potential interactions will be enabled and this defines the selective differences in signalling between tissues. Understanding the whole provides insight into the information processing power of the system and may suggest new avenues for therapeutic intervention to treat diseases caused by faults in signal processing mechanisms. This study outlines the potential interactome for initial signalling events from the insulin receptor, insulin-like growth factor receptor and all four members of the fibroblast growth factor receptor family. These systems are essential for human development and dysfunctional signalling has been implicated in a wide range of human diseases including diabetes, many cancers, Alzheimer's disease, many developmental disorders and even aging. Binary connections are reported between 50 SH2 domain-containing proteins and 192 phosphopeptide nodes on 13 signal-initiating proteins. This work verifies almost every interaction described in the past 25 years and adds extensive new data, providing a step towards fathoming the intricacies of differential cell communication between various tissues and disease states.

Introduction

Signaling immediately downstream of receptor tyrosine kinases (RTKs) is accomplished in large part by the recruitment of phosphotyrosine (pTyr) interacting proteins to sites of tyrosine phosphorylation on the activated receptors and their associated scaffold proteins [1–3]. A given RTK may contain on the order of 10–20 phosphorylatable tyrosine residues, with additional sites available on associated scaffold proteins, resulting in a large number of potential sites for recruiting binding partners. The majority of phosphotyrosine interacting proteins contain a conserved Src homology 2 (SH2) domain [4]. The SH2 domain is the classic archetype for the large family of modular protein interaction domains that serve to organize a diverse array of cellular processes [5, 6]. SH2 domains interact with phosphorylated tyrosine-containing peptide sequences [7–11] and in doing so they couple activated protein tyrosine kinases (PTKs) to intracellular pathways that regulate many aspects of cellular communication in metazoans [12, 13]. The human genome encodes 111 SH2 domain proteins [14, 15] that represent the primary mechanism for cellular signal transduction immediately downstream of PTKs. As one might expect, SH2 domain proteins play an essential role in development and have been linked to a wide array of human diseases including cancers, diabetes, and immunodeficiencies [14, 16].

Despite the importance of SH2-mediated signaling in human disease, our understanding of their interactions remains far from complete. Direct experimental measurement of binding partners has typically focused on specific interactions driven by hypotheses relating to the precise signaling events under investigation. This yields a set of high quality, but inevitably sparse data. Certain pTyr proteins and SH2 domains are extensively studied while others are more arcane. Nonetheless, the SH2-mediated interactions reported over 25 years of intensive study provide a solid foundation for validating high-throughput datasets.

SH2 domain interactions are almost always phosphorylation dependent as roughly half of the binding energy is devoted to pTyr recognition [17, 18]. Despite this, SH2 domains preserve substantial specificity for peptide ligands, recognizing residues adjacent to the pTyr, particularly those at positions +1 to +5 C-terminal to the critical pTyr [19–21]. This is achieved in part by use of complex recognition events that effectively combine the use of motifs and sub-motif modifiers [11]. Specifically, SH2 domains recognize targets not only through permissive residues adjacent to the phosphotyrosine that constitute binding motifs, but also by making use of contextual sequence information and non-permissive residues [22] to define highly selective interactions with physiological peptide ligands. The specificity of SH2 domains enables their use as tools to profile the global phosphotyrosine state of cells or tissues [23–27], without a priori knowledge of the specific target proteins or peptides. Profiling signaling using SH2 domains has direct implications for diagnosis and for guiding therapeutic decisions, as the patterns obtained can be used to classify tumors [27]. The ligand specificity of many SH2 domains has been evaluated using approaches including synthetic peptide libraries [19, 28, 29], oriented peptide libraries [20, 30] and phage display [31]. Information of this type is often described by position-specific scoring matrices (PSSM), and allows programs such as ScanSite and Scoring Matrix-Assisted Ligand Identification (SMALI) to predict potential binding motifs [20, 21].

Recruitment of SH2 domain proteins to phosphorylated sites is a dynamic process and is by no means predetermined by the phosphorylation event alone. Each tyrosine site on a scaffold (including sites on receptors that recruit SH2 domains) can be phosphorylated or unphosphorylated. The phosphorylated site can either be free or occupied by one of its potential binding partners. Each possible assembly of interaction partners on a given scaffold represents an interaction microstate [32–35]. The set of populated interaction microstates from which signaling develops is a function of many factors, including protein expression levels, local concentration, and the probability that a given site is phosphorylated. Thus, distinct signaling networks may originate from the same scaffold or receptor in different cell types. This is also true under conditions of aberrant expression of signaling components, which are a common occurrence in pathologies such as cancer. Accurate and well-annotated potential interactomes that represent the aggregate available interaction microstates are therefore a valuable resource that opens the door to interpreting studies of signaling in different cell types or under conditions of altered protein expression. As the Human Protein Atlas, which details subcellular localization and expression data, makes clear, cell lines and tissues vary widely and often in unanticipated ways in terms of protein expression [36]. All of this suggests that detailed potential interactomes may provide substantial benefit in understanding cell-type specific signaling.

Herein, we describe a potential interactome obtained using addressable peptide arrays consisting of 192 physiological peptides from the insulin (Ins), insulin-like growth factor 1 (IGF-1) and fibroblast growth factor (FGF) signaling pathways to identify interactions with 50 SH2 domains. This set represents a broad sampling of the SH2 domains extant in the human genome. The results of this study map a range of potential phosphotyrosine-dependent interactions within the FGF and Ins/IGF-1 pathways. These signaling systems have relevance to understanding complex multi-tissue pathologies such as diabetes and cancer as well as normal physiology and development. This study confirms 44 of 54 previously described interactions. In addition, we report an extensive set of novel interactions. Validation of 60 binary interaction pairs was conducted using the orthogonal method of solution binding measured by fluorescence polarization. The binding motifs obtained for each SH2 domain closely match those reported in a number of independent studies. Protein co-precipitation experiments, or endogenous phosphorylation upon receptor stimulation, were further used to validate a number of interactions. The results of this study highlight the available pool of potential SH2-mediated interactions with these 13 major signaling proteins and serve as a first step in understanding signaling microstate variations. Interactive figures and additional information may be found at http://www.sh2domain.org.

Results

Peptide arrays for SH2 interactions within the FGF/Ins/IGF-1 signaling pathways

The use of addressable peptide arrays is a reproducible and semi-quantitative approach that has been extensively validated for studying protein interactions with peptide ligands [37–39]. To investigate connections between SH2 domain proteins and their putative phosphorylated docking sites on cell surface receptors, we developed addressable arrays consisting of 192 phosphotyrosine peptides. This peptide set was assembled using 71 phosphotyrosine peptide motifs corresponding to all of the cytoplasmic tyrosine residues within the FGF receptors (FGFR1-4), insulin receptor (InsR) and IGF-1 receptor (IGF-1R) (Figure 1A). Activation of these receptors results in the phosphorylation of associated scaffold proteins, and so 75 phosphotyrosine peptides corresponding to a comprehensive list of tyrosine residues within insulin receptor substrates (IRS-1 and IRS-2) and fibroblast growth factor receptor substrates (FRS-2 and FRS-3) were included. In addition, 33 phosphotyrosine peptides were incorporated from the downstream signaling proteins PLC-γ1, p130Cas (BCAR1) and p62DOK1. Finally, a set of 12 positive control peptides corresponding to 19 reported interactions with 15 SH2 domains, for which equilibrium dissociation constant (KD) values span a range from low nM to 50 μM, was incorporated to aid in validating the results. These control peptides provide a reference and establish the empirical cut-off for designated binding interactions (Table 1). No discrimination was made against peptides on the basis of reported phosphorylation state in order to examine a diverse and unbiased set of motifs. The resulting set of 192 phosphotyrosine peptides and their corresponding position in the proteins of origin is noted in Additional file 1: Table S1. Addressable arrays were synthesized as membrane-bound 11-mer peptides using the SPOT synthesis technique [40–42]. While the majority of SH2 domains recognize residues C-terminal to the phosphotyrosine in their cognate peptide ligands, additional contacts between SH2 domains and residues N-terminal to the phosphotyrosine are observed for the SH2 domain of Sh2d1a (SAP) [43] and cannot be ruled out in other cases. Peptides were therefore synthesized with six flanking residues C-terminal to the phosphotyrosine and four residues N-terminal to the phosphotyrosine.
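
The 11-mer layout described above (four N-terminal residues, the phosphotyrosine, six C-terminal residues) can be illustrated with a short sketch. This is not the procedure used to design the arrays; the function name `tyrosine_peptides`, its parameters, and the padded toy sequence are hypothetical and serve only to show how such windows could be enumerated from a protein sequence.

```python
def tyrosine_peptides(sequence, n_flank=4, c_flank=6):
    """Return (1-based Tyr position, peptide) for every tyrosine that has
    full flanking sequence; the central tyrosine is written as 'pY'."""
    peptides = []
    for i, residue in enumerate(sequence):
        if residue != "Y":
            continue
        start, end = i - n_flank, i + c_flank + 1
        if start < 0 or end > len(sequence):
            # Assumption: tyrosines too close to a terminus are skipped
            # rather than padded.
            continue
        window = sequence[start:i] + "pY" + sequence[i + 1:end]
        peptides.append((i + 1, window))
    return peptides


# Toy sequence: an arbitrary two-residue pad on each side of one motif.
toy_sequence = "AASNQEYLDLSMPGG"
print(tyrosine_peptides(toy_sequence))
```

With the default flanks each window holds 4 + 1 + 6 = 11 residues, matching the 11-mer format of the arrays.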

Figure 1

Probing interactions between SH2 domains and physiological peptide ligands at a systems level. (A) A representation of a SPOT peptide array containing 192 phosphotyrosine peptides including control peptides (black) and peptides from the 13 proteins present on the array indicated by their represented colors. SPOT peptide arrays were incubated with 250 nM GST-SH2 domain as indicated. Interactions were detected using anti-GST antisera and Alexa-680-labeled anti-mouse secondary antibody and the intensity of signals recorded using LiCor Odyssey. (B) Neighbor-Joining Tree of all 121 SH2 domains. Highlighted in blue are the 50 SH2 domains selected across different families for this study. (C) SPOT peptide arrays are a semi-quantitative method for measuring protein domain-pTyr peptide interactions. The dissociation constants (KD) were measured for 60 interaction pairs representing interactions determined using peptide arrays as greater than 3X the mean, between 1X and 3X the mean, and less than 1X the mean. The mean KD value for each group is marked with a black line.

Table 1 Literature-confirmed interactions. 39 array-positive interactions were experimentally verified or confirm previously reported interactions, while 23 array-negative interactions empirically suggest a threshold corresponding to a KD of approximately 5 to 10 μM for this data set.

To assess the potential network of SH2 domain interactions we selected 50 SH2 domains representing 28 of the 38 families of SH2 domains (Figure 1B) all of which we have previously shown can be expressed and purified [23]. These include a number of extensively studied SH2 domains (Src, Grb2, PLCγ), as well as a number of less studied SH2 domains from proteins such as Shd, She, Shf, Slnk (Sh2d6), Sh2d1a (SAP), Sh2d1b (Eat-2), and Brdg1. To address potential variability in specificity within families we employed all members from the SHB, CRK, GRB2, SRC and ABL families (families are indicated with complete Capitalized lettering).

SH2 domains were applied to the arrays as GST fusion proteins and detected using anti-GST primary antibodies and near-infrared labeled secondary antibodies. In an effort to present a dataset with minimal false positives, we chose an empirical cutoff based on the array average across all peptide spots to classify interactions (Figure 1A). Cases in which the signal intensity for an individual SH2 domain binding event exceeded three times the mean intensity of all the peptides on the membrane were scored as “array positives” [22]. Non-binding was judged in cases where the intensity of a spot was less than the mean intensity of all spots on the membrane and these were scored as “array negatives”. Peptides with signal intensities between 1X and 3X mean were scored as “indeterminate” and ascribed as neither array-positive binding interactions nor array-negative non-binders. Analysis of the distribution of SH2 domain interactions per phosphopeptide revealed that our dataset possessed a bimodal distribution, with a significant number of peptides binding to many SH2 domains (Additional file 2: Figure S2). This signature may be indicative of promiscuity differences between phosphopeptides, or there may be a subset of peptides which interact in a nonspecific fashion with either the GST fusion tag or one of the antibodies used for detection, resulting in false positives. Consistent with our goal of reducing false positives, we probed three separate arrays with three separate preps of the GST fusion tag alone. Potentially non-specifically interacting peptides (so-called ‘sticky’ peptides) were identified as any that bound to GST with above mean intensity in two out of three separate trials. This approach identifies any peptides which interact with GST or either of the recognition antibodies, a known confounding factor for downstream analysis [44]. Excluding these ‘sticky’ peptides from the array average allows us to score as ‘binders’ many significant peptides that would otherwise have been indeterminate. This conservative approach resulted in discarding 40 peptides, representing 382 potential interaction pairs, as non-selective and yielded a dataset of substantially higher quality.
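
The scoring rules above can be summarized in a brief sketch. It is an illustration under stated assumptions rather than the analysis code used in this study: the function names, the dictionary-based data layout and the toy intensities are hypothetical, and the array mean is assumed to be recomputed after the ‘sticky’ peptides are removed.

```python
from statistics import mean


def sticky_peptides(gst_trials, min_trials=2):
    """Flag peptides whose GST-only signal is above the array mean in at
    least `min_trials` of the supplied trials (two of three in the text)."""
    counts = {}
    for trial in gst_trials:
        trial_mean = mean(trial.values())
        for peptide, intensity in trial.items():
            if intensity > trial_mean:
                counts[peptide] = counts.get(peptide, 0) + 1
    return {p for p, n in counts.items() if n >= min_trials}


def classify_array(intensities, sticky=frozenset(), positive_fold=3.0):
    """Score each non-sticky peptide against the mean of the non-sticky spots."""
    specific = {p: v for p, v in intensities.items() if p not in sticky}
    array_mean = mean(specific.values())  # assumption: sticky spots excluded
    calls = {}
    for peptide, value in specific.items():
        if value > positive_fold * array_mean:
            calls[peptide] = "array-positive"
        elif value < array_mean:
            calls[peptide] = "array-negative"
        else:
            calls[peptide] = "indeterminate"
    return calls


# Toy data (arbitrary units); peptide names are placeholders.
quiet = {"p1": 5, "p2": 5, "p3": 5, "p4": 5, "p5": 5}
gst_trials = [dict(quiet, sticky=50), dict(quiet, sticky=50), dict(quiet, sticky=5)]
sh2_array = {"p1": 200, "p2": 10, "p3": 10, "p4": 10, "p5": 60, "sticky": 500}

flagged = sticky_peptides(gst_trials)
calls = classify_array(sh2_array, sticky=flagged)
print(flagged, calls)
```

In this toy case the sticky spot would inflate the array mean enough to push `p1` below the three-fold cutoff, which is the effect the text describes when such peptides are left in the average.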

Validation by orthogonal assays and literature-verified interactions

To verify the binding results obtained from addressable peptide arrays we employed an orthogonal method of determining SH2 interactions with peptide ligands. We measured the dissociation constants of 60 binary SH2-peptide pairs in solution by fluorescence polarization (Table 2, Additional file 2: Figures S3A-C). In all cases array-positive interactions were of high affinity (range 0.18 μM to 5.8 μM, median KD = 2 μM), while array-negative interactions were of demonstrably lower affinity (median KD > 30 μM) (Figure 1C). This suggests a low false-positive rate and indicates that array-positive interactions correspond to high affinity binding events at a high frequency.

Table 2 Measured affinity values

Probing of arrays individually with each of 50 SH2 domains provides a snapshot of SH2 specificity (Figure 2A). As we have previously shown, this method is highly reproducible [22]. Independent peptide arrays and protein preparations reveal high reproducibility for the selected SH2 domains (Shb, Ship2, Sh3bp2) (Figure 2B). To confirm interactions between full-length proteins we performed a set of GST-SH2 pull-down experiments using lysates of CHO cells stably expressing InsR and IRS-1, with or without insulin stimulation (Additional file 2: Figure S4). These lysates were incubated with GST-SH2 domains and precipitated using glutathione-agarose beads to identify SH2 domains that were capable of precipitating phospho-IRS1 or phospho-InsR. This confirmed previously described interactions such as those involving the PI3K_C, Shp2_N and Fyn (as well as related Src and Itk) SH2 domains [45–47]. In addition, interactions observed on the peptide arrays were confirmed for Rasa1, Vav1, Abl2 and PLC-γ1.

Figure 2

Addressable peptide arrays reveal SH2 domain selectivity. (A) 50 SPOT arrays panned against 50 GST-SH2 domains reveal the highly selective nature of SH2 domain-phosphopeptide interactions. Interactions were detected using anti-GST antisera and Alexa Fluor-680-labeled anti-goat secondary antibody and the intensity of signals recorded using LiCor Odyssey. (B) Two separate peptide arrays were probed with independent SH2 domain preparations for three SH2 domains (SHB, SHIP2, SH3BP2). The scatter plots reveal some variability between the independent SPOT experiments but a strong correlation (R²).

The literature is a rich source of detailed interactions that provide potential validation. Since the discovery of the SH2 domain in 1986 [48], detailed study has uncovered a large set of SH2 interactions. Any high-throughput technique would be expected to capture most of these interactions, and failure to do so may be taken as evidence of false-negative results. Each of our addressable peptide arrays included a set of 12 designed control peptides for which 22 reported interactions covered a range of KD values. In addition, we noted 43 interactions with the 13 signaling proteins represented on the arrays reported in UniHI [49] from the interaction databases of MINT [50], BIND [51], HPRD [52], and DIP [53]. Of the 22 designated control interactions, 18 were noted as array-positive (Table 1). Of the remaining four expected interactions, three have measured affinities, and in all cases the equilibrium dissociation constant is weaker than 16 μM. All of the array-positive interactions for which affinity is reported have KD values stronger than 4.1 μM. Thus, this control set suggests an approximate threshold of binding in the range of 10 μM ± 5 μM. Of the 43 database-reported interactions, most were array-positive and, of those that were not, a number were just sub-threshold and judged to be indeterminate (Table 1). The ability to recapitulate the vast majority of known (literature-reported) interactions and to verify novel interactions by orthogonal methods is indicative of a high quality dataset [54].

Reconciling conflicts with other datasets

As noted above, this study performs well in terms of reproducing the literature-reported interactions between the 50 SH2 domains tested and the 13 proteins represented on the addressable arrays (Table 1). A handful of differences with literature-reported interactions must, by necessity, be reconciled. Our assumption is that a high-throughput (HTP) study such as this one should capture upwards of 85% of known (literature-reported) interactions and that results that differ from low-throughput studies described in the literature should be subject to further testing to identify the nature of the discrepancy and reveal any weakness in the HTP dataset [55]. We examined a set of potential discrepancies and found that in each case our dataset held up well. For instance, FGFR1 Y-766 (SNQEpYLDLSMP) is reported to bind to PLCγ1 in a pTyr-dependent manner based on mutational analysis of FGFR1 [55, 56]. We tested the PLCγ2 SH2 domain with an analogous peptide from FGFR3 Y-760 (STDEpYLDLSAP) and failed to detect any interaction. Direct measurement of peptide binding to either the PLCγ2_N or PLCγ2_C SH2 domain by fluorescence polarization in solution also failed to detect an interaction, supporting the results on the array (Table 2, Additional file 2: Figure S3). This may imply either that this is a binding event specific to PLCγ1 (and not PLCγ2), or that the interaction reported at the level of the full-length protein may be more complex, perhaps requiring secondary contact sites that are not available within the context of the short peptide used in the current study. In several other cases, literature-reported interactions that were array-negative turned out to be interactions with IC50 or KD values above 10 μM (Table 1). It is likely that a few low micromolar or even sub-micromolar binding events could be assigned as array-negative in our study due to synthesis yield heterogeneity and the fact that we are limited to arraying at one concentration (0.25 μM in this study). We designed a conservative empirical reporting scheme that sacrifices many true positives in order to limit the false positives that would naturally arise from trying to minimize false negatives. We have made an effort to limit false negatives to those of lower affinity, and we are aware of no instance in our dataset of a sub-micromolar affinity interaction being scored as array-negative.

Many high-affinity interactions, such as the interactions between the Src and Lck SH2 domains and p130Cas pY-664, fell into our array-indeterminate set (1X to 3X mean), likely due to the synthesis efficiency and accessibility of these particular peptides and the semi-quantitative nature of the system. Indeed, many of the peptide-SH2 interactions that fall in the indeterminate set are likely to be real binders. Some surprising differences between SH2 domains can be reconciled this way. For instance, the Abl1 and Abl2 SH2 domains differ significantly in their array-positive interactions, which is surprising considering the sequence similarity between the two domains. Because of the heterogeneities inherent in this study design, as indicated above, and the similarities between the two proteins, discrepancies of this sort likely represent false negatives. In total, the limited number of incongruities between the current data set and the literature is thus largely reconcilable.

A high-throughput binding study that reported interactions between a large set of SH2 domains and phosphopeptides within four receptor tyrosine kinases (including IGF-1R and FGFR1) overlaps with the present study [57]. Our dataset validates only 5 of 51 of these interactions and describes 6 additional interactions not reported in that study. This disagreement is in contrast to the high degree of consensus between the present study and a wide range of previous studies (Table 1). We examined a number of the interactions reported by Kaushansky et al. using a combination of an orthogonal experimental approach, comparison to consensus binding motifs, and literature validation. As noted above, SH2 domains have well described binding motifs and adhere to these remarkably well in the current study. Kaushansky et al. report a large number of interactions that do not approximate the binding motifs to which the corresponding SH2 domains are known to be capable of binding. In addition, SH2 domains make use of contextual sequence information and non-permissive residues that block binding in order to improve selectivity [22]. For example, the Grb2 family has a very strong preference for an asparagine residue at the +2 position and will not tolerate a proline residue at the +3 position [19–22]. Kaushansky et al. report a series of Grb2 interactions with peptides that do not contain the required permissive residues, and furthermore many that contain strong non-permissive residues (Additional file 2: Figure S7 and Table S3). Similarly, Crk SH2 requires a +3 Leu or Pro, yet this motif is absent in many of the Crk SH2 binding peptides reported by Kaushansky et al. Indeed, the 46 interactions reported by Kaushansky et al. that we fail to confirm overwhelmingly involve peptides that do not conform to the consensus motifs that the cognate SH2 domains are known to bind [19, 20, 29]. In addition, a number of apparent “hub” peptides reported in Kaushansky et al. contain cysteine residues (e.g. FGFR1 pY-583, FGFR1 pY-605, FGFR1 pY-730), and the interactions were probed in the absence of reducing agents [57, 58]. In the present study, binding was assayed in the presence of 1 mM DTT and cysteine residues within peptides were substituted with serine [59]. Kaushansky et al. provide no corroboration of their results by either orthogonal assay or literature validation, while the present study provides extensive corroboration.

Even in the cases where our data overlap, the apparent KD values reported by Kaushansky et al. appear inconsistent with direct measurements conducted using well controlled solution binding measured by fluorescence polarization [57]. For example, Kaushansky et al. report a KD of 175 nM for the interaction between Rasa1-N-SH2 and FGFR1 pY-463 while we measured a KD of 1.54 μM by fluorescence polarization (Additional file 2: Figure S7). Additionally, there are 6 interactions that we report that are not noted by Kaushansky et al. We picked one of these binary pairs at random, the interaction between Crk SH2 and FGFR1 pY-463, and tested binding in solution. We measured a KD of 380 nM for this interaction, validating this binding event.

Taken as a whole, comparisons with the literature validate the results presented in this study. Non-array-positive literature-reported interactions tend to fall into three categories: 1) low affinity interactions; 2) near misses that are array-indeterminate and thus just below threshold; or 3) cases where orthogonal measurement confirms no interaction at the level of the individual SH2 domain and 11-mer phosphopeptide. Comparison with an SH2 domain array study reveals limitations in that technique and suggests that SH2 domain arrays on glass substrates may suffer from a high rate of false positive and false negative interactions. This is consistent with results from the same group investigating PDZ domain binding using a similar protein microarray method, which concluded that the technique resulted in a false positive rate of approximately 50% and poor correspondence between array-estimated and solution-binding measured equilibrium dissociation values [60–62].

Metadata-rich interaction maps

Probing arrays with 50 SH2 domains identifies a total of 529 array-positive interactions, together with 5949 array-negative and 1122 indeterminate SH2-ligand pairs. Array-positive interactions between SH2 domains and pTyr sites map the potential SH2 interactome. The connections between SH2 domains and InsR, IGF-1R, IRS-1, IRS-2, FGFR1, FGFR2, FGFR3, FGFR4, FRS2 and FRS3 together with p130Cas, PLCγ1 and p62DOK1 highlight a wide range of putative SH2 interactions within the immediate FGF and Ins/IGF-1 signaling networks (Figure 3). The prediction of novel interactions comes with the inherent caveat that a given SH2 protein would need to be co-expressed with its interaction partner. For example, Grap and Gads are expressed only in certain hematopoietic cells [63, 64]. Interactions recorded for the SH2 domains of Gads and Grap are not useful for predicting interactions in other cell types but may be considered as supporting data for the interactions of the closely related Grb2 SH2 domain. The similar specificity of the SH2 domains of Grb2, Gads and Grap results in an overlapping set of target peptides where the independent binding of all three SH2 domains increases our confidence that this peptide is in fact a high-quality ligand for this class of SH2 domains.

Figure 3

High-resolution interaction maps detail the potential SH2 domain interactome. A phosphotyrosine interactome for 13 proteins involved in FGF-family and Insulin-family signaling and 50 SH2 domain partners. Phosphotyrosine peptides are indicated by their position within their host protein and color-coded as PhosphoSite-reported phosphorylation sites (yellow); sites not reported as phosphorylated (red); sites not reported to be phosphorylated but where a closely related site on a paralogous protein is known to be phosphorylated (red/yellow); or peptides discarded as non-specific (black). Interactions between SH2 domains and phosphopeptides (the vertices) identified in this study are indicated as edges (lines) and color-coded according to the level of support provided by previous studies. If the precise phosphorylation site has been reported to interact with the noted SH2 domain, the edge is shown in red. A black line indicates proteins that are reported to interact in interaction databases including HPRD, BIND, MINT and DIP, but for which the site of interaction is unknown. SH2 interactions not confirmed by literature but whose binding is greater than 3X mean on the array are represented with grey lines.

To enhance the interaction maps derived from the current study, we incorporated multiple layers of additional data gleaned from a variety of sources. Specific phosphopeptides reported in the PhosphoSite database are noted for each of the 13 target proteins in Figure 3 (Additional file 1: Table S1) [65]. Reported phosphorylation remains a moving target, particularly as certain sites may be phosphorylated only in certain tissues or transiently upon recruitment of specific kinases [33]. In cases where phosphorylation of a tyrosine residue has been reported, we assume that region to be solvent accessible and capable of interactions. If phosphorylation has not been reported, solvent accessibility may be considered as a minimal threshold for phosphorylation and SH2 domain binding. This is with the caveat that certain residues, such as the activation loop tyrosine in the kinase domain of the InsR and IGF-1R, are buried in the inactive state but become phosphorylated and solvent exposed in the activated state. The phosphorylated and exposed activation loop is then able to bind to SH2 domains [66]. Given the dynamic nature of protein structures and the ability of buried residues to become exposed upon structural rearrangement, one cannot presuppose that buried residues never become exposed. Nonetheless, solvent accessibility provides an additional level of support for potential phospho-dependent interactions in cases where phosphorylation has not been reported. Existing structures provide a greater level of confidence in such interactions while at the same time identifying potential anomalous interactions with buried peptides. The Gerstein Accessible Surface algorithm was employed to calculate the accessible molecular surface [67, 68] of each tyrosine residue within structure files PDBID:1IRK, 2DTG, 1P4O, 1K3A, 1IRS, 1QQG, 2FGI, 2PVF, 2PSQ, 1XRO, 2YS5, 2YT2, 2V76, 1WYX, 1HSQ, and 2HSP that represent regions of InsR, IGF-1R, IRS-1, FGFR1, FGFR2, FRS2, FRS3, p62DOK, p130Cas and PLCγ in various conformations (Additional file 2: Table S4). Sites that fell below the threshold of the minimally accessible phosphorylation site (excluding the activation loop tyrosine) are marked in orange text for the residue number in Figure 3. Many of these sites are also excluded as non-specific interaction sites, likely reflecting their hydrophobic nature. Inclusion of structural data, where available, makes use of a significant resource to interpret potential pTyr interaction data.

Previously reported specific SH2-phosphopeptide interactions confirmed in this study (Table 1) are highlighted as red lines (Figure 3) and represent the highest confidence interactions. Noted as black lines are cases for which protein-protein interactions have been reported in MINT [50], BIND [51], HPRD [52], and DIP [53], without reference to specific binding sites or direct involvement of an SH2 domain. Interactions noted in the current study that are not listed in any of the major interaction databases are represented as grey lines.

Position weighted matrices define physiological ligand specificity

To represent the specificity of SH2 domains in this study we define position weighted matrices (PWMs) based on the array-positive peptides. PWMs such as the position-specific scoring matrix (PSSM) [21] are a well-established method to describe binding motifs. In a PWM, each matrix column describes the probability that a given amino acid will be found at that ligand position. The PWM may also be visualized as a sequence logo [69] (Figure 4A). The 192 physiological peptides represented on the arrays in this study do not conform to a random distribution of residues at each position. To compensate for this, the matrices were corrected for the prevalence of amino acid residues at each position in the total data set. In addition, the absence of binding to a given peptide may provide data on inhibitory effects of specific residues. For instance, lack of binding may result from either the absence of critical permissive residues or from the presence of inhibitory residues at specific positions [22]. To make use of both array-positive and array-negative data, we first computed the frequency of occurrence of a given residue at each position using the array-positive peptides (posPWM). This is compared to a PWM of the expected frequency derived from all peptides, excluding non-specific peptides (exPWM). The scoring matrix that results from subtracting exPWM from posPWM expresses the deviation observed in the array-positive data from that of all specific peptides on the array. We term this the expectation-deviation scoring matrix (EDSM).

\[ \mathrm{EDSM} = \mathrm{posPWM} - \mathrm{exPWM} \qquad (1) \]
Figure 4

Specificity for physiological peptides defines functional groups of SH2 domains. (A) Grb2 SH2 domain positive peptides are highlighted and then represented as an EDSM logo. See Figure S5 for EDSM logos of all tested SH2 domains. (B) An unrooted dendrogram clusters families of SH2 domains related by similar binding patterns. A distance matrix between EDSMs was computed and used to generate an unrooted distance tree (see Figure S6). This is artistically represented as a dendrogram with general specificity information overlaid and functional classes denoted by branch color.

By expressing differences between peptides that bind specifically and the peptide set as a whole, the EDSM attempts to compensate for any inherent bias arising in the relatively small set of non-random peptides drawn from physiological proteins. The EDSM for each SH2 domain in this study is visualized using sequence logos (Additional file 2: Figure S5) and condensed into a generalized statement of physiological specificity in the form of a regular expression (Table 3). A distance matrix comparing the EDSMs for the physiological specificity of the SH2 domains describes families of SH2 domains related by their preference for physiological ligands (Additional file 2: Figure S6). This is represented as an unrooted tree of SH2 domain specificity (Figure 4B). Six classes of general specificities are displayed among the SH2 domains tested in this study, revealing similarity among SH2 domains within the same family (e.g. Grb2, Gads, Grap) and across different families (Sh2d1b, Ship2) but also subtle differences (e.g. Abl1 and Abl2). Although the EDSM is informed by both permissive and non-permissive effects, the limited dataset afforded by the addressable arrays in this study limits the utility of the resulting matrices for extrapolating information on non-permissive residues.

Table 3 Specificities obtained using Physiological Ligands

Discussion

The analysis of SH2-mediated interactions with peptide ligands representing the receptors and substrate proteins of the insulin, IGF-1 and FGF systems described herein reconstructs the set of potential phosphotyrosine-mediated interactions that determine the capacity of these systems to recruit signaling proteins upon activation. The potential interactome outlines the possible states that may participate in signaling. Among the factors that determine the possible signaling networks initiated by activated receptors are 1) the available set of SH2 proteins expressed in specific cells; and 2) the capacity of phosphorylated receptor and scaffold sites to recruit those SH2 proteins. The 111 SH2 domain proteins extant in the human genome vary extensively in their tissue- and cell-specific expression [15, 36]. In some cases these expression differences are drastic and even define highly tissue-specific signaling networks such as those in B- and T-lymphocytes [14, 15]. Among the 38 SH2 families, 33 possess at least one gene duplicate, allowing a duplicate copy to acquire new functions such as specialized tissue functions or novel scaffolding capabilities [70]. A family member expressed in one tissue may perform a function redundant to that of its paralog in another tissue, but the two may also diverge in function (Additional file 2: Table S5). The potential interactome for SH2 domains indicates many cases of potential overlap in binding, resulting in pTyr sites that may act as hubs for multiple interactions or serve distinct binding functions in cases where the SH2 complement varies in different cells. The varied potential interaction permutations, or microstates, in turn, are the basis for highly cell-specific signaling outcomes from discrete signal inputs [34].
In simple terms, differences in the available phosphorylated tyrosine sites, as well as in the expression of SH2 domain proteins themselves, have the potential to furnish related but distinct signaling events in response to the same input signal (Figure 5A). Currently, the phosphorylation datasets available from PhosphoSite and PhosphoELM provide only a static view of receptor and scaffold phosphorylation. Even within a cell, the available complement of pTyr sites and locally available SH2 domain proteins may vary over the lifetime of a signal. Protein interaction microstates may differ according to the intensity of ligand stimulation and change as signaling complexes move within the cell, for instance as receptors are internalized on signaling endosomes (Figure 5B). For example, Grb10 and Grb14 are closely regulated adaptor proteins that share similar functions by binding to InsR and negatively regulating insulin signaling. While both genes share high expression in the pancreas, expression varies among adipose tissue, liver and heart (Figure 5C). However, little is known about the temporal and spatial dynamics between these two adaptors. Recently, multiple reaction monitoring (MRM) mass spectrometry has been applied to the Grb2 adaptor to map its dynamic interaction states upon stimulation with various growth factors [71]. Analyses of this type will allow us to better dissect the vast number of microstates among different tissues. Thus, potential interactomes represent crucial datasets to interpret cell- and tissue-specific signaling events. This is particularly relevant in human development and diseases such as cancer in which receptor tyrosine kinases are commonly over-expressed, sometimes by several orders of magnitude. In such pathologies, the primary signaling pathways may be titrated out and novel, normally non-physiological pathways may become activated.
For instance, IGF-1R is either overexpressed or hyperphosphorylated and deregulated in a range of cancers and is currently one of the most studied molecular targets in the field of oncology, yet direct targeting of IGF-1R has proven problematic due to its wide range of important physiological functions [72–74]. Under conditions of hyperphysiological abundance of IGF-1R pTyr sites available for SH2 binding, the potential interactome suggests that non-canonical pathways may become activated, perhaps hinting at novel targets for therapeutic intervention.

Figure 5

Tissue co-expression and microstate of the Insulin/IGF-1 system. Protein interaction microstates across different cell types and across time and space. (A) Co-expression between receptors and SH2 domains can influence the microstate of a specific tissue. (B) Phosphorylation of receptors under stimulation conditions can determine the temporal and spatial events of SH2 ligand binding within a cell. (C) Hierarchical clustering of the insulin responsive tissue expression levels for human SH2 domain-containing genes.

Even in normal physiological circumstances of healthy tissues, the potential interactome may inform our understanding of tissue-specific signaling events. A variety of tissues can respond to insulin stimulation, including adipose, muscle, pancreas, liver, brain etc. [75, 76]. SH2 domain-containing proteins vary widely in their expression in various cells and tissues (Figure 5C). While this likely represents only a piece of a much larger puzzle, it is conceivable that some of the observed tissue-specific responses and downstream signaling differences may relate to the available complement of SH2-containing signaling proteins and their ability to interact with available pTyr sites. In this way, the potential interactome and cell-specific expression combine to determine effective signaling networks.

Consensus motifs and co-evolution

The interaction data also reveal the specificity of 50 SH2 domains for a set of physiological peptides. Typical binding motifs for SH2 domains describe the residues at positions +1 to +4 C-terminal of the essential phosphotyrosine [77–79]. SH2 domain peptide binding motifs have been described for a wide range of SH2 domains using peptide library approaches [19, 20, 29]. Binding motifs obtained from peptide library approaches represent optimal solutions unconstrained by physiological parameters such as the confounding effects of kinase recognition or structural influences of native proteins. The motifs described herein represent binding to ‘real-world’ peptides and thus stand as a relevant contrast to peptide-library based data. However, it should be noted that this dataset corresponds to a potential physiological interactome. Because not all of the peptides have been confirmed to be phosphorylated in vivo, our interaction maps are best used in conjunction with the expanding mass spectrometry literature and its associated databases.

Broadly speaking, the SH2 consensus binding motifs identified from interactions observed using addressable arrays of physiological peptides are remarkably similar to the motifs described using peptide library approaches (Table 3). Yet binding specificities observed for physiological phosphotyrosine peptide ligands may in some cases represent more than the specificity of the isolated SH2 domain. The EDSM position weighted matrices noted in Additional file 2: Figure S5 reveal a number of cases in which residues outside of the conventional window of positions +1 to +4 appear to influence binding. Longer contact regions have been noted for certain SH2 domains in the past, though these are generally exceptions to the rule. For instance, the SH2 domain of SH2D1A/SAP binds to an extended peptide in the SLAM receptor comprising residues −2 to +3 and shows a diminished dependence on phosphorylation of the tyrosine for binding [43]. Physiological peptide ligands co-evolve to allow recognition by their cognate SH2 domain partner, while also acting as competent substrates for their cognate kinases. In some cases, the observed specificity for physiological peptide ligands may therefore represent an amalgam of SH2 specificity, kinase recognition, and other factors. This may, for example, explain the apparent preference of the Crk SH2 domain for an Asp residue at the −2 position. The presence of an aspartic acid residue at the −2 position does not appear to contribute to Crk SH2 domain binding (Figure 4B). Instead, it may reveal a signature for a distinct event such as kinase recognition for a specific subset of physiological peptides. Indeed, a large number of tyrosine kinases have a reported preference for acidic residues preceding the target tyrosine residue [80, 81]. Not surprisingly, acidic residues are commonly observed in the EDSM logos for the SH2 domains (Additional file 2: Figure S5).
In addition to acting as kinase substrates and SH2 domain binding sites, the peptide motif must also presumably be surface exposed, and potentially disordered prior to binding, and these factors may also contribute to the overall physiological peptide motif. Combining multiple motifs in computational searches has been shown to markedly increase predictive accuracy [82], suggesting that the inclusion of indirect components such as kinase specificity may make for a more robust predictor of SH2 interactions. While the current data set is relatively small in size, larger sets of data identifying physiological peptide interactions may provide useful data for investigating the overlapping influences of multiple events required for functional signaling based on overlapping motifs.

In our analysis we find that peptides reported to be phosphorylated in PhosphoSite are significantly more likely to have one or more SH2 domain-binding partners than peptide nodes that are not currently known to be phosphorylated. This is not surprising given that evolutionary pressure may be exerted to conserve critical binding sites. Conversely, given the specificity of SH2 domains, the chances of an SH2-interacting peptide occurring by chance within a non-phosphorylated peptide may be assumed to be relatively low. The more residues that must be specified to stipulate binding, the lower the probability that this will occur spontaneously within a non-phosphorylated sequence. If only one key residue supported by one of two secondary residues were sufficient to allow an SH2 domain to bind, then the chances of randomly generating an SH2 binding site centered around a given tyrosine residue would be less than one in a hundred. Given the specificity observed for SH2 domains in this study, the likelihood of a random sequence encoding an SH2 domain ligand appears rather limited. The appearance of a small number of highly connected peptide nodes on sites not currently known to be phosphorylated raises the question of whether SH2 domain-binding might serve as a means of predicting phosphorylation. Perhaps highly connected peptide hubs such as IRS1 Y-151, IRS2 Y-184, FRS3 Y-287 and FRS3 Y-322 predict phosphorylation. ScanSite predicts the first three of these sites as kinase substrates, while the sequence surrounding FRS3 Y-322 is identical to a known phosphorylation site on FRS2, suggesting that these may indeed turn out to be phosphorylated under appropriate conditions.
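The "less than one in a hundred" estimate above can be reproduced with simple arithmetic under one plausible reading. The sketch below assumes, purely for illustration, that each position is drawn independently and uniformly from the 20 standard amino acids, that the key residue must occupy one fixed position, and that either of two secondary residues is acceptable at one other fixed position. These assumptions and the function name `chance_of_motif` are ours; the text does not spell out the underlying calculation.

```python
from fractions import Fraction

N_AMINO_ACIDS = 20  # assumption: uniform background over the standard residues


def chance_of_motif(n_key_options=1, n_secondary_options=2):
    """Probability that a random sequence satisfies a two-position motif."""
    p_key = Fraction(n_key_options, N_AMINO_ACIDS)              # 1/20
    p_secondary = Fraction(n_secondary_options, N_AMINO_ACIDS)  # 2/20
    return p_key * p_secondary


p = chance_of_motif()
print(p, float(p))  # 1/200 0.005
```

Under these assumptions the chance is 1 in 200, consistent with the stated bound; real sequences are neither uniform nor independent, so the figure is only an order-of-magnitude guide.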

A high degree of selectivity for physiological ligands may itself be an outcome of evolutionary pressures, as has been noted for yeast SH3 domains. The Sho1 SH3 domain recognizes a binding peptide in Pbs1, and no other SH3 domain in the yeast genome cross-reacts with the Pbs1 peptide. SH3 domains from other species that have not been under evolutionary pressure to ignore this site exhibit less selectivity for the Pbs1 peptide [83]. A high degree of specificity among human SH2 domains, combined with cell-specific expression is consistent with the notion that evolutionary pressures drive selectivity of protein-ligand interactions.

Comparison to the literature

In the quarter century since the SH2 domain was first described [48, 84], hundreds of interactions have been described between SH2 domains and phosphotyrosine peptides. In many cases these have been subject to intensive biophysical analysis, yielding a considerable set of bona fide interactions against which HTP studies can be validated. Placing new studies within the context of the extant literature is particularly important for systems-level studies, for which validation is inherently limited. In the case of the 50 SH2 domains and 192 peptides included in this study, we confirmed 60 interactions by the orthogonal method of fluorescence polarization. We compared our results to those reported in previous studies. In the case of carefully controlled studies that examine SH2 interactions, our results closely match the reported interactions (Table 1). However, our results did not match well against one large-scale interaction study conducted using SH2 domain arrays (Additional file 2: Table S3) [57]. Our results suggest that the SH2 protein microarray results may suffer from high false-positive and false-negative rates and that the reported KD values are likely inaccurate. This is consistent with other studies suggesting that protein microarray data are semi-quantitative and subject to false-positive results [60], particularly in the absence of orthogonal validation.

Several lessons may be taken from such results. They suggest a set of standards that could be universally applied in future high-throughput studies of protein-peptide interactions; these are explored in detail elsewhere [54]. First, proteins are fundamentally problematic in that they may easily lose binding activity. A set of positive controls is thus essential and should be present in every assay. Only about half of the SH2 domains express well as fusion proteins from bacteria [23]. The rest suffer from poor expression and lack reproducible binding activity, suggesting that any use of these SH2 domains in high-throughput in vitro binding studies may yield erroneous results. The present study used only 50 SH2 domains that have previously been shown to express well and exhibit good solubility and reproducible binding. A second issue relates to validation by an orthogonal method; to this end, the current study examined 60 binary pairs by solution-phase fluorescence polarization binding, as well as a smaller set by GST-pulldown. A third consideration is agreement between HTP datasets and existing literature. Well-controlled studies reporting peptide-binding motifs for SH2 domains provide a wealth of data. SH2 domains bind to relatively specific motifs [19, 29], and these provide excellent validation tools. Apparent interactions that do not match the known binding motifs are a cause for concern and should be further validated. As noted in Table 1, the dataset described in this study is in strong agreement with literature-reported interactions, and the variations can largely be rationalized.

Concluding remarks

In examining SH2 domain interactions, we followed a systematic approach for systems-level interactome studies using orthogonal validation and literature curation as a means of enhancing confidence in the experimental dataset. This results in a large set of high-confidence interactions that outline the potential interactome between 50 SH2 domains and 192 phosphopeptide sequences covering 13 proteins involved in FGF, insulin, and IGF-1 signaling. The development of a detailed potential interactome for this set of signaling components represents an early step towards a more detailed understanding of cell-specific signaling networks. This stands to deepen our understanding of tissue-specific and disease-specific signaling networks that are predicated upon the varying and inevitably complex interpretation of the potential interactome by the available expressed interaction partners.

Experimental procedures

Plasmids and recombinant proteins

A comprehensive list of 121 SH2 domains contained in 111 human proteins [14] served as the starting point for the assembly of a large set of SH2 domain clones. The cDNA clones for SH2 domains were obtained from ATCC except for those noted otherwise. A complete list of source DNA and SH2 clones is shown in Additional file 3: Table S2. SH2 domains were cloned into pGEX-2TK (Amersham Pharmacia) and verified by DNA sequencing. GST-fusions of SH2 domains were expressed in E. coli strain BL21 (Stratagene) at 37°C overnight and induced with 1 mM IPTG for 3 hours. Cells were centrifuged, resuspended in PBS and lysed by sonication. The cellular fractions were incubated with glutathione sepharose (Thermo Scientific) and washed with PLC lysis buffer (50 mM Hepes pH 7.5, 150 mM NaCl, 10% glycerol, 1% Triton X-100). SH2 proteins were eluted using 10 mM glutathione, 50 mM Tris HCl pH 8.0 and purified using the NAP-10 (Amersham Pharmacia) column system.

Peptide arrays

The peptide libraries were synthesized onto an acid-hardened amino-PEG500 cellulose membrane #UC540 (Intavis, Germany) using an Intavis Multipep as described [41]. The estimated yield of peptide at each position was approximately 5 nmol. Addressable peptide arrays representing physiological peptides were composed of 192 peptides, each composed of 11 amino acid residues, corresponding to tyrosine-containing peptides from InsR, IGF-1R, IRS-1, IRS-2, FGFR1, FGFR2, FGFR3, FRS-2, FRS-3, PLCγ1, p130Cas, p62DOK1. Phosphotyrosine residues were located at the fifth position in singly phosphorylated peptides. In most cases Cys residues were replaced with Ser. The membranes were stored at −20°C until use. The membranes were deprotected according to the manufacturer's instructions, using a 95% TFA, 3% TIPS, 2% H2O cocktail for three hours. Phosphotyrosine incorporation was assessed by incubation with anti-phosphotyrosine antisera 4G10 (Upstate) and pY20 (Santa Cruz). Additional file 1: Table S1 indicates the array position, peptide sequence, protein source position, and comments on related peptides and synthesis problems.

SPOTs Analysis of SH2 domain specificities

All steps were carried out at room temperature unless otherwise specified. The SPOTs membrane was first blocked with 5% nonfat milk in TBS-T (0.1 M Tris-HCl (pH 7.4), 150 mM NaCl, and 0.1% Tween 20) overnight at 4°C. GST alone or GST fusion proteins (0.25 μM) were incubated with the SPOTs membrane in the same buffer containing 1 mM DTT for 1.5 hours at room temperature and then washed with TBS-T. Bound GST fusion proteins were detected with anti-GST antibodies (Amersham) followed by incubation with anti-goat Alexa Fluor 680 (Molecular Probes). The array membrane was subsequently washed four times with TBS-T for 10 min. Peptides that bound the domain of interest were visualized by Li-Cor Odyssey using the 700 nm channel. Intensities were calculated using a grid with 192 circular features of 2 mm diameter, each centered on a peptide spot to avoid scoring SPOTs with halos or rings. For each feature, the average (integrated) intensity was used for downstream analysis.

Fluorescence polarization

Peptides were synthesized using FMOC chemistry on pre-loaded TentaGel resins. Peptides were then labeled with Rhodamine B (Abbey Color) and cleaved using trifluoroacetic acid. Peptides were lyophilized and then purified using an LC/MS (Agilent 2100). Dissociation constants were measured using the Beacon 2000 (Invitrogen) as previously described [40].

Data analysis

All analysis steps were performed as previously described [86]. Peptide intensity scores (excluding those defined as non-specific) were averaged across each 192-peptide array, producing an array mean. Array-positive binding was ascribed to interactions with intensities greater than three times the array mean. Peptide spots with average intensity values between 1X-3X the array mean were defined as ‘indeterminate’. Those with intensities below 1X mean were defined as array negative. Non-specific signal was detected by arraying three separate 192 arrays with three separate GST preps at 0.25 μM. Non-specific binding peptides were identified as those with signal intensities greater than 3X the array mean in at least two of three trials.
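The scoring rules above can be sketched as follows. This is a minimal illustration using made-up intensities, not the analysis code used in the study; the function names `classify_array` and `nonspecific_peptides`, the peptide labels, and the handling of values exactly at 1X or 3X the mean are assumptions.

```python
from statistics import mean


def nonspecific_peptides(gst_trials):
    """Peptides exceeding 3X the array mean in at least two GST-alone trials."""
    hits = {}
    for trial in gst_trials:
        trial_mean = mean(trial.values())
        for pep, value in trial.items():
            if value > 3 * trial_mean:
                hits[pep] = hits.get(pep, 0) + 1
    return {pep for pep, n in hits.items() if n >= 2}


def classify_array(intensities, nonspecific=frozenset()):
    """Label each peptide relative to the mean of the specific peptides."""
    specific = [v for k, v in intensities.items() if k not in nonspecific]
    array_mean = mean(specific)
    calls = {}
    for pep, value in intensities.items():
        if pep in nonspecific:
            calls[pep] = "non-specific"
        elif value > 3 * array_mean:
            calls[pep] = "positive"
        elif value >= array_mean:      # between 1X and 3X the array mean
            calls[pep] = "indeterminate"
        else:
            calls[pep] = "negative"
    return calls


# Hypothetical GST-alone trials: p9 is bright in two of three arrays.
background = {f"p{i}": 1.0 for i in range(1, 9)}
gst_trials = [{**background, "p9": 50.0},
              {**background, "p9": 50.0},
              {**background, "p9": 1.0}]
nonspecific = nonspecific_peptides(gst_trials)

# Hypothetical SH2 domain array.
intensities = {"p1": 200.0, "p2": 40.0, "p9": 500.0}
intensities.update({f"p{i}": 5.0 for i in range(3, 9)})
calls = classify_array(intensities, nonspecific)
print(calls["p1"], calls["p2"], calls["p3"], calls["p9"])
```

Excluding the non-specific peptide before averaging matters here: including the bright `p9` spot would inflate the array mean and hide the genuine positive.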

Phosphorylation status and solvent exposed tyrosines

The structure files of InsR (1IRK, 2DTG), IGF-1R (1P4O), IRS-1 (1IRS, 1QQG), FGFR1 (1FGK), FGFR2 (2PVF), FRS2 (1XR0), p62DOK1 (2V76), and PLCG1 (1HSQ, 2HSP) were collected from the Protein Data Bank (PDB) (http://www.rcsb.org). Surface accessible tyrosines were calculated using the Gerstein algorithm (http://helixweb.nih.gov/structbio/). The phosphorylation status of the 192 sites was identified using the protein modification resource PhosphoSite (http://www.phosphosite.org).

PSSMs and EDSM

For each SH2 domain a position specific scoring matrix (PSSM) was calculated for the array-positive peptides (posPSSM). A second PSSM was calculated for all peptides, excluding those judged to be non-specific, as the expected distribution of amino acids represented on the array (exPSSM). Subtracting exPSSM from posPSSM yields the expectation deviation scoring matrix or EDSM. The EDSM for each SH2 domain was visualized as a logo of positive and negative factors using Weblogo [69].
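The subtraction described above can be sketched as follows. This is an illustrative sketch only: each matrix is treated here as a plain per-position residue frequency table, which is an assumption, since the exact PSSM weighting is not specified in this section. The toy 5-residue peptides and the function names are hypothetical; the study used 11-residue peptides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_matrix(peptides):
    """Per-position residue frequencies, one dict per peptide position."""
    matrix = []
    for pos in range(len(peptides[0])):
        counts = {aa: 0 for aa in AMINO_ACIDS}
        for pep in peptides:
            counts[pep[pos]] += 1
        matrix.append({aa: counts[aa] / len(peptides) for aa in AMINO_ACIDS})
    return matrix


def edsm(positive, all_specific):
    """Deviation of array-positive frequencies from expected frequencies."""
    pos_m = frequency_matrix(positive)      # stands in for posPSSM
    ex_m = frequency_matrix(all_specific)   # stands in for exPSSM
    return [{aa: p[aa] - e[aa] for aa in AMINO_ACIDS}
            for p, e in zip(pos_m, ex_m)]


# Toy peptides; Y at index 1 stands for the phosphotyrosine.
all_specific = ["AYVNV", "SYENL", "GYLPQ", "TYIDA"]
positive = ["AYVNV", "SYENL"]
m = edsm(positive, all_specific)
print(m[3]["N"], m[3]["P"], m[1]["Y"])  # 0.5 -0.25 0.0
```

Positive entries mark residues enriched among binders (here Asn two positions after the tyrosine), negative entries mark residues depleted among binders, and an invariant position such as the tyrosine scores zero.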

EDSM clustering

The unbiased position-specific expectation deviation scoring matrix was expanded into a hyper-dimensional vector representation, and the Euclidean distances between vectors were computed. The resulting N-by-N distance matrix was then clustered using the Fitch-Margoliash method in the Phylip package [85]. The unrooted tree was drawn using the MEGA package [86].
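The distance step can be sketched as below. This is a minimal illustration with hypothetical domain names and a two-letter toy alphabet, assuming each EDSM is stored as a list of per-position dictionaries; tree building with the Fitch-Margoliash method is left to Phylip and is not reproduced here.

```python
import math


def flatten(matrix, alphabet):
    """Expand a position-by-residue matrix into one long vector."""
    return [column[aa] for column in matrix for aa in alphabet]


def distance_matrix(edsms, alphabet):
    """Pairwise Euclidean distances between flattened matrices."""
    names = sorted(edsms)
    vectors = {name: flatten(edsms[name], alphabet) for name in names}
    distances = [[math.dist(vectors[a], vectors[b]) for b in names]
                 for a in names]
    return names, distances


# Toy single-position matrices over a two-letter alphabet.
toy = {
    "dom1": [{"A": 0.5, "N": -0.5}],
    "dom2": [{"A": 0.5, "N": -0.5}],
    "dom3": [{"A": -0.5, "N": 0.5}],
}
names, D = distance_matrix(toy, alphabet="AN")
print(names)
print(D[0][1], D[0][2])
```

Domains with identical preferences sit at distance zero, while opposite preferences give the largest distance; the resulting symmetric matrix is the kind of input a distance-based tree program expects.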

Reported interactions

Reported peptide interactions were collected by searching HPRD and literature. Reported protein interactions were collected from the major protein-protein interaction databases of MINT [50], BIND [51], HPRD [52], and DIP [53] using UniHI [49].

Cell lines and GST pull-downs

Chinese Hamster Ovary (CHO) cells stably overexpressing insulin receptor (InsR) and IRS-1 were graciously provided by Xiao Jian Sun (UChicago). CHO cells were grown in DMEM/F12 supplemented with 10% fetal bovine serum, penicillin and streptomycin. CHO cells were serum starved for 24 hours and treated with or without insulin (100 nM) for 5 min. Cells were lysed in HNTG (20 mM Hepes 7.5, NaCl, 1% Triton X-100, 10% glycerol, 1 mM Na3VO4) with protease inhibitors (1 mM PMSF, aprotinin and leupeptin). Pre-cleared lysates were incubated with GST-SH2 domains immobilized on glutathione beads and rocked for 3 hours at 4°C. Activated InsR and IRS-1 were detected using anti-phosphotyrosine 4G10 (Upstate).