Introduction

Solubility is a fundamental and complex biophysical property of globular proteins, which is often crucial for their correct functioning1,2. It is intimately connected with the stability of the three-dimensional (3D) protein structure and strongly depends on environmental quantities such as the pH, the temperature, the buffer type and the protein concentration.

Solubility problems manifest themselves through different physical behaviors. The simplest one consists of the irreversible formation of native-state protein precipitants when the protein concentration overpasses the solubility limit; note that this limit depends on the environmental conditions. The picture gets more complicated when the aggregated or precipitated form includes not only native structures but also misfolded, partially folded and unfolded conformations. The formation of highly ordered aggregates such as amyloid fibrils from misfolded conformations constitutes the pathological characteristics of a large variety of disease conditions such as the neurodegenerative Alzheimer and Parkinson diseases. In these cases, the deposition of the β-amyloid and α-synuclein aggregates, respectively, in the patient’s brain prevents the normal functioning of neurons2,3,4,5.

Lack of solubility is frequently a major bottleneck in high-throughput structural genomic studies as well as in industrial applications requiring high-concentration production of recombinant proteins, such as monoclonal antibody solutions for pharmaceutical applications. In these processes, the formation of amorphous inclusion bodies from the aggregation of different (denatured and partially folded) conformations limits the biological activity of the product and makes necessary to implement complex solubilization and refolding procedures in order to recover the bioactive forms6,7,8,9,10,11.

The understanding of the mechanisms that modulate protein solubility is highly challenging due to their dependence on many intrinsic and extrinsic factors. Unraveling these complex relationships and the connection between the 3D structural properties and the solubility is a crucial objective for many academic and biotechnological applications. Despite the research devoted to these problems in the last 20 years and some important advances, the precise identification of the amino acid interactions and structural characteristics that lead to soluble or aggregated states and their physical interpretation remain elusive.

An early study12 showed that the solubility of proteins overexpressed in Escherichia coli is anti-correlated with the total number of residues. Regarding the contribution of specific amino acids to protein solubility, the favorable role of the negatively charged aspartic and glutamic acids was observed13. This trend was confirmed by other studies14,15,16. In contrast, weakly soluble protein expression appears to be correlated with large, positively charged, surface patches17. Note that recent studies demonstrated that arginines lead to aggregation, but not lysines17,18, probably because the Arg side chain is more prone to inter-protein interactions. Finally, a series of investigations point out that aromatic-rich proteins tend to be less soluble than aromatic-depleted ones16,19.

Many of these properties, combined with sequence features such as the aliphatic index, the secondary structure propensities and/or the amino acid composition, have been employed by computational approaches to predict the soluble nature of target proteins15,19,20,21,22. Although these methods reach good performances and are thus quite useful, their sequence-based nature linked to the fact that they employ “black box” machine learning approaches, fails in providing comprehensive biophysical insights into protein solubility.

In this paper, we used knowledge-based mean force potentials derived from datasets of protein structures of known solubility to get a clearer picture of the mechanisms that drive protein solubility. In particular, we focused on the solubility dependence of all possible amino acid pair interactions, with the aim of understanding which and why some of them are more favorable in soluble than in weakly soluble proteins and vice versa. We also tested the ability of our new potentials to discriminate between soluble and aggregation-prone proteins, on different datasets and with different solubility definitions. The comprehension gained from such studies is of utmost importance for the rational design of proteins with increased solubility, a challenging goal in protein engineering. Indeed, it saves costly, time-consuming, wet lab experiments that are needed to reduce unwanted aggregate formation and increase solubility14,23,24.

Methods

Protein structure and solubility dataset

To investigate the relation between protein structure, energy properties and solubility, we constructed a dataset of high-resolution X-ray structures with known solubility value. The starting point was the eSOL database16 that contains aggregation propensities of about 70% of the entire proteome of the E. coli K-12 strain synthesized with the PURE system25, an in vitro reconstituted and chaperone-free translation system. For each protein, the solubility \({\mathscr{S}}\) (in %) was experimentally determined as the ratio between the supernatant protein fraction obtained after centrifugation of the translation mixture, and the total uncentrifuged protein fraction.

To map the gene accession IDs associated with the eSOL entries onto the corresponding 3D structures in the Protein Data Bank (PDB)26, we used the EcoGene server27, a functional and structural annotation database of E. coli. We selected only the PDB structures that have a sequence identity of 100% with the associated EcoGene entries, as evaluated with the sequence alignment software BLAST28. The protein-culling server PISCES29 was then used to further refine the structure dataset and avoid biases due to the inclusion of proteins of similar sequences. We chose a threshold value of 25% on the pairwise sequence identity and a structure resolution of 2.5 Å at most. Transmembrane proteins were also filtered out.

The resulting \({{\mathscr{D}}}^{{\rm{tot}}}\) set is composed of 412 proteins with experimental structure and solubility. To investigate how protein structural properties are related to solubility, we divided this dataset in two subsets with an equal number of proteins. The first set, called \({{\mathscr{D}}}^{{\rm{sol}}}\), contains all structures with solubility \({\mathscr{S}}\ge 64 \% \), while the \({{\mathscr{D}}}^{{\rm{insol}}}\) dataset is composed of aggregation-prone proteins with \({\mathscr{S}} < 64 \% \). The list of proteins in these sets and some of their characteristics are given in Table S1, the distribution of soluble and weakly soluble proteins in Fig. S1, and the relative frequency of the twenty amino acids in the two datasets in Fig. S2 of Supplementary Information.

Standard statistical residue-residue potentials

Knowledge-based statistical potentials were used to describe the interaction strength between two interacting residues. These potentials of mean force30,31,32,33 are widely used in a large variety of applications, from protein structure prediction to the analysis of the impact of mutations on protein stability. They are derived from the frequency of observation of associations of specific sequence-structure elements in a dataset of experimental 3D protein structures using the inverse Boltzmann law.

In this paper we focused on distance potentials, where the structure elements are the distances d between the side chain geometric centers of two amino acids. The sequence elements are amino acid types s and s′. The energy associated to a sequence-structure association \((s,s^{\prime} ,d)\) can be evaluated as31,33:

$${\rm{\Delta }}W(s,s^{\prime} )=-\,{k}_{B}T\,\mathrm{ln}\,\frac{P(s,s^{\prime} ,d)}{P(s,s^{\prime} )P(d)}$$
(1)

where kB is the Boltzmann constant and T the absolute temperature. \(P(s,s^{\prime} ,d)\) is the probability of observation of two amino acid types s and s′ at the spatial distance d, \(P(s,s^{\prime} )\) the probability of these two amino acid types at any distance, and P(d) the probability of any types of amino acids at the distance d. These probabilities are estimated from the relative frequencies F of observation of sequence-structure elements in a dataset of 3D protein structures, which are in turn derived from the number of occurrences n of these elements as:

$${\rm{\Delta }}W(s,s^{\prime} ,d)\cong -\,{k}_{B}T\,\mathrm{ln}\,\frac{F(s,s^{\prime} ,d)}{F(s,s^{\prime} )F(d)}=-\,{k}_{B}T\,\mathrm{ln}\,\frac{n\,{n}_{ss^{\prime} d}}{{n}_{ss^{\prime} }\,{n}_{d}}$$
(2)

where n is the total number of amino acid pairs. The distances d, between 3 and 10 Å, were divided into 35 bins of 0.2 Å width; the last bin contains all distances larger than 10 Å. The discretized d values correspond to the middle value of each bin. The frequencies were computed separately according to the separation along the sequence of the two amino acids s and s′. More precisely, if s and s′ are at positions i and j along the sequence, respectively, a separate potential is computed for each value of \(1 < |i-j|\le 8\), to take into account the effect of the protein chain. For \(|i-j| > 8\), where the effect of the chain can be considered as insignificant, all the frequencies are mixed into a single potential.

Solubility-dependent statistical potentials

A commonly alleged drawback of the statistical potential formalism defined in Eq. (2) is their bias towards the protein structure dataset from which they are derived. However, this drawback can be turned into an asset if these biases are utilized to better describe specific properties of the dataset. The temperature dependence of the amino acid interactions has been extensively analyzed using this technique in our earlier works34,35,36.

Here we used this strategy to deepen the analysis of protein solubility at the molecular level. The central idea is that the potentials obtained from the complete dataset \({{\mathscr{D}}}^{{\rm{tot}}}\) and from the datasets \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\), which only contain protein structures with solubility values in a certain range, reflect the properties of the ensemble from which they are derived.

We defined three types of statistical potentials. The first, referred to as soluble protein potentials, are obtained from the dataset of soluble proteins \({{\mathscr{D}}}^{{\rm{sol}}}\) and the full set \({{\mathscr{D}}}^{{\rm{tot}}}\)34:

$${\rm{\Delta }}{W}^{{\rm{sol}}}(s,s^{\prime} ,d)\cong -\,{k}_{B}T\,\mathrm{ln}\,\frac{F(s,s^{\prime} ,d,{{\mathscr{D}}}^{sol})}{F(s,s^{\prime} ,{{\mathscr{D}}}^{tot})F(d,{{\mathscr{D}}}^{sol})}$$
(3)

where \(F(s,s^{\prime} ,d,{{\mathscr{D}}}^{{\rm{sol}}})\) and \(F(d,{{\mathscr{D}}}^{{\rm{sol}}})\) are observation frequencies computed in the \({{\mathscr{D}}}^{{\rm{sol}}}\) subset, while \(F(s,s^{\prime} ,{{\mathscr{D}}}^{{\rm{tot}}})\) are frequencies from the \({{\mathscr{D}}}^{{\rm{tot}}}\) set. In an analogous way, the second type of potentials, called for simplicity “insoluble” protein potentials, are derived from the \({{\mathscr{D}}}^{{\rm{insol}}}\) set of weakly soluble proteins and the total set \({{\mathscr{D}}}^{{\rm{tot}}}\):

$${\rm{\Delta }}{W}^{{\rm{insol}}}(s,s^{\prime} ,d)\cong -\,{k}_{B}T\,\mathrm{ln}\,\frac{F(s,s^{\prime} ,d,{{\mathscr{D}}}^{{\rm{insol}}})}{F(s,s^{\prime} ,{{\mathscr{D}}}^{{\rm{tot}}})F(d,{{\mathscr{D}}}^{{\rm{insol}}})}$$
(4)

The last potentials, referred to as total potentials, are computed from the complete set \({{\mathscr{D}}}^{{\rm{tot}}}\) only:

$${\rm{\Delta }}{W}^{{\rm{tot}}}(s,s^{\prime} ,d)\cong -\,{k}_{B}T\,\mathrm{ln}\,\frac{F(s,s^{\prime} ,d,{{\mathscr{D}}}^{{\rm{tot}}})}{F(s,s^{\prime} ,{{\mathscr{D}}}^{{\rm{tot}}})F(d,{{\mathscr{D}}}^{{\rm{tot}}})}$$
(5)

Coping with finite-size dataset effect

When estimating the probabilities in eq. (1) in terms of frequencies to obtain Eq. (2), the underlying assumption is that the number of protein structures contained in the dataset is large enough to yield statistically significant values. While this is in general a reasonable hypothesis for standard statistical potentials, which are derived from thousands of structures, it is less so for the potentials constructed here, since there are only a few hundreds of protein structures with experimentally characterized solubility. The relative smallness of the \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\) sets is thus likely to introduce some distortions. To cope with these problems and get smooth and statistically significant potentials, we introduced two additional layers of computation.

The first layer consists in considering only the distance bins d that contain a sufficient number of occurrences. We chose the threshold value on nssd equal to 10. If this value is not reached, the potentials are set to zero. Eq. (2) thus becomes:

$$\begin{array}{llll}{\rm{\Delta }}W(s,s^{\prime} ,d) & = & -{k}_{B}T\,\mathrm{ln}\,\frac{n\,{n}_{ss^{\prime} d}}{{n}_{ss^{\prime} }\,{n}_{d}} & {\rm{if}}\,{n}_{ss^{\prime} d} > 10\\ {\rm{\Delta }}W(s,s^{\prime} ,d) & = & 0 & {\rm{otherwise}}\end{array}$$
(6)

The second layer is dedicated to achieving a smoother potential behavior through a smoothing procedure that consists in replacing the number of occurrences in a bin \((s,s^{\prime} ,d)\) with the weighted sum of the occurrences of the four neighborhood bins as:

$${\hat{n}}_{ss^{\prime} d}=\frac{1}{{\alpha }^{2}}{n}_{ss^{\prime} (d-2b)}+\frac{1}{\alpha }{n}_{ss^{\prime} (d-b)}+{n}_{ss^{\prime} d}+\frac{1}{\alpha }{n}_{ss^{\prime} (d+b)}+\frac{1}{{\alpha }^{2}}{n}_{ss^{\prime} (d+2b)}$$
(7)

where α is a constant larger than one, which we fixed here to 4/3, and b is the width of the distance bin, equal here to 0.2 Å. The four bins \(d\pm b\) and \(d\pm 2b\) correspond to the four bins that are the closest from the central bin d. The number of occurrences \({\hat{n}}_{ss}\) and \({\hat{n}}_{d}\) are obtained from \({\hat{n}}_{ssd}\) by summing over all distances and amino acid types, respectively.

Statistical significance analysis

To quantitatively determine whether the differences between soluble and insoluble potentials are statistically significant or due to random fluctuations, we computed two quantities: the mean \( {\mathcal M} \) difference between the two potentials, summed over all Nd distances bins:

$${ {\mathcal M} }_{ss^{\prime} }=\frac{1}{{N}_{d}}\,\sum _{d=1}^{{N}_{d}}\,({\rm{\Delta }}{W}^{{\rm{sol}}}(s,s^{\prime} ,d)-{\rm{\Delta }}{W}^{{\rm{insol}}}(s,s^{\prime} ,d))$$
(8)

and the variance \({\mathscr{V}}\) of these potentials:

$${{\mathscr{V}}}_{ss^{\prime} }=\frac{1}{{N}_{d}}\,\sum _{d=1}^{{N}_{d}}\,{({\rm{\Delta }}{W}^{{\rm{sol}}}(s,s^{\prime} ,d)-{\rm{\Delta }}{W}^{{\rm{insol}}}(s,s^{\prime} ,d))}^{2}$$
(9)

To test the significance of the differences between soluble and insoluble potentials for a given residue pair (s, s′), we compared \(|{ {\mathcal M} }_{ss^{\prime} }|\) and \({{\mathscr{V}}}_{ss^{\prime} }\) with the analogous quantities computed on sets obtained by randomly separating \({{\mathscr{D}}}^{{\rm{tot}}}\) into two subsets with an equal number of proteins. This random shuffling and \( {\mathcal M} \) and \({\mathscr{V}}\) computations were repeated 100 times. If the \(|{ {\mathcal M} }_{ss^{\prime} }|\) and/or \({{\mathscr{V}}}_{ss^{\prime} }\) values computed from the datasets \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\) are higher than 95% of those computed from the randomized datasets, the interaction (s, s′) was considered to differ significantly between soluble and aggregation-prone proteins. We actually used two statistical significance criteria: a stricter one in which the fraction of randomly obtained \(|{ {\mathcal M} }_{ss^{\prime} }|\) and \({{\mathscr{V}}}_{ss^{\prime} }\) values that are smaller than the actual \(|{ {\mathcal M} }_{ss^{\prime} }|\) and \({{\mathscr{V}}}_{ss^{\prime} }\) values, denoted Sig\({ {\mathcal M} }_{ss^{\prime} }\) and Sig\({{\mathscr{V}}}_{ss^{\prime} }\), are both larger than 0.95, and a relaxed criterion in which Sig\({ {\mathcal M} }_{ss^{\prime} }\ge 0.95\) or Sig\({{\mathscr{V}}}_{ss^{\prime} }\ge 0.95\).

Solubility-dependent protein folding free energy

Three types of folding free energies were computed for proteins represented by their sequence S and 3D conformation C, using the three potentials derived from the soluble, insoluble and total protein datasets, as defined in eqs (3, 4 and 5):

$$\begin{array}{rcl}{\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}} & = & \sum _{i=1}^{N}\,\sum _{j=i+2}^{N}\,{\rm{\Delta }}W({s}_{i},{s}_{j}^{\prime} ,d,{{\mathscr{D}}}^{sol})\\ {\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}} & = & \sum _{i=1}^{N}\,\sum _{j=i+2}^{N}\,{\rm{\Delta }}W({s}_{i},{s}_{j}^{\prime} ,d,{{\mathscr{D}}}^{insol})\\ {\rm{\Delta }}{W}_{S,C}^{{\rm{tot}}} & = & \sum _{i=1}^{N}\,\sum _{j=i+2}^{N}\,{\rm{\Delta }}W({s}_{i},{s}_{j}^{\prime} ,d,{{\mathscr{D}}}^{tot})\end{array}$$
(10)

where si and \({s}_{j}^{\prime} \) are two amino acid types at positions i and j along the sequence, respectively; N is the sequence length. To avoid any overfitting, the folding free energies were computed using a leave-one-out cross validation strategy, consisting of removing the target protein \((\bar{S},\bar{C})\) from all the datasets \({{\mathscr{D}}}^{{\rm{sol}}}\), \({{\mathscr{D}}}^{{\rm{insol}}}\) and \({{\mathscr{D}}}^{{\rm{tot}}}\) when computing its folding free energies \({\rm{\Delta }}{W}_{\bar{S},\bar{C}}^{{\rm{sol}}}\), \({\rm{\Delta }}{W}_{\bar{S},\bar{C}}^{{\rm{insol}}}\) and \({\rm{\Delta }}{W}_{\bar{S},\bar{C}}^{{\rm{tot}}}\). Note that this cross validation procedure is very strict, since the datasets contain, by construction, no proteins with more than 25% sequence identity with any target \((\bar{S},\bar{C})\).

We also computed the soluble and insoluble folding free energy difference:

$${\rm{\Delta }}{W}_{S,C}^{\mathrm{insol}-\mathrm{sol}}={\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}-{\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}$$
(11)

It was used to estimate protein solubility.

Results and Discussion

We derived both classical and solubility-dependent statistical distance potentials from the three sets \({{\mathscr{D}}}^{{\rm{sol}}}\), \({{\mathscr{D}}}^{{\rm{insol}}}\) and \({{\mathscr{D}}}^{{\rm{tot}}}\) containing proteins with different solubility values, with the aim of quantifying the contribution of amino acid pair interactions to protein solubility. These novel potentials \({\rm{\Delta }}{W}^{{\rm{sol}}}\), \({\rm{\Delta }}{W}^{{\rm{insol}}}\) and \({\rm{\Delta }}{W}^{{\rm{tot}}}\) were computed and analyzed for all 210 residue-residue pairs. For each of them, we computed the folding free energy profiles as a function of the distance d between the residues, compared the profiles obtained with the three potentials, and identified the residue pairs for which the profiles differ significantly. In this way, we were able to highlight the interactions that contribute more strongly than the others to the increase or decrease of protein solubility. A first striking result is that the soluble and insoluble folding free energy profiles obtained with \({\rm{\Delta }}{W}^{{\rm{sol}}}\) and \({\rm{\Delta }}{W}^{{\rm{insol}}}\) differ for a large number of residue pairs, with the \({\rm{\Delta }}{W}^{{\rm{tot}}}\) profiles always in between these two extremes. An example is shown in Fig. 1 for lysine-aspartic acid interacting pairs. The interaction energy presents a clear minimum when the residue side chain centers are about 3–4 Å apart, which corresponds to a salt bridge interaction. Clearly, this interaction appears more favorable in soluble proteins than in aggregation-prone proteins, which means that they contribute more strongly to the stability of the native structure of soluble proteins.

Figure 1
figure 1

The folding free energy contribution \({\rm{\Delta }}W({\rm{Asp}},\,{\rm{Lys}},\,d)\) of the salt bridge interaction Asp-Lys differs according to whether the potentials are derived from soluble or weakly soluble proteins. The energies are in kcal/mol, the distance d (in Å) is computed between residue side chain centroids, and the residues are separated by at least 8 residues along the chain. Distance bins containing ten occurrences or less are not drawn (see Eq. (6)).

The whole set of energy profiles computed with the three types of potentials, for the 210 residue pairs, is shown in Fig. S3 of Supplementary Information. Tables 1 and 2 contain the insolubilizing and solubilizing pair interactions, respectively, which are estimated as statistically significant on the basis of both \( {\mathcal M} \) and \({\mathscr{V}}\), and Tables S2 and S3 those that are significant on the basis of \( {\mathcal M} \) or \({\mathscr{V}}\).

Table 1 Insolubilizing residue-residue interactions, defined by \( {\mathcal M} \, < \,0\) and the strict significance criteria requiring both \(| {\mathcal M} |\) and \({\mathscr{V}}\) values to be higher than 95% of the equivalent quantities computed from randomly shuffled datasets (Sig\( {\mathcal M} \) and Sig\({\mathscr{V}}\ge 0.95\)).
Table 2 Solubilizing residue-residue interactions, defined by \( {\mathcal M} \, > \,0\) and the strict significance criteria requiring both \(| {\mathcal M} |\) and \({\mathscr{V}}\) values to be higher than 95% of the equivalent quantities computed from randomly shuffled datasets (Sig\( {\mathcal M} \) and Sig\({\mathscr{V}}\ge 0.95\)).

In the next two subsections, the pair interactions that contribute most to the increase or decrease of protein solubility are extensively discussed. We grouped and analyzed together the residue pairs that share similar biophysical characteristics, in order to illustrate the solubility dependence of amino acid interactions, provide an overview of their contribution to protein solubility and unravel the underlying physical principles.

Interactions that decrease the solubility

There are 42 residue-residue interactions which are more favorable in aggregation-prone proteins than in soluble proteins (\({{\rm{Sig}}}_{ {\mathcal M} }\ge 0.95\) and \({{\rm{Sig}}}_{{\mathscr{V}}}\ge 0.95\)) (Table 1), and 58 if the less strict statistical significance criterion is used (\({{\rm{Sig}}}_{ {\mathcal M} }\ge 0.95\) or \({{\rm{Sig}}}_{{\mathscr{V}}}\ge 0.95\)) (Table S2).

The first result that falls up when looking at these tables is that almost all insolubilizing interactions involve side chain moieties with delocalized π-electrons37. Indeed, many involve the aromatic residues Phe, Tyr and Trp, as well as His which is also aromatic although usually considered separately as it carries a positive charge under some conditions. These aromatic residues have π-electrons that are delocalized below and above the plane of the aromatic moiety. The other residues that are overrepresented among desolubilizing interactions are: arginine, whose side chain carries a guanidinium cation that has three resonance forms with the positive charge delocalized on three N atoms; aspartic and glutamic acids, which possess a carboxylic acid anion with two resonating forms and the negative charge delocalized on the two O atoms; asparagine and glutamine, whose side chain has a neutral amide group with two resonating forms, one having a partial positive charge on the NH2 group and a partial negative charge on the O atom. We detail in what follows the different types of insolubilizing interactions that satisfied our statistical significance tests.

Aromatic-aromatic or π-π interactions

The interaction between two non-charged amino acids with aromatic side chains (Phe, Trp, Tyr) are known to be essential for the stabilization of protein structure and protein complexes38. Their attraction occurs through the interaction between the aromatic rings that contain delocalized π-electrons. Their interaction geometries are classified in three types, namely T-shaped, face-to-face and off-stacked38. Two kinds of physical forces stabilize these conformations, the electrostatic force that comes from the interaction between the quadrupole moments of the aromatic rings, and the London dispersion force that results from the π-electron delocalization on the ring and the overlap between the π-orbitals of the two aromatic moieties. The face-to-face geometry is mainly stabilized by the London force, which tends to compensate the electrostatic contribution that is unfavorable in this case. In the off-stacked and T-shaped conformations, both the electrostatic and dispersion contributions are stabilizing, which makes them usually more favorable and thus more frequent than face-to-face conformations. Note that the most favorable geometries also depend on the extracyclic atoms and thus on the type of amino acid.

The distance-dependent profiles of the six aromatic-aromatic interaction potentials (Phe-Phe, Phe-Tyr, Phe-Trp, Tyr-Tyr, Tyr-Trp, Trp-Trp) are clearly well separated according to whether they are computed from the soluble or insoluble protein potentials \({\rm{\Delta }}{W}^{{\rm{sol}}}\) and \({\rm{\Delta }}{W}^{{\rm{insol}}}\), as shown in Fig. S3. Since these individual interactions are ruled by the same physical effect, we combined them to define the Phe/Tyr/Trp-Phe/Tyr/Trp group potential; for this purpose, we shifted the inter-residue distances d of the larger residues towards smaller distances by subtracting the difference in radii between the larger amino acid and the smallest residue in the group; the minimum number of occurrences per bin was here chosen to be 20 instead of 10 (see Eq. (6)).

The aromatic-aromatic group potential is shown in Fig. 2A. The large separation between the two profiles, with the profile obtained from the soluble potential above the profile obtained from the insoluble potential for all distance bins, indicates that these interactions tend to reduce the solubility of the proteins, even though they remain important for promoting their structural stability. The minimum of both profiles is located at about 6.3 Å, which corresponds to the usual distance between the side chain centers of two interacting phenylalanines, the smallest aromatic amino acids in this group. The separation of the curves in this distance range is quite high, i.e. around 0.2 kcal/mol, which shows the significantly larger importance of this interaction in aggregation-prone proteins.

Figure 2
figure 2

Residue-residue group potentials derived from datasets of soluble, aggregation-prone and all proteins (\({{\mathscr{D}}}^{{\rm{sol}}}\), \({{\mathscr{D}}}^{{\rm{insol}}}\) and \({{\mathscr{D}}}^{{\rm{tot}}}\)). The energies are in kcal/mol, the distance d is computed between the residue side chain centroids of the smallest amino acids in the group, and the residue pairs are separated by at least 8 residues along the chain. Distance bins containing twenty occurrences or less are not drawn.

His-aromatic or His-π interactions

The aromatic amino acid histidine is quite special as its imidazole ring can be positively charged or neutral depending on the environmental conditions; its pKa is indeed equal to 6.8. Hence, when the histidine is neutral, its aromaticity allows it to form ππ interactions with itself and with the other aromatic residues Phe, Tyr, Trp, as well as cation-π interactions with the positively charged residues Lys and Arg. When the histidine is positively charged, it can play the role of cation in cation-π interactions with aromatic residues Phe, Tyr and Trp. These His-containing interactions are known to substantially contribute to protein stability39.

As expected from the similarity with the aromatic-aromatic interactions described in the previous subsection and the cation-π interactions presented in the next, His-aromatic interactions promote protein aggregation rather than solubility, as shown by the individual pair potentials (Fig. S3) and the group potential His-Phe/Tyr/Trp (Fig. 2B), obtained from the individual pair potentials in the same way as the π-π group potential.

Cation-π interactions

Cation-π interactions in proteins link the aromatic moiety of a Phe, Tyr, or Trp side chain with the cationic moiety of a Lys or Arg side chain, positioned above (or below) the aromatic ring where there is an excess of (delocalized) electrons. This interaction plays an important role in protein stabilization and contributes favorably to protein-protein binding and recognition40,41.

Here we make a distinction between the cation-π interactions involving lysines and arginines, since they differ in their solubility dependence. As shown in Tables 1 and S2 and Fig. S3, the Arg-π interactions are significantly more favorable in aggregation-prone than in soluble proteins, unlike Lys-π interactions; only Lys-Trp satisfies the statistical significance criteria.

The strong insolubilizing nature of Arg-π interactions is clearly shown in the group potential Arg-Phe/Tyr/Trp (Fig. 2C). The difference between the profiles obtained from the soluble and aggregation-prone protein datasets is about 0.2 kcal/mol, and thus highly significant.

The difference in behavior between Arg-π and Lys-π cation-π interactions is rooted in the intrinsic differences between the two positively charged amino acids: the positive charge in Lys is localized on the ammonium group, while the Arg charge is delocalized on the guanidinium group with three resonating forms. Thus in addition to the electrostatic contribution that is similar for Arg-π and Lys-π interactions, Arg-π is stabilized through the overlap of the molecular π-orbitals of Arg and the aromatic side chain, and thus by the London dispersion force42. As in the case of the π-π and π-His interactions, this type of force reduces the solubility and promotes aggregation.

Amino-aromatic or amino-π interactions

Amino-π interactions connect the aromatic side chain of Phe, Tyr or Trp with the side chain amide group of asparagine or glutamine43. The geometry of this interaction is quite similar to that of cation-π interactions, where the partial positive charge δ+ on the amino group of Asn or Gln (in one of the resonating forms) interacts with the δ located above or below the aromatic ring. However, in contrast to cation-π interactions, the electrostatic contribution is unfavorable in Asn/Gln-π. Instead, this interaction is exclusively stabilized by London dispersion forces, which involve electron correlation contributions. Note that the strength of the latter forces in Asn/Gln-π interactions are similar to that in Arg-π42.

The group potential Asn/Gln-Phe/Tyr/Trp is depicted in Fig. 2D. Amino-π interactions are found to be favorable in aggregation-prone proteins, and unfavorable in soluble ones. The distance between the soluble and insoluble energy profiles is here also about 0.2 kcal/mol.

Anion-aromatic or anion-π interactions

Anion-π interactions are established between a residue with an aromatic moiety and a residue with an anionic side chain, i.e. between Phe, Tyr or Trp and Asp or Glu. They are stabilized through anion-quadrupole interactions between the δ+ edge of the aromatic ring and the anion, as well as through the overlap of π-orbitals and thus London interactions. In our analysis, the anion-π interactions, like the other interactions involving delocalized π-electrons, promote insolubility and aggregation44,45.

Note however that we did not include Phe in the anion-π group potential showed in Fig. 2E, as the anion-Phe interactions behave differently from anion-Tyr and anion-Trp. Indeed, anion-Phe interactions are unfavorable in all distance ranges, as we can see in Fig. S3. Moreover, the difference between the profiles derived from soluble and aggregation-prone proteins seems more associated to a distance shift in Asp-Phe, with the residues more closely packed in the soluble proteins. This difference could be due to the marked hydrophobicity of Phe or to the absence of extracyclic atoms whose presence in Tyr and Trp anion-π could provide stabilization effects. Note also that Asp, but not Glu, satisfies the statistical significance criteria (Tables 1 and S2). The Glu-Tyr/Trp interactions show the same trend as Asp-Tyr/Trp but to a lesser extent.

Other interactions

The large majority of the other interactions that promote insolubility have at least one of the interacting residues that contain π-delocalized electrons. Among these, we find sulfur-aromatic interactions between a cysteine and an aromatic residue, especially Phe and Trp. Note that sulfur-aromatic interactions involving a methionine and Phe or Trp also promote insolubility, as seen in Fig. S3, but do not satisfy the statistical significance criteria. In these interactions, the partial negative charge δ on the sulfur group of the side chain of the Cys or Met side chain interacts with the δ+ on the edge of the aromatic ring46.

In this group we also find Arg-Arg interactions, which are obviously unfavorable because of the proximity of the two positive charges, but are significantly less unfavorable in insoluble than in soluble proteins. Again, this can be explained by the London dispersion force contributions due to the overlap of the π-orbitals of the arginines, which is less unfavorable in aggregation-prone proteins.

Similarly, the Asn-Gln interactions - and also the Asn-Asn and Gln-Gln even though they do not satisfy the statistical significance criteria -, which involve London dispersion forces, have more favorable energy profiles when computed from aggregation-prone proteins.

Relative orientation of the interacting π-planes

In view of deepening the understanding of the relation between π-π, His-π, Arg-π and amino-π interactions and solubility, we analyzed the geometry of their conformations in the soluble and insoluble protein datasets \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\). For that purpose, we used an in-house program47 that detects these interactions and characterizes their geometry; in particular, it computes the angle between the π-planes. We found a significantly higher number of such interactions in insoluble than in soluble proteins - in agreement with their more favorable energy profiles -, but no significant difference between their conformational geometries. Thus, for aromatic-aromatic interactions, there does not seem to be a statistically significant preference for T-shaped, face-to-face or off-stacked geometries. There is also no preferred geometry for His-\(\pi \), Arg-\(\pi \) and amino-\(\pi \) interactions.

Interactions that increase solubility

The residue pairs for which the potential derived from soluble proteins is significantly more favorable than the potential derived from aggregation-prone proteins are listed in Tables 2 and S3 and shown in Fig. S3. There are 22 residue-residue interactions of this type with the statistical significance criterion \({{\rm{Sig}}}_{ {\mathcal M} }\ge 0.95\) and \({{\rm{Sig}}}_{{\mathscr{V}}}\ge 0.95\), and 27 if the less strict criterion Sig\({{\rm{Sig}}}_{ {\mathcal M} }\ge 0.95\) or \({{\rm{Sig}}}_{{\mathscr{V}}}\ge 0.95\) is used.

Two main conclusions can be drawn from these tables. The first is that aliphatic residues have the driving role for promoting solubility. Indeed, most interactions involve at least one aliphatic residue. The second conclusion is that lysine-involving salt bridges also favor solubility.

Aliphatic-aliphatic interactions

The four residues alanine, valine, isoleucine and leucine have only C heavy atoms on their side chain and are thus aliphatic. Their hydrophobicity increases with increasing number of C atoms. Ala can thus be found both in the protein core and at the surface, while the Val, Leu and Ile are predominantly in the core. Glycine, which has only an H atom as side chain, is often added to the aliphatic amino acid group.

The subset of aliphatic amino acids which are also hydrophobic (Val, Ile, Leu) are well known to play a fundamental role in the stabilization of the folded protein structure by contributing to the formation of the hydrophobic core48. Indeed, though these residues do not form physical interactions, they cluster together to avoid any contact with the solvent.

Our results show that the effective interactions between aliphatic residues are more favorable as their hydrophobicity increases, and appear stabler in soluble than in aggregation-prone proteins (Fig. S3; Tables 2 and S3). This suggests that the core is more hydrophobic and stable in soluble proteins. This characteristic is likely to help during the folding process to avoid some unwanted interactions between partially folded structures that could lead to aggregation phenomena.

There is, however, a counterexample to this rule: the aliphatic interactions involving leucine have a different behavior than those involving other aliphatic residues. Despite their similar chemical properties, the Leu-Leu interaction does not show any difference whether computed from the soluble or insoluble protein datasets (Fig. S3). This result could be put in relation with the different secondary structure propensities of Leu compared to Ile and Val, and also with its different thermal propensities34, but a deeper investigation is needed to explain this counterintuitive behavior. Therefore, we showed in Fig. 2F the group potential involving only Val and Ile residues.

At first sight, the understanding of the role of the hydrophobicity in promoting solubility seems unclear. Indeed, interactions between hydrophobic aliphatic residues (except Leu) are more frequent in soluble proteins whereas interactions between aromatic residues, which are also hydrophobic, are more frequent in aggregation-prone proteins. Different analyses reported in the literature actually reach contradictory conclusions on the role of hydrophobicity: indications that the average protein hydrophobicity is anti-correlated with its solubility is presented in an early study49, while more recent investigations point out that only exposed hydrophobic patches seems to be related to insolubility49,50. The key result of the present paper that allows reconciling these views is that it is not the hydrophobicity that matters for solubility, but rather the absence or presence of interactions involving delocalized π-electrons.

Note finally that, in an extensive amino acid sequence-based analysis16, no significant difference was observed between the relative content of aliphatic hydrophobic residues (Val, Ile, Leu) in soluble and insoluble proteins. However, the difference in protein length between the sets of soluble and insoluble proteins was overlooked. Indeed, soluble proteins are smaller than insoluble proteins (214 residues versus 287 on the average in the \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\) sets). The percentage of Val, Ile and Leu residues is only marginally different in the two sets: 23.4% in \({{\mathscr{D}}}^{{\rm{sol}}}\) and 22.5% in \({{\mathscr{D}}}^{{\rm{insol}}}\), with a low statistical significance (Kolmogorov-Smirnov P-value = 0.03). However, the percentage of these residues that are in the protein core is about 40.1% and 36.7% in \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\), respectively (Kolmogorov-Smirnov P-value < 10−5). This shows that the number of Val, Ile and Leu residues is about the same, but that the frequency of these residues is higher in the core of soluble proteins than in the core of aggregation-prone proteins.

Lys-containing salt bridges

A salt bridge is a short-range electrostatic interaction formed by two residues of opposite charge. An example of this interaction is shown in Fig. 1 for the Lys-Asp pair computed from the two datasets \({{\mathscr{D}}}^{{\rm{sol}}}\) and \({{\mathscr{D}}}^{{\rm{insol}}}\). The three other salt bridge pairs are Lys-Glu, Arg-Glu, Arg-Asp. The potentials for these four interactions have all a minimum located at an inter-residue distance of about 4 Å, which is the common distance associated to salt bridge formation.

We found that salt bridges involving lysine (Figs 2G and S3) are significantly more favorable in soluble proteins than in weakly soluble ones. For salt bridges involving arginine, on the contrary, no significant difference is observed between the energy profiles derived from both types of proteins.

These results, as well as those of the previous section showing that arginine favors aggregation propensities, are in agreement with the observed tendencies of the lysine/arginine ratio to be well correlated with an increased solubility18. They are also in agreement with the finding that large patches with a net positive charge disfavor protein solubility especially when there is an Arg prevalence in the patch17. The conclusion of the absence of correlation between the solubility and the positively charged residue content, found in16, does not contradict the results of this paper, since no difference is made between the chemical properties of Arg and Lys. Instead, they observed the statistically significant trend that Asp/Glu-rich proteins are more soluble than Asp/Glu-poor ones.

Correlation between solubility and stability

To test how the energies computed with the newly developed solubility-dependent statistical potentials correlate with solubility, we started by computing, for each protein of the \({{\mathscr{D}}}^{{\rm{tot}}}\) set, the three folding free energy values \({\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}\), \({\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}\) and \({\rm{\Delta }}{W}_{S,C}^{{\rm{all}}}\), defined in Eq. (10). These energies and the associated experimental solubility values \({\mathscr{S}}\) are reported in Table S1.

To evaluate the energy-solubility correlation, we used leave-one-out cross validation (see Methods). The Pearson correlation coefficient between the solubility \({\mathscr{S}}\) and the folding free energy values \({\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}-{\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}\) and \({\rm{\Delta }}{W}_{S,C}^{{\rm{all}}}\) are given in Table 3. We also computed the correlation of \({\mathscr{S}}\) with different sequence features, namely the protein length, the isoelectric point and the aliphatic index (defined as the relative volume of a protein occupied by aliphatic side chains)51, as they have been suggested to be related to solubility16,52,53.

Table 3 Correlation between experimental solubility, folding free energies and sequence-derived features.

Interestingly, we found that the folding free energy difference (\({\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}\) − \({\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}\)) correlates with \({\mathscr{S}}\) with quite a high correlation coefficient (r = 0.39), and outperforms all other features tested. This means that, the more favorable the energy computed with the potentials derived from soluble proteins compared to that obtained with aggregation-prone proteins, the more soluble the protein. This constitutes a strong check of the performance and robustness of our solubility-dependent statistical potentials that are able to accurately capture the solubility properties of proteins. Note that the energy \({\rm{\Delta }}{W}_{S,C}^{{\rm{tot}}}\) is also correlated with the solubility (r = 0.20), but less than our new solubility-dependent statistical potentials. Based on these results, we are currently using the energy difference (\({\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}\) − \({\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}\)) as a novel feature in developing a structure-based solubility predictor.

Among the three tested sequence-based features, the protein length has the best score: it is significantly anti-correlated with \({\mathscr{S}}\), with a correlation coefficient r = −0.31. This means that smaller proteins have the tendency to be more soluble, in agreement with earlier findings16,24. Protein length is therefore widely used as a feature in different solubility predictors15,19. Not surprisingly, protein length is anticorrelated with the free energy difference \(({\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}-{\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}})\) (r = −0.33).

Finally, the correlation between \({\mathscr{S}}\) and the two other sequence-based quantities that are commonly considered as related to solubility is rather low. It is positive (r = 0.11) for the aliphatic index, which confirms the trends found from the analysis of aliphatic interactions (see previous subsection). The correlation is negative (r = −0.18) for the isoelectric point, as already observed earlier16,24. The low correlation could be attributed to the fact that no difference is made between Lys and Arg, which yet have different effects on the solubility.

Testing other datasets and solubility definitions

The solubility \({\mathscr{S}}\) (in %) used in this paper is the concentration of the soluble protein fraction over the total concentration of the protein, measured under fixed conditions16, and is possibly affected by the fact that the total concentration is not the same for all proteins. It may differ from the common definition of solubility \({{\mathscr{S}}}_{0}\) (in g/l), which is the concentration of protein in a saturated solution that is in equilibrium with a solid phase. As this quantity is often difficult to measure, a common strategy consists of adding precipitants in various concentrations and extrapolating the results to zero concentration. However, the results may depend on the type of precipitant and the validity of the extrapolation is questionable14.

The first solubility definition (\({\mathscr{S}}\)) was used to derive our solubility-dependent potentials, since it is compatible with large-scale analyses and thus with large datasets16. We assessed the performance of these potentials on other datasets described in the literature, which use the same or other solubility definitions, by computing the linear correlation coefficient r between the solubility values and the energy difference (\({\rm{\Delta }}{W}_{S,C}^{{\rm{insol}}}-{\rm{\Delta }}{W}_{S,C}^{{\rm{sol}}}\)). They have to be compared with \(r=0.39\) obtained in cross validation on the \({{\mathscr{D}}}^{{\rm{tot}}}\) set (Table 3). The results are summarized below:

  • Solubility \({\mathscr{S}}\): another dataset has recently been published, with solubility data of yeast proteins rather than E. coli proteins52. The correlation coefficient r on the subset of 54 proteins for which an experimental structure is available (obtained with the same criteria as for the construction of the \({{\mathscr{D}}}^{{\rm{tot}}}\) set), is equal to 0.41.

  • Solubility \({{\mathscr{S}}}_{0}\): the solubility of TEV protease, eight single mutants and a double mutant has been assayed by concentrating the proteins54. The r-value on this set is as high as 0.70.

  • Solubility \({{\mathscr{S}}}_{0}\) measured using precipitants: the solubility of seven proteins has been estimated using two different precipitants, polyethylene glycol (PEG) and ammonium sulfate14. For six out of the seven proteins, r is equal to 0.40 when the solubility is extrapolated to zero PEG concentration, and to 0.07 when extrapolated at zero ammonium sulfate concentration; this indicates that the type of precipitant has an effect on the measured solubility values. The correlation is much higher (0.59 and 0.67) with the solubilities measured at non-zero precipitant concentration, for both types of precipitant, which suggests possible inaccuracies due to the extrapolation.

Thus, our solubility-dependent potentials appear to be suitable for estimating the solubilities \({\mathscr{S}}\) and \({{\mathscr{S}}}_{0}\) on different datasets, except when the measured values depend too much on some added precipitant.

Conclusion

Even though the structural and stability properties of proteins are of fundamental importance for the biophysical understanding of solubility data, obtained for example from cell-free expression systems, their precise role is not yet clear. Sometimes, the literature even reports contradictory results. Due to the complexity of the problem, it is probably impossible to find a unique mechanism that promotes solubility or aggregation propensities. Instead, these properties are likely to be associated with an intricate combination of physical tendencies that can moreover be protein-, function- or environment-dependent.

In this paper, we tackled the solubility issue by defining new knowledge-based mean force potentials that depend on the protein solubility. They were derived from sets of proteins with known 3D structures and solubility, which were divided into subsets on the basis of their solubility values. These potentials were used to investigate the relation between the amino acid interactions and the solubility propensity. This is possible as these potentials are effective potentials and thus include the impact of the solvent on protein stability. Note that the solubility-dependent potentials that we obtained only marginally depend on the threshold values used for dividing the full protein set into soluble and aggregation-prone proteins. Indeed, as shown in Fig. S4, using stricter threshold values does not modify significantly the potentials.

The main quantitative results that we obtained pinpoint the role of charge delocalization. We indeed found that all the interactions that involve residues with delocalized π-electrons on their side chain disfavor solubility. This is the case of the aromatic residues Phe, Tyr and Trp, of the aromatic and sometimes positively charged residue His, of the positively charged Arg, of Gln and Asn that possess a side chain amide group, and of the negatively charged residues Asp and Glu. These residues make ππ, His-π, cation-π, amino-π, and anion-π interactions, which appear to stabilize more strongly insoluble than soluble proteins. In contrast, the interactions that promote protein solubility are salt bridges that involve Lys, aliphatic-aliphatic interactions, and some aliphatic-containing interactions. Note that none of the latter involve aromatic residues, His, Arg, Asn or Gln. Some however involve Glu or Asp, which indicates that these negatively charged residues promote aggregation only when interacting with other π-systems.

The biophysical explanation of these results is not totally clear. However, we can argue that interactions involving delocalized π-electrons are more prone to occur across protein-protein interfaces, and thus lead to aggregation phenomena. The frequent occurrence of cation-π and π-π interactions in protein-protein interactions has already been discussed38,55. In contrast, interactions between hydrophobic aliphatic residues are likely to favor the stability of the hydrophobic core in the folding process, hence avoid dangerous interactions between partially folded structures, and promote protein solubility. To check and fully understand these tendencies and interpretations, other experiments and/or quantum chemistry calculations are needed.

The present analysis is mainly focused on solubility values on the E. coli proteome, but our solubility-dependent potentials were also tested on the yeast proteome52 as well as on smaller datasets where the solubility is defined and experimentally measured in different ways14,54. The results are quite encouraging, but need to be further analyzed in view of setting up an efficient solubility predictor. Other features should possibly also be taken into account, such as the presence of intrinsically disordered sequence regions, which seem to favor aggregate formation in eukaryotes52.

The understanding of the solubilization and aggregation mechanisms and the role of specific residue interactions has a lot of extremely useful applications in rational protein design studies. Indeed, the solubility is often a bottleneck in academic, medical and industrial processes that require high concentrations of proteins. Although the present study is far from solving completely the solubility and aggregation issues, it is a significant step forward in this direction.