Introduction

Mutation is a phenomenon observed in living cells. It is considered a key feature of evolution, modifying the structure of proteins as well as their biological activity.

Modifying protein structure in order to generate proteins with a desired biological function is currently a very active area of research.

The consequences of point mutations have been reported in the context of the unfolding process [1, 2]. Analysis of the temperature-jump-induced transition state in the unfolding dynamics of the wild-type (WT) and mutant forms of ubiquitin, a downhill protein, revealed the existence of an intermediate state in its thermal unfolding [3–5]. The influence of particular mutations on the unfolding process was examined for titin, revealing that the I27 mutation has the opposite effect on protein stability to that of Y9P [2]. Decreased pressure and temperature stability, together with the crystal structures, were determined experimentally for the bovine pancreatic ribonuclease A variants V47A, V54A, V57A, I81A, I106A, and V108A, revealing an individual response to each mutation [6].

A database collecting mutant forms has been organized to integrate the structures changed upon mutation (http://bioinformatics.eas.asu.edu/sprouts.html) [7]. The linearly forced elastic network model (LFENM), used to characterize mutational effects on structure, proved to be a general tool for recognizing the observed pattern of structural divergence, revealing that the normal modes dominate structural changes [8]. I-Mutant2.0 is a support vector machine (SVM)-based tool for the automatic prediction of protein stability changes upon single point mutations. I-Mutant2.0 can be used both as a classifier for predicting the sign of the protein stability change upon mutation and as a regression estimator for predicting the related ΔΔG values. The web interface allows the selection of a predictive mode that depends on the availability of the protein structure and/or sequence [9]. Cross-validated tests of a computational classifier, a support vector machine (SVM) applied to select the highly informative features giving the best predictability of the functional annotation of nucleotide sequences, were presented in [10, 11]. The influence of mutation on the folding process has also been analyzed [12, 13].

The object of analysis in this work is a set (the largest one found in the PDB) of different forms of antifreeze proteins. An attempt is made to present a general model for the quantitative and qualitative measurement of the consequences of mutations. The structural changes are analyzed with respect to an in silico model of the folding process. The two-step model, which treats the folding process as mediated by two intermediates (between the unfolded state and the native one), is applied for comparative structural analysis [14, 15]. The structure of the first intermediate, called the early stage (ES), is assumed to be generated solely according to backbone conformation [16]. Traces of the ES intermediate characteristics are measured in the structures of the proteins under consideration. The late stage (LS) intermediate is assumed to be generated under the influence of an external force field of hydrophobic character, expressed by a three-dimensional Gauss function representing the structure of the hydrophobic core [17]. The accordance of the protein structure with the idealized hydrophobic core (the highest hydrophobicity density in the center of the protein, decreasing with increasing distance from the center and reaching zero on the surface), and the changes in this accordance, are used to express the structural and functional changes. The biological activity seems to be affected by changes in the hydrophobic core structure.

Materials and methods

Two-step protein folding process

The protein folding process has been recognized experimentally as a multi-step process with an unknown number of intermediates [14, 15]. The model presented in this work assumes a two-step process:

$$ {\text{U}} \Rightarrow {\text{ES}} \Rightarrow {\text{LS}} \Rightarrow {\text{N}}, $$

where U – unfolded, ES – early stage, LS – late stage, and N – native structural form.

Early stage model

This model assumes the dominant role of the backbone, the conformation of which is expressed by two geometric parameters [15, 16]. The first is the V-angle, the dihedral angle between two sequential peptide bond planes, whose value is close to 0 deg for helical forms and close to 180 deg for extended and β-like structures. The second, which seems to be determined by the first, is the radius of curvature R of the polypeptide fragment (pentapeptide), which is small for helical structures and large for β-structural forms. The relation between these two parameters, which may be approximated by a second-degree polynomial,

$$ \ln (R) = 0.0003{V^2} - 0.02009V + 0.848, $$
(1)

determines the optimal path on the Ramachandran map, which is considered the complete conformational space. The elliptical path on the Phi-Psi map links the locations of all secondary structures. This path is assumed to represent the limited conformational sub-space available for the backbone in the ES step of the folding process. The agreement between the model and the protein is estimated by calculating the average distance (D average) between the projected value of the radius of curvature and the observed one for the V-angle value found for each residue in the polypeptide chain. The graphic interpretation of the ES model is given in Fig. 1.

Fig. 1

The ES model definition. (a) The Ramachandran map with the low-energy area distinguished. (b) The relation between the V-angle (dihedral angle between two sequential peptide bond planes) and R, the radius of curvature (on a logarithmic scale to avoid large values for β-structural forms), as calculated for structures belonging to the low-energy fragments of the Ramachandran map (shown in a), together with the approximation function (2nd degree polynomial). (c) The Ramachandran map with points representing the structures accordant with the approximation function shown in (b). (d) The ellipse path assumed to represent the limited conformational sub-space for the early-stage intermediate. (e) The ellipse path linking the areas of all secondary structures
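The comparison against Eq. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the per-residue distance D is taken to be the absolute difference between the observed and the idealized ln(R) at the same V-angle, which is one plausible reading of the description above.

```python
# Coefficients of Eq. 1: ln(R) = 0.0003*V^2 - 0.02009*V + 0.848 (V in degrees).
A, B, C = 0.0003, -0.02009, 0.848


def expected_ln_r(v_deg):
    """Idealized ln(R) for a given V-angle in degrees, following Eq. 1."""
    return A * v_deg ** 2 + B * v_deg + C


def d_average(v_angles, ln_r_observed):
    """Mean distance between observed and idealized ln(R).

    Assumption (not stated explicitly in the text): the per-residue
    distance is |ln(R)_observed - ln(R)_expected| at the same V-angle.
    """
    if len(v_angles) != len(ln_r_observed) or len(v_angles) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [abs(obs - expected_ln_r(v))
             for v, obs in zip(v_angles, ln_r_observed)]
    return sum(diffs) / len(diffs)


def consistent_with_es(v_angles, ln_r_observed, threshold=1.0):
    """Apply the arbitrary D average < 1 criterion used in this work."""
    return d_average(v_angles, ln_r_observed) < threshold
```

With this reading, a residue lying exactly on the idealized curve contributes zero, and a chain is classified as consistent with the ES model when the mean deviation stays below one logarithmic unit.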

Late stage model

The tertiary structure of the protein in the LS step of the folding process is assumed to be reached during the generation of the hydrophobic core, with simultaneous optimization of all other non-bonding interactions (electrostatic, vdW and torsional potential). The presence of an external force field is expressed via the three-dimensional Gauss function [17]. The model extends the original one introduced by Kauzmann [18]. The force field simulates the hydrophobic core of the “fuzzy oil drop” model, with the highest concentration of hydrophobicity in the center of the ellipsoid, decreasing with distance from the center and reaching zero on the surface of the “drop”, according to the Gauss function:

$$ \widetilde{H}{t_j} = \frac{1}{{\widetilde{H}{t_{sum}}}}{\text{exp}}\left( {\frac{{ - {{\left( {{x_j} - \overline x } \right)}^2}}}{{2\sigma _x^2}}} \right){\text{exp}}\left( {\frac{{ - {{\left( {{y_j} - \overline y } \right)}^2}}}{{2\sigma _y^2}}} \right){\text{exp}}\left( {\frac{{ - {{\left( {{z_j} - \overline z } \right)}^2}}}{{2\sigma _z^2}}} \right) $$
(2)

where \( \overline x ,\overline y ,\overline z \) are the coordinates of the geometric center of the molecule. The center is usually located at the origin of the coordinate system, so these values can be considered equal to zero. The size of the molecule is expressed by the triple σx, σy, σz, calculated for each molecule individually after the molecule has been oriented so that its longest inter-effective-atom distance lies along the appropriate coordinate system axis. The σ values are calculated as 1/3 of the longest distance between two effective atoms measured along each axis. The value of the Gauss function at any point of the protein body is treated as the idealized hydrophobic density defining the hydrophobic core.
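A minimal NumPy sketch of Eq. 2 follows. It assumes that the effective-atom coordinates have already been oriented as described above, and it takes σ literally as one third of the extent along each axis; the function name and input layout are illustrative choices, not part of the original method description.

```python
import numpy as np


def idealized_hydrophobicity(coords):
    """Normalized 3-D Gauss values (Eq. 2) at the effective-atom positions.

    coords: (N, 3) array of effective-atom coordinates, assumed to be
    already oriented along the coordinate axes (assumption of this sketch).
    """
    coords = np.asarray(coords, dtype=float)
    # Move the geometric center to the origin, so x-bar = y-bar = z-bar = 0.
    centered = coords - coords.mean(axis=0)
    # Sigma: 1/3 of the longest distance between effective atoms on each axis.
    sigma = (centered.max(axis=0) - centered.min(axis=0)) / 3.0
    if np.any(sigma == 0):
        raise ValueError("molecule has zero extent along at least one axis")
    # Product of the three exponentials equals exp of the summed exponents.
    exponent = -(centered ** 2) / (2.0 * sigma ** 2)
    ht = np.exp(exponent.sum(axis=1))
    # Dividing by the sum plays the role of the 1/Ht_sum factor.
    return ht / ht.sum()
```

The returned values sum to one, which is what allows them to be compared later with the observed distribution as if both were probability distributions.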

The idealized hydrophobicity at any point of the “fuzzy oil drop” can be calculated according to the Gauss function for the molecule located with its geometric center at the origin of the coordinate system. The empirical hydrophobicity distribution, on the other hand, is calculated according to the function presented by Levitt [19]:

$$ \widetilde{H}{o_j} = \frac{1}{{\widetilde{H}{o_{sum}}}}\sum\limits_{i = 1}^N {\left( {H_i^r + H_j^r} \right)\left\{ {\begin{array}{*{20}{c}} {\left[ {1 - \frac{1}{2}\left( {7{{\left( {\frac{{{r_{ij}}}}{c}} \right)}^2} - 9{{\left( {\frac{{{r_{ij}}}}{c}} \right)}^4} + 5{{\left( {\frac{{{r_{ij}}}}{c}} \right)}^6} - {{\left( {\frac{{{r_{ij}}}}{c}} \right)}^8}} \right)} \right]{\text{ for }}{r_{ij}} \leqslant c} \hfill \\ {0{\text{ for }}{r_{ij}} > c} \hfill \\ \end{array} } \right.,} $$
(3)

where N is the number of amino acids in the protein (the number of grid points), \( H_i^r \) is the hydrophobicity of the i-th residue according to the accepted hydrophobicity scale (the Aboderin scale was applied in this work [20]), \( r_{ij} \) is the distance between the i-th and j-th interacting residues, and c is the cutoff distance, which following the original paper [19] is assumed to be 9 Å. The values of \( \widetilde{H}{o_j} \) are standardized by dividing them by the coefficient \( \widetilde{H}{o_{sum}} \), which is the sum of all hydrophobicities attributed to grid points.
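Eq. 3 can be illustrated with the following NumPy sketch. It follows the formula as written, including the i = j term of the sum, and assumes that the per-residue hydrophobicity values are non-negative so that the normalization is meaningful. The function name and the example values are hypothetical.

```python
import numpy as np


def observed_hydrophobicity(coords, h, cutoff=9.0):
    """Normalized observed hydrophobicity (Eq. 3) for each effective atom.

    coords: (N, 3) effective-atom positions in angstroms.
    h:      (N,) per-residue hydrophobicity values (assumed non-negative).
    cutoff: the distance c; 9 angstroms as in the text.
    """
    coords = np.asarray(coords, dtype=float)
    h = np.asarray(h, dtype=float)
    # Pairwise distances r_ij between all effective atoms.
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=2))
    x2 = (r / cutoff) ** 2
    # Polynomial switching function: 1 at r = 0, exactly 0 at r = c.
    weight = 1.0 - 0.5 * (7 * x2 - 9 * x2 ** 2 + 5 * x2 ** 3 - x2 ** 4)
    weight = np.where(r <= cutoff, weight, 0.0)
    # (H_i + H_j) for every pair; the sum runs over i for each j and,
    # as in Eq. 3 as written, includes the i = j term.
    pair_h = h[:, None] + h[None, :]
    ho = (pair_h * weight).sum(axis=0)
    # Dividing by the sum plays the role of the 1/Ho_sum factor.
    return ho / ho.sum()
```

For two residues farther apart than the cutoff, only the self terms survive, so the result is simply proportional to the individual hydrophobicity values.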

Hydrophobicity distribution in the molecule under consideration appeared to be highly consistent with the idealized one. However, the irregularities observed in many proteins appeared to be target-oriented and related to active sites, such as ligand binding sites or enzymatic active sites.

Kullback-Leibler information entropy

The accordance between the idealized and the observed hydrophobicity distribution is measured according to the Kullback-Leibler relative (divergence) entropy [21], which quantifies the distance between two distributions. The distance between the observed and the theoretical (O/T) distribution was calculated. This value can be interpreted only relative to other solutions. The random distribution of hydrophobicity therefore represented the border case, for which the distance (O/R) was calculated. The relation O/T < O/R was taken as evidence of a non-random distribution close to the theoretical one. The distance entropy is defined as

$$ D_{KL}\left( p \mid p^0 \right) = \sum\limits_{i = 1}^N p_i \log_2 \left( p_i / p_i^0 \right), $$
(4)

where \( D_{KL} \) is the distance entropy, \( p_i \) is the probability of a particular observed event, and \( p_i^0 \) is the probability in the reference distribution. The index i denotes a particular amino acid, and N denotes the number of amino acids in the polypeptide chain.
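The O/T versus O/R comparison based on Eq. 4 can be sketched as follows. Two assumptions are made here and are not taken from the text above: the random reference distribution R is modeled as uniform (1/N for every residue), and terms with a zero observed probability are skipped (the usual convention that 0·log 0 = 0). The function names are illustrative.

```python
import numpy as np


def kl_divergence(p, p0):
    """Kullback-Leibler distance entropy of Eq. 4, in bits (log base 2).

    p:  observed distribution; p0: reference (target) distribution.
    Both are assumed normalized, with p0 > 0 wherever p > 0.
    """
    p = np.asarray(p, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    mask = p > 0  # zero-probability terms contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / p0[mask])))


def accordance_with_ls(observed, theoretical):
    """Return (O/T, O/R, O/T < O/R) for one protein.

    Assumption: the random distribution R is uniform, 1/N per residue.
    """
    observed = np.asarray(observed, dtype=float)
    random_ref = np.full(observed.shape, 1.0 / observed.size)
    o_t = kl_divergence(observed, theoretical)
    o_r = kl_divergence(observed, random_ref)
    return o_t, o_r, o_t < o_r
```

Because the divergence is not symmetric, the order of the arguments matters: the observed distribution is always the first argument and the target distribution the second.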

Results

The structural analysis of the mutants is performed with respect to the ES and LS structural characteristics using the VR model and the “fuzzy oil drop” model, with the distance entropy applied as a quantitative measure of the structural differences between the two structures under consideration.

Structural analysis of proteins under consideration

A structural analysis of proteins under consideration with respect to the ES and LS is presented in Table 1.

Table 1 The ES and LS characteristics of the proteins under consideration. The position of mutations is given in the second column, followed by the value of D (distance), which is a measure of the accordance with the adopted model of the ES intermediate. D average expresses the mean distance between the projected and observed values of the parameters that describe the structure of the ES intermediate. Proteins with D average values below 1.0 are considered consistent with the ES model. A protein satisfying the relation O/T < O/R is interpreted as accordant with the LS model. O/T denotes the Kullback-Leibler entropy calculated for the observed (O) distribution of hydrophobicity density with the theoretical one (T) treated as the target distribution, whereas O/R expresses the distance between the observed distribution and the random one (R) treated as the target distribution. For structures determined by NMR, chain A was taken for analysis. Values given in bold denote accordance with the appropriate model

Applicability of the ES model

According to the ES model, structure is generated according to backbone preferences in terms of the V-angle and the radius of curvature R. This is why the values of the V-angle and of the radius of curvature (in logarithmic units), as they appear in the crystal structures of the proteins under consideration, were analyzed against the idealized curve. The distance D between the projected and observed values of the parameters was calculated. It was arbitrarily assumed that proteins with D average below 1 exhibit a structure consistent with the model. However, since only the final (LS stage) structures are available, a D average value above 1 does not imply that the model is inadequate. A low value of D suggests that the structural elements characteristic of the ES structural form have been preserved to a large degree in the native (LS) structure. All helical fragments are present in both the ES and the LS. That is why low values of D average may suggest a large participation of secondary structures of the helical type.

Two proteins representing extreme cases (high and low D values) are shown as examples in Fig. 2, which presents the distribution of the observed values (V, ln(R)) in comparison to the idealized approximation curve.

Fig. 2

The applicability of the ES model to 7MSI and 1MSI, proteins of lower and higher (respectively) discordance with the assumed model, although both of them are treated as representing a structure not accordant with the ES model (D average above 1). Dark blue symbols – theoretical dependence between the V-angle and ln(R); pink squares – observed parameters; yellow triangles – residues for which the difference between the expected and observed values of ln(R) for the particular V-angle exceeds 1.0 unit

The residues with D above 1 are marked in red in the 3-D structures in order to visualize the character of the structural motif which is not consistent with the adopted model (Fig. 3).

Fig. 3

The 3-D presentation of the 7MSI (left) and 1MSI (right) proteins, which differ in their lower and higher (respectively) accordance with the LS model. The fragments marked in white – residues with a difference higher than 1.0 unit, shown in Fig. 2 as yellow triangles. The residues shown in red – mutations versus the wild type

Accordance of the crystal structure with the ES model is not typically expected. The crystal structure is, on the other hand, usually consistent with the LS model, because the ES to LS transition shifts the structure from the optimal backbone conformation toward the presence of a hydrophobic core. Thus, ES characteristics may be lost in the LS intermediate, although this is not always the case. 1J5B is the only such example among the discussed antifreeze proteins (type I). Its structure is entirely helical, and appears to be highly consistent with the ES model. The distribution of hydrophobicity in this molecule is much closer to the random distribution than to the Gaussian one.

Applicability of the LS model

The LS model assumes that hydrophobicity distribution in the protein molecule is consistent with the idealized one, expressed by the three-dimensional Gauss function. The profile showing the hydrophobic interactions collected by effective atoms of each residue as the effect of interactions with other amino acids is shown in Fig. 4.

Fig. 4

The hydrophobic density profile for 3MSI and 9MSI showing the idealized and observed distributions. The proteins were selected to show the lowest and the highest accordance, respectively, between the idealized (T) and observed (O) hydrophobicity distributions. The yellow line shows the random distribution (R). The residues mutated versus the wild type are shown by cyan circles

Fig. 5 gives the 3-D presentation of the two protein molecules selected to represent the best and the worst accordance with the model under consideration, with the residues of strongest hydrophobic interactions (responsible for the generation of the hydrophobic core) marked in white.

Fig. 5

3-D presentation of 3MSI (left) and 9MSI (right), with the residues whose hydrophobicity density differs by more than 0.004 from the expected one given in white. The residues shown in red – the mutated residues

The Kullback-Leibler distance entropy

The accordance between the observed and the idealized hydrophobic density distributions was expressed quantitatively using the Kullback-Leibler distance entropy (as shown in Materials and methods). The values measuring the distance between the observed and idealized (O/T) and the observed and random (O/R) distributions are given in Table 1. The analysis of these values suggests that the structural changes do not influence the status of the structure (accordance with the idealized model is preserved). However, some proteins undergo changes that result in a structure no longer consistent with the adopted model, which suggests that the mutations destroy the hydrophobic core responsible for stabilizing the molecule.

In particular, the mutation at position 16 in 2MSI to 7MSI, relative to 1MSI, appeared to affect the hydrophobic core to such an extent that it lost its initial structure and became inconsistent with the idealized core structure.

The substitution of Pro at positions 64 and 65 with Ala, which is absent in the other investigated proteins and their mutants, suggests that prolines play a critical role as far as hydrophobic core generation is concerned.

The investigated molecules are classified in Table 2 depending on accordance with ES and LS models.

Table 2 Protein classification with respect to the criteria describing/defining the early stage (ES) and late stage (LS) intermediates

Although the majority of the proteins under consideration are very similar (both in terms of sequence and structure), only one (1KDF, a minimized averaged NMR structure) satisfies the conditions of both models (ES and LS). This may suggest that the initial ES intermediate was not destroyed in the transition to LS.

The accordance with the LS model is strongest in the 1KDE structure. The structural fluctuation of dynamic forms seems to be limited by the stabilization imposed by the hydrophobic core (in accordance with the three-dimensional Gauss function).

On the other hand, its four mutants (2MSI, 3MSI, 4MSI, 5MSI) are examples in which mutation prevented the formation of the hydrophobic core, which is present in all other structural forms of the mutants of this protein.

Structural differences in pair-wise comparison

A comparison of the intensity of structural changes upon mutation in relation to other proteins of the same group is shown in Table 3. Such a ranking allows contrastive analysis, even more significantly so in this case due to identical (or similar) polypeptide chain length.

Table 3 Pair-wise comparison of selected mutants (AMI). The values under the diagonal are the RMS-D measurements; the values above the diagonal present the \( D_{KL} \) distance entropy between two proteins (according to the column and row headers)

The LS model based comparative structural analysis was performed using the Kullback-Leibler divergence entropy, treating one of the compared proteins as the target. The values obtained from these calculations were compared with the traditionally used similarity scale expressed by RMS-D values. The appropriate values for selected mutants (group AMI) are given in Table 3.

The correlation coefficient for \( D_{KL} \) versus RMS-D, as calculated using the STATISTICA program, is equal to 0.2268 with p < 0.0001. The graphic presentation of this relation is shown in Fig. 6.

Fig. 6

The relation between traditional similarity measurements expressed as RMS-D values and \( D_{KL} \) measurements. The calculated correlation coefficient is equal to 0.2268, with statistical significance at the level p < 0.0001
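For a table laid out like Table 3, the correlation between the two measures can be recomputed with a few lines of NumPy. This is a sketch only: it assumes a Pearson coefficient (the type of coefficient reported by STATISTICA is not specified above), it pairs each above-diagonal entry with the mirrored below-diagonal entry, and the function name is hypothetical. It does not compute the p-value.

```python
import numpy as np


def rmsd_dkl_correlation(table):
    """Pearson correlation between RMS-D and D_KL for all protein pairs.

    table: square matrix with D_KL above the diagonal and RMS-D below it,
    i.e. the layout described for Table 3.
    """
    t = np.asarray(table, dtype=float)
    if t.ndim != 2 or t.shape[0] != t.shape[1]:
        raise ValueError("table must be a square matrix")
    upper = np.triu_indices_from(t, k=1)
    dkl = t[upper]      # entries [i, j] with i < j
    rmsd = t.T[upper]   # mirrored entries [j, i] for the same pairs
    return float(np.corrcoef(rmsd, dkl)[0, 1])
```

Since the distance entropy depends on which protein is treated as the target, only one direction per pair enters such a table, and the coefficient refers to that choice.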

Conclusions

The molecules presented in this paper are examples of proteins whose structure seems to satisfy the adopted “fuzzy oil drop” model. When folding, these molecules satisfy all the conditions defined by non-bonding interactions with simultaneous hydrophobic core formation. The location of hydrophobic residues in the central part of the molecule and the exposure of hydrophilic residues on the surface are the main tenets of the “oil drop” model introduced by Kauzmann [18]. The Kullback-Leibler entropy [21], which is a measure of the distance between the target distribution (the idealized one) and the one observed in a particular molecule, revealed good accordance of the observed hydrophobicity distribution with the idealized one.

The Kullback-Leibler entropy calculated for different mutants seems to quantitatively express the scale of structural differences in terms of the hydrophobic core structure.

The selected proteins are examples supporting the reliability of the “fuzzy-oil-drop” model. This model reproduces/imitates the mechanism of protein folding. The modification of the “fuzzy oil drop” model for proteins that are not consistent with this model is under consideration.

The loss of the accordance with the ES model in the LS step of protein folding is obvious, although some proteins with highly preserved secondary structures also exhibit this accordance in their late stage structural form.

It is difficult to verify the applicability of the presented model with respect to biological activity of the proteins under consideration. Their biological function requires high solubility, but no specific interactions understood as necessary formation of binding sites. The antifreeze proteins interact non-specifically and their role is to neutralize water’s tendency to be highly organized. The exposure of poorly hydrophobic (i.e., hydrophilic) residues on the protein surface very likely ensures such an effect.

The application of the presented model to the proteins with well-defined active sites may also reveal its ability to locate them. When used for mutants it may estimate the influence of mutation on the potential loss of biological activity [22]. The position of mutation and its relation to the location of residues engaged in biological function may easily be visualized when the \( \Delta \widetilde{H} \) profile is presented (\( \Delta \widetilde{H} \) expresses the difference between expected and observed hydrophobicity revealing the residues of significant difference between observation and the model). Such an analysis was presented in [22].

The influence of mutation on the structure and, subsequently, on biological activity was defined using the hydrophobic density distribution.

When hydrophobicity distribution in the protein molecule is consistent with the idealized one, the protein molecule exhibits high solubility, but no specific biological activity. It had been assumed in the past that such proteins with no biological function do not exist. However, the antifreeze proteins appeared to satisfy the above-mentioned conditions. That is why proteins from this group were selected as examples to visualize different forms of the accordance between the assumed model and the real structure in antifreeze proteins.

The pair-wise differences for mutants appeared to be of much higher magnitude in terms of the relation between the idealized and observed hydrophobicity distributions.

The opposite situation is observed in the group of peroxidases, where the pair-wise comparison reveals far smaller differences.

This paper focused on the applicability of the Kullback-Leibler entropy as a measure of the distance between two distributions.

This method is very simple and seems to be a suitable tool for the automatic analysis of large amounts of data (structures of mutants and/or structures of proteins with equal numbers of amino acids in their polypeptide chains).

The protein 3BDN was used to estimate the applicability of the method to larger proteins (above 200 amino acids) [23].

The application of the Kullback-Leibler entropy to the set of antifreeze proteins revealed the high accordance of the structural characteristics of this group of proteins with the “fuzzy oil drop” model. This suggests that the hydrophobic core in the proteins under consideration has the structure (hydrophobic density distribution) of a three-dimensional Gauss function. A consequence of this observation is that the presence of an external force field in folding process simulation may be treated as a heuristic model for protein folding simulation. Other groups of proteins were also recognized as having structures in accordance with the “fuzzy oil drop” model. They are: fast folding proteins, cold shock proteins and some proteins in the form of homodimers (currently under consideration). A protein whose structure is assumed to represent the early stage step of the folding process, and its native structural form, appeared to be well accordant with the ES and LS models, respectively [24]. The “fuzzy oil drop” model is able to explain the structural differentiation of two homologous proteins of significantly different structure (change of α-helix to the β-structural form). Although all proteins listed as accordant with the “fuzzy oil drop” model are of the category “easy predictable” (according to the CASP classification [25]), the significance of the presented model lies in its general character. The introduction of an external force field and the accordance of the structures of some proteins with the model suggest a significant role of the environment in the folding process.