Shigella are Gram-negative rod-shaped bacteria, which are typically 0.3–1 \(\upmu \)m in diameter and 1–6 \(\upmu \)m in length, appearing singly, in pairs, and in chains, being closely related to E. coli, that invade the colonic mucosa [1]. Shigella are present wherever humans exist, and the direct spread from one person to another is a major route of transmission, which is facilitated by the very low inoculum of organisms required for infection [2]. An individual with dysentery and a presumed Shigella infection should be treated with an appropriate antimicrobial agent when first seen [2]. However, the increasing prevalence of multiple drug-resistant strains has made the selection of an appropriate agent increasingly difficult [3], making Shigella the second leading cause of diarrhoeal mortality in 2016 [4].

Fig. 1
figure 1

AB: The important structural motifs for cathepsin G (CG) and neutrophil elastase (NE) are shown in Panels A and B, respectively. The conserved cysteines are marked black, and the three residues of the catalytic triad, His-Asp-Ser, are shown as dark green. The S1 pockets are shown in red for CG and blue for NE. The glutamic acid belonging to the S1’ sub-site of CG is shown in yellow. The leucines belonging to the S2’ sub-sites of NE and CG are marked cyan. C: The sequences of CG and NE have been aligned [5]; an additional visualization of the sequence alignment is shown in Appendix 9. The different residues are marked following the same color scheme described in panels A and B. The residue numbers follow the chymotrypsinogen numbering, which is used to unify residue numbers of the protease family [6]

The inflammatory response initiated by Shigella is characterized by the recruitment of neutrophils, a part of a family of granulocytes that play an essential role in innate immunity. Neutrophil granules house critical enzymes that initially damage the colonic mucosa and exacerbate the infection but, ultimately, also control the infection by trapping Shigella in a phagosome where it is killed efficiently. Perhaps the best studied among these are the azurophilic granules [7], which contain high levels of bactericidal enzymes including the chymotrypsin-like serine proteases neutrophil elastase (NE) and cathepsin G (CG) [8,9,10,11], which use the Ser195/His57/Asp102 catalytic triad mechanism[12], where serine is the nucleophile, histidine is the general base and acid, and the aspartate helps orient the histidine residue and neutralize the charge that develops on the histidine during the transition states. This arrangement is conserved throughout the serine proteases family. The residue numbers are matched to those from chymotrypsinogen allowing for unifying the nomenclature throughout structures of the chymotrypsinogen protease family [6]. Enzymes of this protease family display a three-dimensional structure consisting of two homologous \(\beta \)-barrels and a C-terminal \(\alpha \)-helix. Each \(\beta \)-barrel contains six anti-parallel \(\beta \)-sheets connected through some linker segments. Residues of the catalytic triad are located at the junction of the two \(\beta \)-barrels, whereas the active site cleft runs perpendicular to this junction [6]. Such an arrangement of amino acids in the active site presumably allows a nucleophilic attack by Ser195 on the carbonyl carbon (C = O) of the substrate at which the hydrolysis cleavage occurs, thus setting off the catalysis process [12].

The following convention for the sections of the two proteins has been adopted from an earlier study [10]. The specific S1 pocket in individual chymotrypsin-like serine proteases is crucial in defining their substrate specificity. In NE, the S1 pocket is hemispherical and hydrophobic because of the presence of valine, phenylalanine, alanine, and the disulfide bridge Cys182-Cys201, see Fig. 1. In contrast, the S1 pocket in CG reveals a significantly stronger electrostatic character due to a negatively charged glutamic acid (S1’ sub-site) at the far end. The S2’ sub-site, which is located in the S2 pocket of the protein (Fig. 1C, cyan dotted lines) of NE and CG, is formed by leucine (Leu106, Leu105) residues (Fig. 1A–B) and the flat side of the imidazole ring of His57 from the catalytic triad, which is rather hydrophobic. X-ray structural analysis showed that CG displays two positively charged residues (Arg41 and Arg143) in close vicinity to the active S2’ sub-site that could contribute to the specificity of the protease [13]. Visualization of the sequence alignment of NE and CG and a graphical illustration of the spatial arrangement of the aforementioned sites is shown in Fig. 1.

The structure-function analysis has established several regions in NE as essential for cleaving Shigella virulence proteins [13, 14]. For example, it was hypothesized that Shigella virulence factor specificity belongs to a distinct region close to NE’s active site. In order to thoroughly determine the reasons for the different functional specificity in NE and CG and to find the subtle indicators of differences between these two proteins, atomistic simulations can be employed. For this purpose, we suggest and utilize a novel protocol to probe the differences in dynamics of the two protein structures, explicitly dealing with structures with a different number of amino acids residues.

The motivation for the study primarily comes from the fact that although NE and CG sequences are only about 32% similar sequence-wise [5], their X-ray crystal structures are remarkably similar [10, 11]. The sequence alignment is shown in Fig. 1 as well as in Appendix 9. Despite that, solely NE is a key host defense protein responsible for controlling Shigella, as well as Salmonella and Yersinia, the causative agents of typhoid fever and the plague [15]. Why does CG lack the NE specificity for bacterial virulence factors? Our results suggest that understanding the role of neutrophils in disease mechanisms might be facilitated by computational methodologies.


Molecular dynamics simulation and stability analysis

The known crystal structure of boar elastase (NE) (pdbID: 1ela) [11] and human cathepsin G (CG) (pdbID: 1cgh) [10] were simulated employing all-atom molecular dynamics (MD) approach.

Molecular dynamics (MD) simulations were conducted using the NAMD [16, 17] software; the interatomic interactions were described using the CHARMM36 force field with CMAP corrections [18,19,20,21,22]. The simulations were set up employing the online-platform VIKING [23]. The same simulation parameters were used for the NE and CG structures. The structures were solvated in a water box with 20 Å padding distance around the protein structure. The system was neutralized and ionized using NaCl to reach a concentration of 0.2 mol/l, representing the typical salt concentration inside a cell. The temperature of the system was set to 315 K, and a pressure of 1 bar was chosen. Following the 10,000 energy minimization steps, equilibration simulations were performed. Equilibrations were carried out for 21 ns at a timestep of 1 fs with varying constraints: In the first nanosecond, everything but water and ions was restrained, while the next 10 ns had a restrained protein backbone, allowing the amino acid side chains to reach a favorable conformation. The last 10 ns of the equilibration protocol was done without any restraints. Following equilibration, the production simulations were carried out for 400 ns with rigid hydrogen bonds in the NVT (constant number of particles, volume, and temperature) ensemble.

The simulated structures were probed for their stability using the radius of gyration, R\(_g\), which was determined for every timestep for the two structures, respectively, by calculating

$$\begin{aligned} R_{g}=\sqrt{\frac{1}{K}\sum _{k=1}^{K}\Vert \mathbf {r}_{i}-\mathbf {r}_{c}\Vert ^{2}}, \end{aligned}$$

with K denoting the number of atoms in the protein, \(\mathbf {r}_i\) being the location of the ith atom and \(\mathbf {r}_c\) the geometric center of the protein.

Small angle neutron scattering

To substantiate the results of MD simulations, small-angle neutron scattering (SANS) was performed to compare the shape of NE and CG dissolved in \(D_2O\). The SANS allows testing of NE and CG conformations without inhibitors, while crystal structures were obtained only with inhibitors [10, 11]. Additionally, the molecular weight of 25.98 kDa and 26.58 kDa for CG and NE, respectively, limits the possibility of using NMR, making SANS a favorable method in this study. Furthermore, although very high protein concentrations were necessary to obtain a reasonable signal-to-noise ratio, enzymatic assay results showed that both proteins were still active even after 72 hours of exposure to the neutron beam.

Experiments were carried out using V4 at the Helmholtz-Zentrum Berlin für Materialien und Energie (Berlin, Germany) in standard geometry at \(25\ ^{\circ }\text {C}\) [24]. The samples were placed in quartz cells of 1 mm thickness and filled with the buffer as immersion liquid. In all cases, circular cadmium diaphragms of 14 mm diameter defined the volume exposed to the incident beam. The samples were measured with a wavelength of 6 Å at three different sample-detector distances of 0.98 m, 4 m, and 12 m, covering a momentum transfer range between 0.04 nm\(^{-1}\) and 3.5 nm\(^{-1}\).

NE and CG were purchased from Elastin Products Company, Owensville, Missouri, USA. Their enzymatic activity was determined per the manufacturer’s protocols. For this experiment, NE and CG were both dissolved in 100 mM Tris D\(_2\)O buffer at a concentration of 10 mg/mL.

Theory of dynamical trajectory comparison

Alignment of two trajectories describing protein dynamics, followed by their comparison, utilized the so-called Kullback–Leibler divergence [25]. The essential idea of the methodology employed is described below.

General aspects

Qualitative comparison of two trajectories describing the dynamics of two proteins requires the structural alignment of both structures that evolve in time. The alignment is meant to diminish the effect of global movements and rotations of the evolving structures in order to focus on internal rearrangements within the protein structures. A common strategy to align two protein structures is the Kabsch algorithm [26], which rotates and translates the structures to achieve the smallest relative root mean square deviation (RMSD) value for the two structures. The Kabsch algorithm requires the same number of residues to be present in the two studied structures, which is not applicable for the NE and CG structures; the NE structure is comprised of 240 residues, while the CG structure contains 224 residues. A multi-step alignment scheme was therefore employed.

At first, a single frame alignment using the MultiSeq tool in VMD [27, 28] was performed, and 134 pairs of residues from both structures were manually selected following a visual inspection. Next, the Kabsch alignment was used to line up the protein structures according to the selected 134 reference residues. The procedure followed earlier studies [29, 30], in which the NE and CG structures were aligned iteratively to their own average structure for all MD timesteps until the changes became insignificant. The aligned trajectory of CG was then aligned to the aligned trajectory of NE for all time steps.

The Kabsch-aligned structures were then used in a less parameter-dependent approach, which permits aligning two protein structures independently of a manual choice of an equal number of residues, utilizing a two-step algorithm based on the Fréchet distance as proposed by Jiang et al. [31]. The Fréchet distance analyzes the position of the backbone atoms in a protein structure and interprets them as an ordered path in the three-dimensional space. While traversing the two paths of the protein backbone, one should then search for the shortest possible connection between the two paths. From these shortest connections, the Fréchet distance is defined as the length of the longest connection. The Fréchet distance has the advantage that the two discrete ordered sequences of data points do not have to be of the same length for a successful Fréchet distance calculation, thus permitting a comparison of the 224 residues of CG directly to the 240 residues of NE. The concept of the Fréchet distance calculation is illustrated in Fig. 2. The orange cylinders show the shortest connections that atoms from one structure can have to the atoms in the other structure. The purple connection marks the longest possible connection from all the possible pairs of atoms. The length of the purple connection is then the result of the Fréchet distance calculation.

Fig. 2
figure 2

The Fréchet distance calculation for two protein backbone structures is shown. The two protein backbones are indicated in red and blue; the spheres show the backbone atoms in order. The orange cylinders show the shortest connections for each atom to an atom in the other structure. From all these shortest connections, the purple cylinder indicates the longest one. The length of this longest connection is the Fréchet distance between the two protein backbones

The two-step algorithm described by Jiang et al. [31] was used to align all backbone atoms in the protein structures. First, both protein structures have been transposed, such that their geometric center of mass coincides with the origin point at (0, 0, 0). The following two steps were repeated for each MD simulation snapshot. In the first step, the first atom (the N atom of the first residue) and the last atom (the C atom of the last residue) of the three-dimensional protein backbone, together with the origin point, form a triangle. Considering such triangles for the NE and CG structures, one performs rotations in a way that the triangles become coplanar and their heights align, as illustrated in Fig. 3. The coplanar triangles guarantee a sensible initial orientation of the two protein backbones relative to each other for comparing the Fréchet distances. The first and the last residues of the compared structures are always considered when traversing the backbone to compute the distance. Therefore, both residues have a great influence on the resulting distance which is initially minimized by the triangle construction. In the second step of the alignment procedure, one structure (CG, reference) is kept rigid while the other (NE, alignee) is rotated. Two atoms from the alignee structure are chosen at random. The rotation of the alignee structure is then attempted by a small angle around the axis defined by the two atoms. The rotation is performed if the overall Fréchet distance between the reference and the alignee structure decreases and is rejected otherwise. The rotation procedure is repeated for a fixed number of iterations.

Fig. 3
figure 3

Triangle alignment. A The first and the last atom of the two protein structures (CG and NE) are shown for a single simulation snapshot, and the triangle defined by these two atoms and the origin (0, 0, 0) are indicated. The triangles are initially not sharing a plane. B After rotation, the triangles are put in the same plane within the three-dimensional space, and the heights of both triangles are set parallel. This configuration is the mandatory starting configuration for the alignment procedure. The backbone of the NE (blue) and the CG (red) proteins are shown as semi-transparent structures

Note that the alignment algorithm is not deterministic, and it is possible that a successive rotation would decrease the Fréchet distance between the two structures without reaching the best relative alignment if the random rotation axis was not chosen optimally. The algorithm always follows the decrease in the Fréchet distance at each iteration, and therefore, it might follow a potentially wrong rotation direction. In order to decrease the chance of a non-optimal fit, the Fréchet alignment was repeated ten times from the initial coplanar triangle configuration. Finally, the alignment with the lowest Fréchet distance was chosen for each simulation snapshot.

In this study, the rotation angle was set to 0.0006 degrees, and 500 iterations were performed to find a better alignment of the two studied structures. The rather rotation small angle was assumed to correct the alignment established with the Kabsch algorithm in the previous alignment step. In each rotation, the average maximum displacement of a single residue was just 0.00014 Å. Alternative strategies that involved multiple rotation angles as well as a scheme, that would alter the rotation angle depending on the improvements achieved in prior steps were also considered. For the size of neutrophil elastase and cathepsin G, such methods, however, did not show any improvement in the merit of the alignment procedure.

Residues coupling: generating relative displacements

Every amino acid residue can be described through its three backbone atoms for the alignment procedure. However, considering all backbone atoms is computationally demanding and largely unnecessary, as the description of every amino acid residue can be further reduced to only the C\(_{\alpha }\) atoms. Their central location in each residue makes the C\(_{\alpha }\) atoms a suitable anchor point for the description of residue motion.

In contrast to the structural alignment, where all atoms of interest for a single simulation snapshot were considered, now every C\(_{\alpha }\) atom for all simulation snapshots is taken into account. To perform comparative measurements, one needs to establish pairs of residues between both structures for which the differences in the position are to be analyzed. These pairs are necessary in order to compare each fragment of the proteins with sufficient precision. Since the two studied structures do not have the same number of residues, the problem can be addressed through the dynamic time warping algorithm [32], which creates a \(224 \times 240\) penalty matrix from the locations of the C\(_{\alpha }\) atoms in CG and NE protein structures. The penalties are assigned by the dynamic time warping algorithm explained in greater detail in Appendix A. In the matrix, every row corresponds to an amino acid residue in CG, while every column corresponds to a residue in NE.

Fig. 4
figure 4

A An exemplary extract from the \(224\times 240\) penalty matrix calculated by the dynamic time warping algorithm. The penalty values (in \(\mathrm{nm}^2\)) are shown for each entry in the matrix, as well as the corresponding residue pairs (which are described by their C\(_{\alpha }\) atoms) for the NE and CG structures. The pairs of atoms to be coupled for the similarity measurement can be found by backtracking through the penalty matrix, which yields the indicated route (red). The route through the matrix can either go up, left, or diagonally up left, always following the direction with the smallest element as highlighted by the cyan square. Within the square, one starts with the pair (23, 23) and continues to the smallest penalty value within the square, which is 0.73, corresponding to the residue pair (22, 22). Note, however, that a single residue can be coupled with more than one residue in the other structure, e.g., residue number 17 in CG is coupled with residue numbers 17 and 18 in NE; see the purple circle. B The backbone of NE (blue) and CG (red) is shown for the first residues of the protein structures. The yellow connections show the pairs between the C\(_{\alpha }\) atoms of CG and NE. The atoms between the linked C\(_{\alpha }\) are the nitrogens and carbons that form the backbone of the protein structures. An example of a double paired connection is indicated by the purple circle

Each element in the penalty matrix corresponds to a pair of residues, assigning one residue from each structure. The procedure of finding the suitable residue pairs is therefore equivalent to establishing a route through the matrix, which passes every row and every column at least once. For example, for a square penalty matrix, i.e., describing two protein structures with the same number of residues, the route could follow the matrix’ diagonal. Since the studied proteins (NE and CG) have a different number of residues, a more elaborate route through the penalty matrix has to be determined. A fragment of the route for the first nine residues (residue numbers 16 to 24) of the CG and NE structures is shown in Fig. 4D. Assume that the residue 23 in the CG and NE structures should be used for comparison. In the next step, one has to evaluate the penalties that would arise upon computing the structural deviation in the residue pairs (22, 23), (23, 22), and (22, 22), as indicated with the cyan square in Fig. 4A. The relevant pair to be used for comparison should naturally have the lowest penalty value associated with it. In the considered example, the penalty turns out to be 0.73 (being smaller than 1.33 and 1.02), which corresponds to the (22, 22) residue pair. The next pair for comparison is then evaluated from the (22, 22) element of the penalty matrix, and the procedure repeats until the residue pair (16, 16) is reached, which corresponds to the first residues in both protein structures considered in the example. This so-called backtracking algorithm is visualized in Fig. 4A. The choices of residue pairs extracted from the penalty matrix are shown in Fig. 4B, where the backbones of the NE and CG structures are shown in blue and red, respectively. The yellow connections between the C\(_{\alpha }\) atoms illustrate the pairs established from the penalty matrix calculation. A single residue can be matched with multiple residues in the other structure. For instance, residue Ile17 in CG is paired with both Val17 and Gly18 in the NE. The purple circle illustrates the pair in Fig. 4 with both pairs linked in yellow. The pairs of atoms that belong to the different proteins can be used to quantitatively compare the dynamic evolution of the two structures.

A description of the dynamic time warping algorithm with further details on how the penalty matrix is populated and how the backtracking algorithm is performed is given in Appendix A.

The established pairs of atoms are influenced by the dynamics of the proteins during the course of the simulation, and the pairs established for one timestep might not be the best for another. Therefore, the coupled pairs between CG and NE were formed on the basis of every timestep of the performed MD simulations. The similarity between both protein structures was then calculated for each coupling, and the cumulative similarity result is achieved as the arithmetic mean computed for all similarities for all determined residue couplings. The repetition of the process where the couplings are obtained from different snapshots is introduced to circumvent the need of choosing a specific reference frame for the calculation of the penalty matrix couplings.

The Kullback–Leibler divergence

After the proteins have been structurally aligned and the to-be-compared pairs have been established, the comparison of the two structures can commence. In general, the Kullback–Leibler divergence (KLD) [25] permits quantifying a difference between two arbitrary normalized distributions. Let us illustrate how the KLD can be used for the studied protein structures. For the sake of coherence, let us consider the residues marked by the purple ellipsoid in Fig. 4. Consider the residue \(\text {Ile17}_{\text {CG}}\) in CG. For all \(N=80{,}000\) MD snapshots, the location vector for the C\(_{\alpha }\) atom of this \(\text {Ile17}_{\text {CG}}\) residue, containing the x, y, and z coordinates, is then considered:

$$\begin{aligned} \left\{ \begin{pmatrix}x_1\\ y_1,\\ z_1\end{pmatrix}, \begin{pmatrix}x_2\\ y_2,\\ z_2\end{pmatrix}, \begin{pmatrix}x_3\\ y_3,\\ z_3\end{pmatrix}, \cdots , \begin{pmatrix}x_N\\ y_N,\\ z_N\end{pmatrix}\right\} . \end{aligned}$$

The vectors in Eq. (2) permit constructing three one-dimensional arrays by grouping all the x, y, and z coordinates, respectively:

$$\begin{aligned}&X=\{x_1,x_2,x_3,\ldots ,x_N\},\ Y=\{y_1,y_2,y_3\dots y_N\},\nonumber \\&\quad Z=\{z_1,z_2,z_3,\ldots ,z_N\}. \end{aligned}$$

These three lists are then used to form normalized distributions of the \(x,\ y,\ z\) coordinates. For \(\text {Ile17}_{\text {CG}}\), the distribution of its x coordinates would then be denoted as \(D(\text {Ile17}_{\text {CG}},x)\).

Consider now two residues, e.g., \(\text {Ile17}_{\text {CG}}\) in CG, and \(\text {Val17}_{\text {NE}}\) in NE. Based on the distributions introduced above, one defines \(KLD(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})\) as

$$\begin{aligned}&KLD(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})\nonumber \\&=\sum _{k\in \{x,y,z\}}\sum _{i=0}^{N-1}D(\text {Ile17}_{\text {CG}},k)_i \ln \left( \frac{D(\text {Ile17}_{\text {CG}},k)_i}{D(\text {Val17}_{\text {NE}},k)_i}\right) ,\nonumber \\ \end{aligned}$$

where \(D(\cdot ,k)_i\) denotes the ith discrete entry in the distribution of the kth coordinate [33].

In general, the KLD is not symmetric, i.e.,

$$\begin{aligned} KLD(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})\ne KLD(\text {Val17}_{\text {NE}},\text {Ile17}_{\text {CG}}). \end{aligned}$$

The measure can, however, be made symmetric by considering the arithmetic mean of KLD computed in two directions. Thus, the similarity S between a \(\text {Ile17}_{\text {CG}}\) and \(\text {Val17}_{\text {NE}}\) residues is defined as

$$\begin{aligned}&S(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})\nonumber \\&=\frac{KLD(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})+ KLD(\text {Val17}_{\text {NE}},\text {Ile17}_{\text {CG}})}{2}.\nonumber \\ \end{aligned}$$

As a final step, similarity for every residue has to be evaluated according to the number of its partners. For instance, one notices that \(\text {Ile17}_{\text {CG}}\) is actually paired with both \(\text {Val17}_{\text {NE}}\) and \(\text {Gly18}_{\text {NE}}\). On the other hand, \(\text {Val17}_{\text {NE}}\) is only forming the single pair to \(\text {Ile17}_{\text {CG}}\) as illustrated in Fig. 4. The final similarity value for \(\text {Ile17}_{\text {CG}}\) is therefore calculated as

$$\begin{aligned} \hat{S}(\text {Ile17}_{\text {CG}})=\frac{S(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}})+S(\text {Ile17}_{\text {CG}},\text {Gly18}_{\text {NE}})}{2}, \end{aligned}$$

while the similarity for \(\text {Val17}_{\text {NE}}\) would be defined as

$$\begin{aligned} \hat{S}(\text {Val17}_{\text {NE}})=S(\text {Ile17}_{\text {CG}},\text {Val17}_{\text {NE}}). \end{aligned}$$

In general, if one assumes a residue a in NE to be paired with residues \(b_1,\dots ,b_l\) in CG, then the similarity for the residue a is defined as

$$\begin{aligned} \hat{S}(a)=\frac{1}{l}\sum _{i=1}^{l} S(a,b_{i}). \end{aligned}$$

The similarity of a residue in CG is calculated analogously by considering all the pairs to residues in NE.

Intuitively, one would assume that NE is as different to CG as CG is different to NE and therefore, the overall similarity between the two structures should not depend on whether CG is compared to NE or NE is compared to CG. The symmetry is, however, violated since it is possible to have residues in one protein that couple to several residues in the other protein (see Eq. (6) and (7)). Therefore, similarities in both directions have to be considered for analyses, i.e., CG\(\rightarrow \)NE and NE\(\rightarrow \)CG.

By repeating the KLD procedure for all residue pairs, their similarity can be calculated and analyzed. In summary, the KLD quantifies the difference between two residues in the two studied proteins over the time span of the whole simulation trajectory and yield a single number. The smaller the resulting number is, the more similar the two residues are expected to be.

Results and discussion

Radius of gyration and comparison with small-angle neutron scattering data

The radius of gyration (\(R_g\)) values were calculated for the two simulated NE and CG structures, and their respective time evolution is visualized in Fig. 5. The average radius of gyration is \(17.1\pm 0.1\) Å for CG and \(16.9\pm 0.1\) Å for NE. The values fluctuate narrowly around the average, which indicates stable structures. A difference of 0.2 Å is considered neglectable in terms of \(R_g\) [34], once more demonstrating the overall similarity in structure between the NE and CG proteins. Additionally, the stability of the protein structures was assessed based on analysis of the temporal evolution of the RMSD for the proteins’ backbone as well as the potential energy of each protein structure without the solvent contribution. The small fluctuation in the RMSD as well as the normal distributed energies indicate equilibrated and stable structures. The RMSD and energy data are shown in Appendix Fig. 10.

Fig. 5
figure 5

Radius of gyration, see Eq. (1), for CG and NE obtained from the simulated MD trajectories. Both plots feature oscillations around an average value, indicating CG and NE’s stability throughout the simulations. The average \(\langle R_g\rangle \) value for CG is 17.1 Å and 16.9 Å for NE

The small-angle neutron scattering (SANS) data were used to compare the shapes of NE and CG. The variation of the absolute scattering intensity I with the momentum transfer rate q for the samples in a 10 mg/ml solution is shown in Fig. 6.

Differences between NE and CG are observed primarily for \(q \le 0.02\) Å\(^{-1}\). This region of the scattering curve is sensitive to internal molecular effects, describing protein–protein interactions. Three factors were considered to be important for the proteins’ affinity to form oligomers. NE has a molecular charge of \(4\ \mid e^{-} \mid \), while the total electrostatic charge of CG is \(23\ \mid e^{-}\mid \); both are positively charged, and therefore, repulsive interaction is expected between several proteins. Both proteins, however, also possess a strong dipole moment, which was computed as the average value over the entire MD simulation for both cases; CG exhibits a higher dipole moment value (305 Debye) than NE (238 Debye). Lastly, the solvent-accessible surface area (SASA) was calculated for all hydrophobic residues in both proteins and averaged over the course of the MD simulations. NE exhibits a greater average SASA value of 2990 Å\(^2\), while CG has a value of 2128 Å\(^{2}\). NE might therefore bind more favorably than CG through hydrophobic interactions. Additionally, the electrostatic potential was mapped onto the protein structures as visualized in Appendix Fig. 11. The figure shows that the backbones of the two proteins are folded remarkably similar, while the electrostatic potential around the proteins is clearly different. A notable difference can be spotted near the active triad, where NE exhibits a more pronounced negative potential compared to CG. This conclusion is consistent with the observation of the lower charge of NE as compared to CG. Finally, the existence of a dipole moment in both proteins supports the rationale further. All put together one may conclude that the differences in I(q) for \(q < 0.02\) Å\(^{-1}\) in Fig. 6 arise due to a higher oligomerization rate of NE proteins as compared to CG. Furthermore, applying the power law, \(I(q)\cong (A*q^{-d}+I_0)\), to model the low-q region, yielded a d-value of \(d_{CG} = 3.3 \pm 0.2\) and \(d_{NE}=1.2 \pm 0.1\) for the CG and NE proteins, respectively, which supports the hypothesis that aggregation occurs differently in both samples.

Fig. 6
figure 6

Experimental one-dimensional neutron scattering of NE (blue) and CG (red) versus the momentum transfer q plotted in logarithmic scale. The data are separated into two parts, as indicated by the vertical dashed line. For \(q \ge \) 0.02 Å\(^{-1}\)  CG is fitted using the software SasView [35] with a model described by two lognormal distributions in radius (R) and length (L). For \(q \le \) 0.02 Å\(^{-1}\)  the observed behavior is consistent with the formation of larger aggregates that can be interpreted as protein clamping (aggregation), artistically illustrated on the left, which can be attributed to the difference in charge and hydrophobic binding affinity of NE and CG. On the right, a single NE (blue) and CG (red) structure is superimposed, visualizing the similarity observed in the \(q\ge 0.02\) Å\(^{-1}\) regime

Figure 6 suggests that NE and CG are similar for \(q \ge 0.02\) Å\(^{-1}\), therefore permitting us to discuss the structural complexity of CG only. The CG crystal structure was previously described as an oblate ellipsoid by Hof et al. [10]. The model permitted to fit the scattering curve I(q) using a cylinder model with two lognormal distributions in radius (R) and length (L). For the data in Fig. 6, an average \(R =14.03 \pm 0.10\) Å  and average length \(L = 58.90 \pm 1.50\) Å were obtained, yielding a radius of gyration equal to [36]

$$\begin{aligned} R_{g}=\sqrt{\frac{4\cdot R^{2}}{8}+\frac{L^{2}}{12}}=19.7\pm 0.3{\mathring{\mathrm{A}}}. \end{aligned}$$

The fitting was carried out employing the software SasView [35]. From the SANS data in Fig. 6, the same value is expected for the NE structure as I(q) match for \(q > 0.02\) Å\(^{-1}\).

Comparing the radii of gyration obtained from the MD simulations (\(17.1\pm 0.1\) Å for CG, \(16.9\pm 0.1\) Å for NE) and the SANS data, a reasonable agreement can be observed. Considering the size of the protein structures and their classification as \(\beta \)-proteins based on the SCOP (Structural Classification of Proteins) classification [37, 38], the \(R_g\) value is expected to be between 16.7 Å and 20.0 Å [34]. The radii obtained from the SANS experiment and the MD simulation are within this range.

Dissimilarities between NE and CG

Figure 7 shows the result of the Kullback–Leibler divergence analysis for all residues with their respective partners. The differences comparing CG to NE (Panel A) and NE to CG (Panel B) are shown. Additionally, a visualization of the values shown in Fig. 7 mapped onto the protein structures is supplied in Fig. 8.

Fig. 7
figure 7

Kullback–Leibler divergence computed between CG and NE. A peak in the graph is a point where the similarity exceeds a value of 0.9 quantile indicated by the horizontal black dashed line. The black vertical lines correspond to the three active-site residues for CG and NE, respectively. The gray bars indicate the residues of the three loops that are decisive for the protein’s function [14]. According to the respective Panel, the solid lines (red and blue) show the average similarity and its standard deviation which is mapped onto the protein structures in Fig. 8. The dashed colored line (red and blue) indicates the difference in the original crystal structures, which shows the residues in which the initial configuration already shows minor differences. As these differences are tiny, they were scaled by a factor of 100 for visibility. The lower panels show the root mean square fluctuation (RMSF) of individual residues computed over the individual simulation trajectories, indicating the intrinsically flexible areas. The protein renderings show the respective protein structures color-coded according to the residue number shown on the color scales. The areas highlighted in gray in the protein structures correspond to the three loops colored gray in the plots

As indicated by the gray bars in Fig. 7, the large variance observed for the three loops decisive for individual specificity of the NE and CG enzymes highlights the clear relation between structure and function in the NE and CG proteins. However, no large variance (greater than the 0.9 quantile) was observed in the region of the three key residues of the active site, namely His57, Asp102, and Ser195, as indicated by the labeled vertical black lines. On the other hand, while both His57 and Asp102 are located in a trough, i.e., residues at which the two structures are more similar, the similarity of CG and NE at the Ser195 residue decreases (as seen through a corresponding peak in Fig. 7). In the active state, Ser195 is expected to undergo a slight conformational change to stabilize the catalytic process [39, 40]. As NE and CG were simulated inactively, the slight difference between the position of Ser195 could explain the difference in the functionality of the proteins and might therefore facilitate or hinder the nucleophilic attack that initiates the catalysis process.

Fig. 8
figure 8

The two protein structures are shown individually. Each protein is colored according to its similarity value calculated by means of the Kullback–Leibler divergence and shown in Fig. 7. Residues that are very similar are colored white, while a color scale in blue for NE and red for CG is applied, corresponding to the similarity values

Lastly, Fig. 7 shows a dashed line for CG (red) and NE (blue), which marks the Kullback–Leibler divergence for CG and NE for the crystal structure only. If one were to see differences in the crystal structure, it could indicate a predisposition to a conformational change. However, the differences between the crystal structures are not only visually small but also based on the similarity measure. Only by scaling the difference by a factor of 100, they become detectable in Fig. 7.


Neutrophil serine proteases are essential for the function of the neutrophils, and although they are relatively well-characterized, several important unanswered questions remain. In particular, even though the role of NE and CG in antibacterial defense is relatively well-established, their extended specificities have not yet been fully determined.

One main question comes from the fact that although X-ray crystallography reveals that the structures of the inhibited proteins are very similar, they have different biophysical specificity; only NE specifically destroys Shigella virulence factors [8]. Given that the NE and CG molecules are flexible and soft, small changes within the molecular structure as well as slight differences in electrostatic potentials, dipole moments and charges may play a significant role for their functionality. The similarity in the crystal structures of NE and CG could stem from the introduction of inhibitors, which are necessary for a successful determination of the structure; the specific inhibitors for NE and CG mimic the peptide substrates and block the enzymatic activity of NE and CG. To test the hypothesis that the conformations of the two studied proteins, while not inhibited and in solution, are in reality different, an experiment was performed comparing NE and CG using small-angle neutron scattering (SANS). Due to the natural aggregation of the NE system, the information that can be extracted from the SANS data becomes difficult to be evaluated. To overcome such difficulties, computational methodologies were used to relate the structure and function of NE and CG. The basic idea is that subtle conformational changes might not be easily detected through (even advanced) experimental approaches. To achieve this goal, we have developed a matching method that uses a dynamic time warping algorithm enabling a pairwise evaluation of the similarity of the protein dynamics trajectory based on the symmetrized mean Kullback–Leibler divergence. Furthermore, it was demonstrated that the experimentally determined radius of gyration, \(R_g\), obtained for CG is close to the value calculated from the MD simulations, implying that the value calculated for NE has also been established reliably and that the NE and CG structures are expected to be similar in solution. The MD simulation allowed the identification of subtle changes in the dynamics of the two proteins. These subtle differences between NE and CG in the solvent could be responsible for the difference in proteins’ specificity. Additionally, it has been established that the electrostatic nature of the active sites varies significantly between NE and CG, facilitating the difference in their catalytic behavior [14].

In conclusion, we argue that elaborate analyses based on MD simulations provide a detailed molecular understanding of the two studied protein systems and that not only electrostatic, but also conformational differences can be established, even if the crystal structures are very similar. Of course, the MD analyses have to be backed by experimental data, e.g., SANS data, and vice versa. The MD studies can suggest, for instance, possible mutations to be tested to avoid unnecessary costly experiments.