The same, but different, but still the same: structural and dynamical differences of neutrophil elastase and cathepsin G

Although the general mechanism for serine protease catalysis is well established, some questions still remain. For instance, the two enzymes, neutrophil elastase and cathepsin G, have a lot of structural resemblances. However, elastase degrades virulence factors, while cathepsin G does not. This paper studies both enzymes computationally to probe for their conformational differences. In the process, a methodology is established to not only quantify similarities between the protein trajectories describing proteins’ temporal evolution but also account for a varying number of amino acid residues comprising each structure. Our results indicate slight differences in the behavior of the active sites of neutrophil elastase and cathepsin G in the solvent. These subtle changes could indicate differences in the general behavior responsible for the different specificity of the two enzymes.


Introduction
Shigella are Gram-negative rod-shaped bacteria, which are typically 0.3-1 µm in diameter and 1-6 µm in length, appearing singly, in pairs, and in chains, being closely related to E. coli, that invade the colonic mucosa [1]. Shigella are present wherever humans exist, and the direct spread from one person to another is a major route of transmission, which is facilitated by the very low inoculum of organisms required for infection [2]. An individual with dysentery and a presumed Shigella infection should be treated with an appropriate antimicrobial agent when first seen [2]. However, the increasing prevalence of multiple drug-resistant strains has made the selection of an appropriate agent increasingly difficult [3], making Shigella the second leading cause of diarrhoeal mortality in 2016 [4].
The inflammatory response initiated by Shigella is characterized by the recruitment of neutrophils, a part of a family of granulocytes that play an essential role in innate immunity. Neutrophil granules house critical enzymes that initially damage the colonic mucosa and exacerbate the infection but, ultimately, also control the infection by trapping Shigella in a phagosome where it is killed efficiently. Perhaps the best studied a e-mail: ilia.solovyov@uol.de (corresponding author) among these are the azurophilic granules [7], which contain high levels of bactericidal enzymes including the chymotrypsin-like serine proteases neutrophil elastase (NE) and cathepsin G (CG) [8][9][10][11], which use the Ser195/His57/Asp102 catalytic triad mechanism [12], where serine is the nucleophile, histidine is the general base and acid, and the aspartate helps orient the histidine residue and neutralize the charge that develops on the histidine during the transition states. This arrangement is conserved throughout the serine proteases family. The residue numbers are matched to those from chymotrypsinogen allowing for unifying the nomenclature throughout structures of the chymotrypsinogen protease family [6]. Enzymes of this protease family display a three-dimensional structure consisting of two homologous β-barrels and a C-terminal α-helix. Each β-barrel contains six anti-parallel β-sheets connected through some linker segments. Residues of the catalytic triad are located at the junction of the two β-barrels, whereas the active site cleft runs perpendicular to this junction [6]. Such an arrangement of amino acids in the active site presumably allows a nucleophilic attack by Ser195 on the carbonyl carbon (C = O) of the substrate at which the hydrolysis cleavage occurs, thus setting off the catalysis process [12].
The following convention for the sections of the two proteins has been adopted from an earlier study [10]. The important structural motifs for cathepsin G (CG) and neutrophil elastase (NE) are shown in Panels A and B, respectively. The conserved cysteines are marked black, and the three residues of the catalytic triad, His-Asp-Ser, are shown as dark green. The S1 pockets are shown in red for CG and blue for NE. The glutamic acid belonging to the S1' sub-site of CG is shown in yellow. The leucines belonging to the S2' sub-sites of NE and CG are marked cyan. C: The sequences of CG and NE have been aligned [5]; an additional visualization of the sequence alignment is shown in Appendix 9. The different residues are marked following the same color scheme described in panels A and B. The residue numbers follow the chymotrypsinogen numbering, which is used to unify residue numbers of the protease family [6] The specific S1 pocket in individual chymotrypsinlike serine proteases is crucial in defining their substrate specificity. In NE, the S1 pocket is hemispherical and hydrophobic because of the presence of valine, phenylalanine, alanine, and the disulfide bridge Cys182-Cys201, see Fig. 1. In contrast, the S1 pocket in CG reveals a significantly stronger electrostatic character due to a negatively charged glutamic acid (S1' sub-site) at the far end. The S2' sub-site, which is located in the S2 pocket of the protein (Fig. 1C, cyan dotted lines) of NE and CG, is formed by leucine (Leu106, Leu105) residues ( Fig. 1A-B) and the flat side of the imidazole ring of His57 from the catalytic triad, which is rather hydrophobic. X-ray structural analysis showed that CG displays two positively charged residues (Arg41 and Arg143) in close vicinity to the active S2' sub-site that could contribute to the specificity of the protease [13]. Visualization of the sequence alignment of NE and CG and a graphical illustration of the spatial arrangement of the aforementioned sites is shown in Fig. 1.
The structure-function analysis has established several regions in NE as essential for cleaving Shigella virulence proteins [13,14]. For example, it was hypothe-sized that Shigella virulence factor specificity belongs to a distinct region close to NE's active site. In order to thoroughly determine the reasons for the different functional specificity in NE and CG and to find the subtle indicators of differences between these two proteins, atomistic simulations can be employed. For this purpose, we suggest and utilize a novel protocol to probe the differences in dynamics of the two protein structures, explicitly dealing with structures with a different number of amino acids residues.
The motivation for the study primarily comes from the fact that although NE and CG sequences are only about 32% similar sequence-wise [5], their X-ray crystal structures are remarkably similar [10,11]. The sequence alignment is shown in Fig. 1 as well as in Appendix 9. Despite that, solely NE is a key host defense protein responsible for controlling Shigella, as well as Salmonella and Yersinia, the causative agents of typhoid fever and the plague [15]. Why does CG lack the NE specificity for bacterial virulence factors? Our results suggest that understanding the role of neutrophils in disease mechanisms might be facilitated by computational methodologies.
Molecular dynamics (MD) simulations were conducted using the NAMD [16,17] software; the interatomic interactions were described using the CHARMM36 force field with CMAP corrections [18][19][20][21][22]. The simulations were set up employing the onlineplatform VIKING [23]. The same simulation parameters were used for the NE and CG structures. The structures were solvated in a water box with 20Å padding distance around the protein structure. The system was neutralized and ionized using NaCl to reach a concentration of 0.2 mol/l, representing the typical salt concentration inside a cell. The temperature of the system was set to 315 K, and a pressure of 1 bar was chosen. Following the 10,000 energy minimization steps, equilibration simulations were performed. Equilibrations were carried out for 21 ns at a timestep of 1 fs with varying constraints: In the first nanosecond, everything but water and ions was restrained, while the next 10 ns had a restrained protein backbone, allowing the amino acid side chains to reach a favorable conformation. The last 10 ns of the equilibration protocol was done without any restraints. Following equilibration, the production simulations were carried out for 400 ns with rigid hydrogen bonds in the NVT (constant number of particles, volume, and temperature) ensemble.
The simulated structures were probed for their stability using the radius of gyration, R g , which was determined for every timestep for the two structures, respectively, by calculating with K denoting the number of atoms in the protein, r i being the location of the ith atom and r c the geometric center of the protein.

Small angle neutron scattering
To substantiate the results of MD simulations, smallangle neutron scattering (SANS) was performed to compare the shape of NE and CG dissolved in D 2 O. The SANS allows testing of NE and CG conformations without inhibitors, while crystal structures were obtained only with inhibitors [10,11]. Additionally, the molecular weight of 25.98 kDa and 26.58 kDa for CG and NE, respectively, limits the possibility of using NMR, making SANS a favorable method in this study. Furthermore, although very high protein concentrations were necessary to obtain a reasonable signal-to-noise ratio, enzymatic assay results showed that both pro-teins were still active even after 72 hours of exposure to the neutron beam. Experiments were carried out using V4 at the Helmholtz-Zentrum Berlin für Materialien und Energie (Berlin, Germany) in standard geometry at 25 • C [24]. The samples were placed in quartz cells of 1 mm thickness and filled with the buffer as immersion liquid. In all cases, circular cadmium diaphragms of 14 mm diameter defined the volume exposed to the incident beam. The samples were measured with a wavelength of 6Å at three different sample-detector distances of 0.98 m, 4 m, and 12 m, covering a momentum transfer range between 0.04 nm −1 and 3.5 nm −1 .
NE and CG were purchased from Elastin Products Company, Owensville, Missouri, USA. Their enzymatic activity was determined per the manufacturer's protocols. For this experiment, NE and CG were both dissolved in 100 mM Tris D 2 O buffer at a concentration of 10 mg/mL.

Theory of dynamical trajectory comparison
Alignment of two trajectories describing protein dynamics, followed by their comparison, utilized the so-called Kullback-Leibler divergence [25]. The essential idea of the methodology employed is described below.

General aspects
Qualitative comparison of two trajectories describing the dynamics of two proteins requires the structural alignment of both structures that evolve in time. The alignment is meant to diminish the effect of global movements and rotations of the evolving structures in order to focus on internal rearrangements within the protein structures. A common strategy to align two protein structures is the Kabsch algorithm [26], which rotates and translates the structures to achieve the smallest relative root mean square deviation (RMSD) value for the two structures. The Kabsch algorithm requires the same number of residues to be present in the two studied structures, which is not applicable for the NE and CG structures; the NE structure is comprised of 240 residues, while the CG structure contains 224 residues. A multi-step alignment scheme was therefore employed.
At first, a single frame alignment using the MultiSeq tool in VMD [27,28] was performed, and 134 pairs of residues from both structures were manually selected following a visual inspection. Next, the Kabsch alignment was used to line up the protein structures according to the selected 134 reference residues. The procedure followed earlier studies [29,30], in which the NE and CG structures were aligned iteratively to their own average structure for all MD timesteps until the changes became insignificant. The aligned trajectory of CG was then aligned to the aligned trajectory of NE for all time steps. The Kabsch-aligned structures were then used in a less parameter-dependent approach, which permits aligning two protein structures independently of a manual choice of an equal number of residues, utilizing a two-step algorithm based on the Fréchet distance as proposed by Jiang et al. [31]. The Fréchet distance analyzes the position of the backbone atoms in a protein structure and interprets them as an ordered path in the three-dimensional space. While traversing the two paths of the protein backbone, one should then search for the shortest possible connection between the two paths. From these shortest connections, the Fréchet distance is defined as the length of the longest connection. The Fréchet distance has the advantage that the two discrete ordered sequences of data points do not have to be of the same length for a successful Fréchet distance calculation, thus permitting a comparison of the 224 residues of CG directly to the 240 residues of NE. The concept of the Fréchet distance calculation is illustrated in Fig. 2. The orange cylinders show the shortest connections that atoms from one structure can have to the atoms in the other structure. The purple connection marks the longest possible connection from all the possible pairs of atoms. The length of the purple connection is then the result of the Fréchet distance calculation.
The two-step algorithm described by Jiang et al. [31] was used to align all backbone atoms in the protein structures. First, both protein structures have been transposed, such that their geometric center of mass coincides with the origin point at (0, 0, 0). The following two steps were repeated for each MD simulation snapshot. In the first step, the first atom (the N atom of the first residue) and the last atom (the C atom of the last residue) of the three-dimensional protein backbone, together with the origin point, form a triangle. Considering such triangles for the NE and CG structures, one performs rotations in a way that the triangles become coplanar and their heights align, as illustrated in Fig. 3. The coplanar triangles guarantee a sensible initial orientation of the two protein backbones relative to each other for comparing the Fréchet distances. The first and the last residues of the compared structures are always considered when traversing the backbone to compute the distance. Therefore, both residues have a great influence on the resulting distance which is initially minimized by the triangle construction. In the second step of the alignment procedure, one structure (CG, reference) is kept rigid while the other (NE, alignee) is rotated. Two atoms from the alignee structure are chosen at random. The rotation of the alignee structure is then attempted by a small angle around the axis defined by the two atoms. The rotation is performed if the overall Fréchet distance between the reference and the alignee structure decreases and is rejected otherwise. The rotation procedure is repeated for a fixed number of iterations.
Note that the alignment algorithm is not deterministic, and it is possible that a successive rotation would decrease the Fréchet distance between the two structures without reaching the best relative alignment if the random rotation axis was not chosen optimally. The algorithm always follows the decrease in the Fréchet distance at each iteration, and therefore, it might follow a potentially wrong rotation direction. In order to decrease the chance of a non-optimal fit, the Fréchet alignment was repeated ten times from the initial coplanar triangle configuration. Finally, the alignment with the lowest Fréchet distance was chosen for each simulation snapshot.
In this study, the rotation angle was set to 0.0006 degrees, and 500 iterations were performed to find a better alignment of the two studied structures. The rather rotation small angle was assumed to correct the alignment established with the Kabsch algorithm in the previous alignment step. In each rotation, the average maximum displacement of a single residue was just 0.00014Å. Alternative strategies that involved multiple rotation angles as well as a scheme, that would alter the rotation angle depending on the improvements achieved in prior steps were also considered. For the size of neutrophil elastase and cathepsin G, such methods, however, did not show any improvement in the merit of the alignment procedure.

Residues coupling: generating relative displacements
Every amino acid residue can be described through its three backbone atoms for the alignment procedure. However, considering all backbone atoms is computationally demanding and largely unnecessary, as the description of every amino acid residue can be further Fig. 3 Triangle alignment. A The first and the last atom of the two protein structures (CG and NE) are shown for a single simulation snapshot, and the triangle defined by these two atoms and the origin (0, 0, 0) are indicated. The triangles are initially not sharing a plane. B After rotation, the triangles are put in the same plane within the three-dimensional space, and the heights of both triangles are set parallel. This configuration is the mandatory starting configuration for the alignment procedure. The backbone of the NE (blue) and the CG (red) proteins are shown as semi-transparent structures reduced to only the C α atoms. Their central location in each residue makes the C α atoms a suitable anchor point for the description of residue motion.
In contrast to the structural alignment, where all atoms of interest for a single simulation snapshot were considered, now every C α atom for all simulation snapshots is taken into account. To perform comparative measurements, one needs to establish pairs of residues between both structures for which the differences in the position are to be analyzed. These pairs are necessary in order to compare each fragment of the proteins with sufficient precision. Since the two studied structures do not have the same number of residues, the problem can be addressed through the dynamic time warping algorithm [32], which creates a 224 × 240 penalty matrix from the locations of the C α atoms in CG and NE protein structures. The penalties are assigned by the dynamic time warping algorithm explained in greater detail in Appendix A. In the matrix, every row corresponds to an amino acid residue in CG, while every column corresponds to a residue in NE.
Each element in the penalty matrix corresponds to a pair of residues, assigning one residue from each structure. The procedure of finding the suitable residue pairs is therefore equivalent to establishing a route through the matrix, which passes every row and every column at least once. For example, for a square penalty matrix, i.e., describing two protein structures with the same number of residues, the route could follow the matrix' diagonal. Since the studied proteins (NE and CG) have a different number of residues, a more elaborate route through the penalty matrix has to be determined. A fragment of the route for the first nine residues (residue numbers 16 to 24) of the CG and NE structures is shown in Fig. 4D. Assume that the residue 23 in the CG and NE structures should be used for comparison. In the next step, one has to evaluate the penalties that would arise upon computing the structural deviation in the residue pairs (22,23), (23,22), and (22,22), as indicated with the cyan square in Fig. 4A. The rele-vant pair to be used for comparison should naturally have the lowest penalty value associated with it. In the considered example, the penalty turns out to be 0.73 (being smaller than 1.33 and 1.02), which corresponds to the (22,22) residue pair. The next pair for comparison is then evaluated from the (22, 22) element of the penalty matrix, and the procedure repeats until the residue pair (16,16) is reached, which corresponds to the first residues in both protein structures considered in the example. This so-called backtracking algorithm is visualized in Fig. 4A. The choices of residue pairs extracted from the penalty matrix are shown in Fig. 4B, where the backbones of the NE and CG structures are shown in blue and red, respectively. The yellow connections between the C α atoms illustrate the pairs established from the penalty matrix calculation. A single residue can be matched with multiple residues in the other structure. For instance, residue Ile17 in CG is paired with both Val17 and Gly18 in the NE. The purple circle illustrates the pair in Fig. 4 with both pairs linked in yellow. The pairs of atoms that belong to the different proteins can be used to quantitatively compare the dynamic evolution of the two structures.
A description of the dynamic time warping algorithm with further details on how the penalty matrix is populated and how the backtracking algorithm is performed is given in Appendix A.
The established pairs of atoms are influenced by the dynamics of the proteins during the course of the simulation, and the pairs established for one timestep might not be the best for another. Therefore, the coupled pairs between CG and NE were formed on the basis of every timestep of the performed MD simulations. The similarity between both protein structures was then calculated for each coupling, and the cumulative similarity result is achieved as the arithmetic mean computed for all similarities for all determined residue couplings. The repetition of the process where the couplings are obtained from different snapshots is introduced to cir- The penalty values (in nm 2 ) are shown for each entry in the matrix, as well as the corresponding residue pairs (which are described by their Cα atoms) for the NE and CG structures. The pairs of atoms to be coupled for the similarity measurement can be found by backtracking through the penalty matrix, which yields the indicated route (red). The route through the matrix can either go up, left, or diagonally up left, always following the direction with the smallest element as highlighted by the cyan square. Within the square, one starts with the pair (23,23) and continues to the smallest penalty value within the square, which is 0.73, corresponding to the residue pair (22,22). Note, however, that a single residue can be coupled with more than one residue in the other structure, e.g., residue number 17 in CG is coupled with residue numbers 17 and 18 in NE; see the purple circle. B The backbone of NE (blue) and CG (red) is shown for the first residues of the protein structures. The yellow connections show the pairs between the Cα atoms of CG and NE. The atoms between the linked Cα are the nitrogens and carbons that form the backbone of the protein structures. An example of a double paired connection is indicated by the purple circle cumvent the need of choosing a specific reference frame for the calculation of the penalty matrix couplings.

The Kullback-Leibler divergence
After the proteins have been structurally aligned and the to-be-compared pairs have been established, the comparison of the two structures can commence. In general, the Kullback-Leibler divergence (KLD) [25] permits quantifying a difference between two arbitrary normalized distributions. Let us illustrate how the KLD can be used for the studied protein structures. For the sake of coherence, let us consider the residues marked by the purple ellipsoid in Fig. 4. Consider the residue Ile17 CG in CG. For all N = 80,000 MD snapshots, the location vector for the C α atom of this Ile17 CG residue, containing the x, y, and z coordinates, is then considered: The vectors in Eq. (2) permit constructing three onedimensional arrays by grouping all the x, y, and z coor-dinates, respectively: These three lists are then used to form normalized distributions of the x, y, z coordinates. For Ile17 CG , the distribution of its x coordinates would then be denoted as D(Ile17 CG , x). Consider now two residues, e.g., Ile17 CG in CG, and Val17 NE in NE. Based on the distributions introduced above, one defines KLD(Ile17 CG , Val17 NE ) as where D(·, k) i denotes the ith discrete entry in the distribution of the kth coordinate [33].
In general, the KLD is not symmetric, i.e., The measure can, however, be made symmetric by considering the arithmetic mean of KLD computed in two directions. Thus, the similarity S between a Ile17 CG and Val17 NE residues is defined as As a final step, similarity for every residue has to be evaluated according to the number of its partners. For instance, one notices that Ile17 CG is actually paired with both Val17 NE and Gly18 NE . On the other hand, Val17 NE is only forming the single pair to Ile17 CG as illustrated in Fig. 4. The final similarity value for Ile17 CG is therefore calculated aŝ while the similarity for Val17 NE would be defined aŝ In general, if one assumes a residue a in NE to be paired with residues b 1 , . . . , b l in CG, then the similarity for the residue a is defined aŝ The similarity of a residue in CG is calculated analogously by considering all the pairs to residues in NE. Intuitively, one would assume that NE is as different to CG as CG is different to NE and therefore, the overall similarity between the two structures should not depend on whether CG is compared to NE or NE is compared to CG. The symmetry is, however, violated since it is possible to have residues in one protein that couple to several residues in the other protein (see Eq. (6) and (7)). Therefore, similarities in both directions have to be considered for analyses, i.e., CG→NE and NE→CG.
By repeating the KLD procedure for all residue pairs, their similarity can be calculated and analyzed. In summary, the KLD quantifies the difference between two residues in the two studied proteins over the time span of the whole simulation trajectory and yield a single number. The smaller the resulting number is, the more similar the two residues are expected to be.

Radius of gyration and comparison with small-angle neutron scattering data
The radius of gyration (R g ) values were calculated for the two simulated NE and CG structures, and their respective time evolution is visualized in Fig. 5. The average radius of gyration is 17.1 ± 0.1Å for CG and 16.9 ± 0.1Å for NE. The values fluctuate narrowly around the average, which indicates stable structures. A difference of 0.2Å is considered neglectable in terms of R g [34], once more demonstrating the overall similarity in structure between the NE and CG proteins. Additionally, the stability of the protein structures was assessed based on analysis of the temporal evolution of the RMSD for the proteins' backbone as well as the potential energy of each protein structure without the solvent contribution. The small fluctuation in the RMSD as well as the normal distributed energies indicate equilibrated and stable structures. The RMSD and energy data are shown in Appendix Fig. 10.
The small-angle neutron scattering (SANS) data were used to compare the shapes of NE and CG. The variation of the absolute scattering intensity I with the momentum transfer rate q for the samples in a 10 mg/ml solution is shown in Fig. 6.
Differences between NE and CG are observed primarily for q ≤ 0.02Å −1 . This region of the scattering curve is sensitive to internal molecular effects, describing protein-protein interactions. Three factors were considered to be important for the proteins' affinity to form oligomers. NE has a molecular charge of 4 | e − |, while the total electrostatic charge of CG is 23 | e − |; both are positively charged, and therefore, repulsive interaction is expected between several proteins. Both Fig. 6 Experimental one-dimensional neutron scattering of NE (blue) and CG (red) versus the momentum transfer q plotted in logarithmic scale. The data are separated into two parts, as indicated by the vertical dashed line. For q ≥ 0.02Å −1 CG is fitted using the software SasView [35] with a model described by two lognormal distributions in radius (R) and length (L). For q ≤ 0.02Å −1 the observed behavior is consistent with the formation of larger aggregates that can be interpreted as protein clamping (aggregation), artistically illustrated on the left, which can be attributed to the difference in charge and hydrophobic binding affinity of NE and CG. On the right, a single NE (blue) and CG (red) structure is superimposed, visualizing the similarity observed in the q ≥ 0.02Å −1 regime proteins, however, also possess a strong dipole moment, which was computed as the average value over the entire MD simulation for both cases; CG exhibits a higher dipole moment value (305 Debye) than NE (238 Debye). Lastly, the solvent-accessible surface area (SASA) was calculated for all hydrophobic residues in both proteins and averaged over the course of the MD simulations. NE exhibits a greater average SASA value of 2990Å 2 , while CG has a value of 2128Å 2 . NE might therefore bind more favorably than CG through hydrophobic interactions. Additionally, the electrostatic potential was mapped onto the protein structures as visualized in Appendix Fig. 11. The figure shows that the backbones of the two proteins are folded remarkably similar, while the electrostatic potential around the proteins is clearly different. A notable difference can be spotted near the active triad, where NE exhibits a more pronounced negative potential compared to CG. This conclusion is consistent with the observation of the lower charge of NE as compared to CG. Finally, the existence of a dipole moment in both proteins supports the rationale further. All put together one may conclude that the differences in I(q) for q < 0.02Å −1 in Fig. 6 arise due to a higher oligomerization rate of NE proteins as compared to CG. Furthermore, applying the power law, I(q) ∼ = (A * q −d +I 0 ), to model the low-q region, yielded a d-value of d CG = 3.3±0.2 and d NE = 1.2±0.1 for the CG and NE proteins, respectively, which supports the hypothesis that aggregation occurs differently in both samples. Figure 6 suggests that NE and CG are similar for q ≥ 0.02Å −1 , therefore permitting us to discuss the structural complexity of CG only. The CG crystal struc-ture was previously described as an oblate ellipsoid by Hof et al. [10]. The model permitted to fit the scattering curve I(q) using a cylinder model with two lognormal distributions in radius (R) and length (L). For the data in Fig. 6, an average R = 14.03 ± 0.10Å and average length L = 58.90 ± 1.50Å were obtained, yielding a radius of gyration equal to [36] The fitting was carried out employing the software SasView [35]. From the SANS data in Fig. 6, the same value is expected for the NE structure as I(q) match for q > 0.02Å −1 .
Comparing the radii of gyration obtained from the MD simulations (17.1 ± 0.1Å for CG, 16.9 ± 0.1Å for NE) and the SANS data, a reasonable agreement can be observed. Considering the size of the protein structures and their classification as β-proteins based on the SCOP (Structural Classification of Proteins) classification [37,38], the R g value is expected to be between 16.7Å and 20.0Å [34]. The radii obtained from the SANS experiment and the MD simulation are within this range. Figure 7 shows the result of the Kullback-Leibler divergence analysis for all residues with their respective partners. The differences comparing CG to NE (Panel A) and NE to CG (Panel B) are shown. Additionally, a visualization of the values shown in Fig. 7 mapped onto the protein structures is supplied in Fig. 8.

Dissimilarities between NE and CG
As indicated by the gray bars in Fig. 7, the large variance observed for the three loops decisive for individual specificity of the NE and CG enzymes highlights the clear relation between structure and function in the NE and CG proteins. However, no large variance (greater than the 0.9 quantile) was observed in the region of the three key residues of the active site, namely His57, Asp102, and Ser195, as indicated by the labeled vertical black lines. On the other hand, while both His57 and Asp102 are located in a trough, i.e., residues at which the two structures are more similar, the similarity of CG and NE at the Ser195 residue decreases (as seen through a corresponding peak in Fig. 7). In the active state, Ser195 is expected to undergo a slight conformational change to stabilize the catalytic process [39,40]. As NE and CG were simulated inactively, the slight difference between the position of Ser195 could explain the difference in the functionality of the proteins and might therefore facilitate or hinder the nucleophilic attack that initiates the catalysis process.
Lastly, Fig. 7 shows a dashed line for CG (red) and NE (blue), which marks the Kullback-Leibler divergence for CG and NE for the crystal structure only. If one were to see differences in the crystal structure, it could indicate a predisposition to a conformational change. However, the differences between the crystal Fig. 7 Kullback-Leibler divergence computed between CG and NE. A peak in the graph is a point where the similarity exceeds a value of 0.9 quantile indicated by the horizontal black dashed line. The black vertical lines correspond to the three active-site residues for CG and NE, respectively. The gray bars indicate the residues of the three loops that are decisive for the protein's function [14]. According to the respective Panel, the solid lines (red and blue) show the average similarity and its standard deviation which is mapped onto the protein structures in Fig. 8. The dashed colored line (red and blue) indicates the difference in the original crystal structures, which shows the residues in which the initial configuration already shows minor differences. As these differences are tiny, they were scaled by a factor of 100 for visibility. The lower panels show the root mean square fluctuation (RMSF) of individual residues computed over the individual simulation trajectories, indicating the intrinsically flexible areas. The protein renderings show the respective protein structures color-coded according to the residue number shown on the color scales. The areas highlighted in gray in the protein structures correspond to the three loops colored gray in the plots structures are not only visually small but also based on the similarity measure. Only by scaling the difference by a factor of 100, they become detectable in Fig. 7.

Conclusion
Neutrophil serine proteases are essential for the function of the neutrophils, and although they are relatively well-characterized, several important unanswered questions remain. In particular, even though the role of NE and CG in antibacterial defense is relatively wellestablished, their extended specificities have not yet been fully determined.
One main question comes from the fact that although X-ray crystallography reveals that the structures of the inhibited proteins are very similar, they have different biophysical specificity; only NE specifically destroys Shigella virulence factors [8]. Given that the NE and CG molecules are flexible and soft, small changes within the molecular structure as well as slight differences in electrostatic potentials, dipole moments and charges may play a significant role for their functionality. The similarity in the crystal structures of NE and CG could stem from the introduction of inhibitors, which are necessary for a successful determination of the structure; the specific inhibitors for NE and CG mimic the peptide substrates and block the enzymatic activity of NE and CG. To test the hypothesis that the conformations of the two studied proteins, while not inhibited and in solution, are in reality different, an experiment was performed comparing NE and CG using small-angle neutron scattering (SANS). Due to the natural aggregation of the NE system, the information that can be extracted from the SANS data becomes difficult to be evaluated. To overcome such difficulties, computational methodologies were used to relate the structure and function of NE and CG. The basic idea is that subtle conformational changes might not be easily detected through (even advanced) experimental approaches. To achieve this goal, we have developed a matching method that uses a dynamic time warping algorithm enabling a pairwise evaluation of the similarity of the protein dynamics trajectory based on the symmetrized mean Kullback-Leibler divergence. Furthermore, it was demonstrated that the experimentally determined radius of gyration, R g , obtained for CG is close to the value calculated from the MD simulations, implying that the value calculated for NE has also been established reliably and that the NE and CG structures are expected to be similar in solution. The MD simulation allowed the identification of subtle changes in the dynamics of the two proteins. These subtle differences between NE and CG in the solvent could be responsible for the difference in proteins' specificity. Additionally, it has been established that the electrostatic nature of the active sites varies significantly between NE and CG, facilitating the difference in their catalytic behavior [14]. In conclusion, we argue that elaborate analyses based on MD simulations provide a detailed molecular understanding of the two studied protein systems and that not only electrostatic, but also conformational differences can be established, even if the crystal structures are very similar. Of course, the MD analyses have to be backed by experimental data, e.g., SANS data, and vice versa. The MD studies can suggest, for instance, possible mutations to be tested to avoid unnecessary costly experiments. berg Professorship to IAS), the DFG, German Research Foundation, (GRK1885-Molecular Basis of Sensory Biology and SFB 1372-Magnetoreception and Navigation in Vertebrates), and the Ministry for science and culture of Lower Saxony (Simulations meet experiments on the nanoscale: Opening up the quantum world to artificial intelligence (SMART)). Computational resources for the simulations were provided by the CARL Cluster at the Carlvon-Ossietzky University Oldenburg, which is supported by the DFG and the ministry for science and culture of Lower Saxony. The work was also supported by the North-German Supercomputing Alliance (HLRN). We thank the Helmholtz-Zentrum Berlin für Materialien und Energie (former Hahn-Meitner Institut) for the allocation of neutron beamtime.

Author contributions
FS contributed to conceptualization, methodology, software, formal analysis, writing-original draft, visualiza-tion. XT contributed to investigation, writing-review & editing, visualization. LG contributed to validation, writing-review & editing. HNB contributed to conceptualization, investigation, resources, data curation, writing-original draft, supervision. IAS contributed to resources, data curation, writing-review & editing, supervision, funding acquisition.
Funding Open Access funding enabled and organized by Projekt DEAL.

Data Availability Statement
This manuscript has no associated data or the data will not be deposited. [Authors' comment: All data generated during this study are contained in this published article.]

Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Graphical Software Protein renderings were done in VMD, and plots were generated employing the Matplotlib package in Python and Origin. Postproduction was done in Adobe Photoshop.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.

Appendix A. Dynamic time warping
The Dynamic Time Warping algorithm was originally designed to handle speech recognition problems, as the similarity of two phrases needed to be analyzed without obscuring the order [32]. Here, it was generalized and applied to study the dynamics of proteins.
Assume two ordered sequences of coordinate points (i.e., atomistic locations) with n and m elements, respectively. The dynamic time warping algorithm between two ordered sequences A = {a 1 , . . . , a n } and B = {b 1 , . . . , b m } creates an n × m penalty matrix P in three steps; here, P (i, j) denotes the entry in the ith row and the jth column of P .
First, one sets P (1, 1) = a 1 − b 1 2 with · being the Euclidean norm. As a second, the first row and first column are traversed, such that respectively. Having computed the elements of the first row and the first column of the matrix P , the remaining entries are determined by setting In other words, the dynamic time warping algorithm is designed to calculate a penalty value that depends on the distance between the points that build the corresponding pair, built from an element in a ∈ A and b ∈ B. The penalty for the pair (a, b) is calculated while considering the penalties of previous steps that needed to be traversed in the ordered lists to reach the pair (a, b). The result for the dynamic time warping distance between the A and B sequences of coordinates corresponds to the P (n, m) element in the penalty matrix. A smaller value of P (n, m) indicates a higher similarity of the two ordered A and B sequences. The primary interest in the case of the studied proteins is not the dynamic time warping distance but the indices of the paired residues (couplings) generated in the process, which can be backtracked considering the whole penalty matrix P . A protein-specific introduction of the backtracking is visualized in Fig. 4. A route through the penalty matrix is defined as a list of tuples containing the row and column number of an element in the penalty matrix. The route must include the tuples (n, m) as well as (1,1). Additionally, each row number and each column number must appear at least once within the tuples comprising the route.
The route is constructed iteratively by consecutively adding the next steps. Starting at P (n, m), the tuple (n, m) is added to the route. From there, one considers the elements in the penalty matrix P (n − 1, m), P (n, m − 1), and P (n − 1, m − 1), choosing the element with the smallest value and adding the corresponding tuple to the route. The procedure is repeated from the new element of the penalty matrix analogously. For a general P (i, j) with i ≤ n, j ≤ m, one sequentially selects one of the P (i−1, j), P (i, j −1), or P (i−1, j −1) elements at every step of the algorithm until the P (1, 1) element is reached and the tuple (1, 1) is added to the route. This backtracking procedure, also visualized as an example in Fig. 4, yields the route through the matrix.

Appendix B. Sequence alignment of NE and CG
See Fig. 9.  Electrostatic differences in CG and NE. Panels A and B show the CG and NE protein with their electrostatic potential mapped onto the surface of the two protein structures. The color scale ranges from the smallest potential (red) to the largest potential (blue). The active triad is shown as a reference to the orientation of the protein structures with His57 in cyan, Asp102 in black, and Ser195 in yellow. Panels C and D show the cartoon representation of the protein structures (CG in red, NE in blue) in the same orientation as the surface mappings to visualize the alignment of the structures in all four panels. Noticeably, close to the active triad, a difference can be detected. NE features a higher negative potential (Panel B)exhibits a higher negative charge. Overall, the intensity of charges is greater in NE compared to CG