Consequences of Genetic Recombination on Protein Folding Stability

Genetic recombination is a common evolutionary mechanism that produces molecular diversity. However, its consequences for protein folding stability have received less attention than those of point mutations. Here, we studied the effects of homologous recombination on the computationally predicted folding stability of several protein families and found them less detrimental than we had expected. Although recombination can affect multiple protein sites, the fraction of recombined proteins eliminated by negative selection because of insufficient stability is not significantly larger than the corresponding fraction of proteins produced by mutation events. Indeed, although recombination disrupts epistatic interactions, the mean stability of recombinant proteins is not lower than that of their parents. On the other hand, the difference in stability between recombined proteins is amplified with respect to that between the parents, promoting phenotypic diversity. As a result, at least one third of recombined proteins have a stability between those of their parents, and a substantial fraction have a stability higher or lower than those of both parents. As expected, parents with similar sequences tend to produce recombined proteins with stability close to that of the parents. Finally, simulating protein evolution along the ancestral recombination graph with empirical substitution models commonly used in phylogenetics, which ignore constraints on protein folding stability, showed that recombination favors a decrease in folding stability. This supports adopting structurally constrained models, when possible, for inferring protein evolutionary histories with recombination.

Supplementary Information: The online version contains supplementary material available at 10.1007/s00239-022-10080-2.
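The kind of recombination event summarized above can be pictured with a minimal Python sketch. It assumes a single breakpoint between two aligned parental sequences of equal length, and it assumes that a more negative ∆G means a more stable protein. The function names and the `cut` parameter are illustrative and are not taken from the study; predicting ∆G itself is outside the scope of the sketch.

```python
from typing import Tuple


def recombine(parent_a: str, parent_b: str, cut: int) -> Tuple[str, str]:
    """Single-breakpoint crossover between two aligned, equal-length sequences.

    `cut` is the number of residues taken from the first parent (hypothetical
    parameter name). Both reciprocal descendants are returned.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    if not 0 < cut < len(parent_a):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


def classify_descendant(dg_child: float, dg_parent_a: float, dg_parent_b: float) -> str:
    """Place a descendant's dG relative to both parents.

    Assumption: more negative dG means more stable.
    """
    most_stable, least_stable = sorted((dg_parent_a, dg_parent_b))
    if dg_child < most_stable:
        return "more stable than both parents"
    if dg_child > least_stable:
        return "less stable than both parents"
    return "between parents"
```

Given predicted ∆G values for the two parents and the two descendants, `classify_descendant` assigns each descendant to one of the three outcomes discussed in the abstract.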

Figure S1. Folding free energy variation caused by recombination events at every breakpoint position in the protein family DDL. For every breakpoint position, the figure shows boxplots with the variation of free energy caused by recombination (difference between the folding free energy of parental and descendant proteins). Note that the site boxplot distributions overlap, indicating an overall lack of statistical differences. However, breakpoints located at the extreme regions of the sequences tended to show smaller effects of recombination on the folding free energy.

Figure S2. Folding free energy variation caused by recombination events at every breakpoint position in the protein family DNAK. For every breakpoint position, the figure shows boxplots with the variation of free energy caused by recombination (difference between the folding free energy of parental and descendant proteins). Note that the site boxplot distributions overlap, indicating an overall lack of statistical differences. However, breakpoints located at the extreme regions of the sequences tended to show smaller effects of recombination on the folding free energy.

Figure S3. Folding free energy variation caused by recombination events at every breakpoint position in the protein family TPIS. For every breakpoint position, the figure shows boxplots with the variation of free energy caused by recombination (difference between the folding free energy of parental and descendant proteins). Note that the site boxplot distributions overlap, indicating an overall lack of statistical differences. However, breakpoints located at the extreme regions of the sequences tended to show smaller effects of recombination on the folding free energy.

Figure S4. Folding free energy variation caused by recombination events at every breakpoint position in the protein family TRPA.
For every breakpoint position, the figure shows boxplots with the variation of free energy caused by recombination (difference between the folding free energy of parental and descendant proteins). Note that the site boxplot distributions overlap, indicating an overall lack of statistical differences. However, breakpoints located at the extreme regions of the sequences tended to show smaller effects of recombination on the folding free energy.

Figure S5. Folding free energy variation caused by recombination events at every breakpoint position in the protein family TRXB. For every breakpoint position, the figure shows boxplots with the variation of free energy caused by recombination (difference between the folding free energy of parental and descendant proteins). Note that the site boxplot distributions overlap, indicating an overall lack of statistical differences. However, breakpoints located at the extreme regions of the sequences tended to show smaller effects of recombination on the folding free energy.

Figure S6. Folding free energy of recombined protein sequences as a function of folding free energy of parental protein sequences for the protein family DDL. The plots show the mean folding free energy of the parent protein sequences (y axis) as a function of the mean folding free energy of the recombined protein sequences (x axis) for a total of 1,000 recombination events. The upper plot considers recombination events at every breakpoint position (mean); correlation coefficient = 0.989, p value < 2.2e-16. The lower plot considers recombination events with the breakpoint located in the middle of the sequences (correlation coefficient = 0.984, p value < 2.2e-16).

Figure S7. Folding free energy of recombined protein sequences as a function of folding free energy of parental protein sequences for the protein family DNAK.
The plots show the mean folding free energy of the parent protein sequences (y axis) as a function of the mean folding free energy of the recombined protein sequences (x axis) for a total of 1,000 recombination events. The upper plot considers recombination events at every breakpoint position (mean); correlation coefficient = 0.988, p value < 2.2e-16. The lower plot considers recombination events with the breakpoint located in the middle of the sequences (correlation coefficient = 0.954, p value < 2.2e-16).

Figure S8. Folding free energy of recombined protein sequences as a function of folding free energy of parental protein sequences for the protein family TPIS. The plots show the mean folding free energy of the parent protein sequences (y axis) as a function of the mean folding free energy of the recombined protein sequences (x axis) for a total of 1,000 recombination events. The upper plot considers recombination events at every breakpoint position (mean); correlation coefficient = 0.977, p value < 2.2e-16. The lower plot considers recombination events with the breakpoint located in the middle of the sequences (correlation coefficient = 0.908, p value < 2.2e-16).

Figure S9. Folding free energy of recombined protein sequences as a function of folding free energy of parental protein sequences for the protein family TRPA. The plots show the mean folding free energy of the parent protein sequences (y axis) as a function of the mean folding free energy of the recombined protein sequences (x axis) for a total of 1,000 recombination events. The upper plot considers recombination events at every breakpoint position (mean); correlation coefficient = 0.985, p value < 2.2e-16. The lower plot considers recombination events with the breakpoint located in the middle of the sequences (correlation coefficient = 0.986, p value < 2.2e-16).

Figure S10. Folding free energy of recombined protein sequences as a function of folding free energy of parental protein sequences for the protein family TRXB. The plots show the mean folding free energy of the parent protein sequences (y axis) as a function of the mean folding free energy of the recombined protein sequences (x axis) for a total of 1,000 recombination events. The upper plot considers recombination events at every breakpoint position (mean); correlation coefficient = 0.979, p value < 2.2e-16. The lower plot considers recombination events with the breakpoint located in the middle of the sequences (correlation coefficient = 0.965, p value < 2.2e-16).
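The relationship summarized in Figures S6 to S10 can be illustrated with a short Python sketch that averages ∆G over the two parents and over the two descendants of each event and then correlates the two series. The captions do not state which correlation coefficient was used, so the Pearson coefficient below is an assumption, and the `Event` layout and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One recombination event: (dG parent 1, dG parent 2, dG descendant 1, dG descendant 2).
# This tuple layout is an illustrative assumption, not the study's data format.
Event = Tuple[float, float, float, float]


def pair_means(events: Sequence[Event]) -> Tuple[List[float], List[float]]:
    """Mean parental dG and mean descendant dG for every event."""
    parents = [(p1 + p2) / 2 for p1, p2, _, _ in events]
    descendants = [(d1 + d2) / 2 for _, _, d1, d2 in events]
    return parents, descendants


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

Calling `pearson_r(*pair_means(events))` on a list of events gives a single coefficient comparable in kind to those reported in the captions; the p value would require an additional test that is not sketched here.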

Figure S11. Variation of folding free energy between descendant proteins as a function of the variation of folding free energy between parental proteins for the protein family DDL. Every point refers to a recombination event. Left: Boxplots for intervals of folding free energy variation between the parental proteins. For every interval, the number of recombination events N falling in the interval is included, together with its fraction of the total number of recombination events across all intervals (in parentheses). Results for recombination breakpoints at all positions and at only the middle position are shown above and below, respectively.
Figure S12. Variation of folding free energy between descendant proteins as a function of the variation of folding free energy between parental proteins for the protein family DNAK. Every point refers to a recombination event. Left: Boxplots for intervals of folding free energy variation between the parental proteins. For every interval, the number of recombination events N falling in the interval is included, together with its fraction of the total number of recombination events across all intervals (in parentheses). Results for recombination breakpoints at all positions and at only the middle position are shown above and below, respectively.

Figure S13. Variation of folding free energy between descendant proteins as a function of the variation of folding free energy between parental proteins for the protein family TPIS. Every point refers to a recombination event. Left: Boxplots for intervals of folding free energy variation between the parental proteins. For every interval, the number of recombination events N falling in the interval is included, together with its fraction of the total number of recombination events across all intervals (in parentheses). Results for recombination breakpoints at all positions and at only the middle position are shown above and below, respectively.

Figure S14. Variation of folding free energy between descendant proteins as a function of the variation of folding free energy between parental proteins for the protein family TRPA. Every point refers to a recombination event. Left: Boxplots for intervals of folding free energy variation between the parental proteins. For every interval, the number of recombination events N falling in the interval is included, together with its fraction of the total number of recombination events across all intervals (in parentheses). Results for recombination breakpoints at all positions and at only the middle position are shown above and below, respectively.
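The quantities plotted in Figures S11 to S15 can be sketched in Python as the absolute ∆G difference between the two parents and between the two descendants of each event, with events then grouped into intervals of the parental difference. The interval width of 1.0, the tuple layout, and the function names are assumptions made for illustration; the captions do not give the interval boundaries or the exact definition of the variation.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

# (dG parent 1, dG parent 2, dG descendant 1, dG descendant 2); hypothetical layout.
Event = Tuple[float, float, float, float]


def stability_gaps(event: Event) -> Tuple[float, float]:
    """Absolute dG difference between the parents and between the descendants."""
    p1, p2, d1, d2 = event
    return abs(p1 - p2), abs(d1 - d2)


def bin_by_parental_gap(
    events: Sequence[Event], width: float = 1.0
) -> Dict[Tuple[float, float], Tuple[int, float]]:
    """Count events per interval of parental gap.

    Returns {(lower, upper): (N, fraction of all events)}. The interval
    width is an illustrative choice, not a value from the study.
    """
    counts: Counter = Counter()
    for event in events:
        parental_gap, _ = stability_gaps(event)
        counts[int(parental_gap // width)] += 1
    total = len(events)
    return {
        (index * width, (index + 1) * width): (n, n / total)
        for index, n in sorted(counts.items())
    }
```

Each interval carries the count N and its fraction of all events, which mirrors the annotation described in the captions.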
Figure S15. Variation of folding free energy between descendant proteins as a function of the variation of folding free energy between parental proteins for the protein family TRXB. Every point refers to a recombination event. Left: Boxplots for intervals of folding free energy variation between the parental proteins. For every interval, the number of recombination events N falling in the interval is included, together with its fraction of the total number of recombination events across all intervals (in parentheses). Results for recombination breakpoints at all positions and at only the middle position are shown above and below, respectively.

Figure S16. Acceptance rates of mutated and recombined sequences (breakpoints only located in the middle of sequences) in several protein families. The acceptance of a mutation or recombination event was defined as meeting ∆Gs ≤ t∆Gr, where ∆Gs is the folding stability of the tested protein (i.e., generated by a mutation or recombination event), ∆Gr is the folding stability of the real protein (Table 1) and t is a user-specified threshold. In this figure, the threshold is 0.95. The figure shows the acceptance rates of mutated sequences and recombined sequences, as well as the rates of recombination events in which only one or both recombined sequences are accepted. Error bars correspond to the standard error of the mean of the respective mutation or recombination events. Results for the same analysis but focused on recombination events with breakpoints occurring at all positions are shown in Figure 2.

Figure S17. Evaluation of the protein folding stability caused by recombination and mutation events in several protein families.
For every studied protein family, the figure shows the rate of accepted mutation events that increase the predicted protein stability, the rate of accepted recombination events producing both descendant (recombined) proteins more or less stable than both parental proteins, and the rate of recombination events producing one descendant protein more or less stable than both parental proteins. Results were obtained with a threshold of 0.95 to accept mutation and recombination events. This evaluation considered recombination events with breakpoints located at all protein sites. Error bars indicate the standard error of the mean of the corresponding mutation and recombination events. Results for the same analysis but focused on recombination events with breakpoints occurring only in the middle position of sequences are shown in Figure S18.

Figure S18. Evaluation of the protein folding stability caused by recombination (breakpoints only located in the middle of sequences) and mutation events in several protein families. For every studied protein family, the figure shows the rate of accepted mutation events that increase the predicted protein stability, the rate of accepted recombination events producing both descendant (recombined) proteins more or less stable than both parental proteins, and the rate of recombination events producing one descendant protein more or less stable than both parental proteins. Results were obtained with a threshold of 0.95 to accept mutation and recombination events. This evaluation considered recombination events with breakpoints located only in the middle of the sequences. Error bars indicate the standard error of the mean of the corresponding mutation and recombination events. Results for the same analysis but focused on recombination events with breakpoints occurring at all positions are shown in Figure S17.

Figure S19. Rates of accepted mutated and recombined sequences that are more or less stable than their parent sequences for different protein families. The figure shows the rate of mutated sequences more stable than their parent sequences and the rates of recombined (descendant) sequences that are more or less stable than both or one of the parental sequences. Results were obtained with a threshold of 0.95 to accept mutation and recombination events. Error bars indicate the standard error of the mean of the corresponding mutation and recombination events. Above: Recombination breakpoints located at all positions. Below: Recombination breakpoints located only in the middle of sequences.

Figure S20. Rates of accepted mutated and recombined sequences (breakpoints only located in the middle of sequences) that are more or less stable than their parent sequences at diverse selection levels. The figure shows the rate of mutated sequences more stable than their parent sequences and the rates of recombined (descendant) sequences that are more or less stable than both or one of the parental sequences. Results are based on simulations of the DDL protein family. Error bars indicate the standard error of the mean of the corresponding mutation and recombination events. This evaluation considers recombination events with breakpoints only located in the middle of sequences. Results for the same analysis but focused on recombination events with breakpoints occurring at all positions are shown in Figure 4.

Figure S21. Influence of sequence identity between parental sequences on the folding free energy variation caused by recombination in the protein family DNAK. The figure shows the folding free energy variation produced by recombination (∆∆G) between recombinant (parental) and recombined (descendant) sequences.
Negative values mean that the two sequences before recombination are, on average, more stable than the two sequences after recombination, and positive values mean the opposite. Values are shown as a function of the sequence identity between the parental sequences (intervals shown on the right). Results are based on a selection threshold of 0.95. The upper plots refer to recombination events occurring at all breakpoint positions (mean) and the lower plots refer to recombination events with the breakpoint located only in the middle of the sequences.

Figure S22. Influence of sequence identity between parental sequences on the folding free energy variation caused by recombination in the protein family TPIS. The figure shows the folding free energy variation produced by recombination (∆∆G) between recombinant (parental) and recombined (descendant) sequences. Negative values mean that the two sequences before recombination are, on average, more stable than the two sequences after recombination, and positive values mean the opposite. Values are shown as a function of the sequence identity between the parental sequences (intervals shown on the right). Results are based on a selection threshold of 0.95. The upper plots refer to recombination events occurring at all breakpoint positions (mean) and the lower plots refer to recombination events with the breakpoint located only in the middle of the sequences.

Figure S23. Influence of sequence identity between parental sequences on the folding free energy variation caused by recombination in the protein family TRPA. The figure shows the folding free energy variation produced by recombination (∆∆G) between recombinant (parental) and recombined (descendant) sequences. Negative values mean that the two sequences before recombination are, on average, more stable than the two sequences after recombination, and positive values mean the opposite. Values are shown as a function of the sequence identity between the parental sequences (intervals shown on the right).
Results are based on a selection threshold of 0.95. The upper plots refer to recombination events occurring at all breakpoint positions (mean) and the lower plots refer to recombination events with the breakpoint located only in the middle of the sequences.

Figure S24. Influence of sequence identity between parental sequences on the folding free energy variation caused by recombination in the protein family TRXB. The figure shows the folding free energy variation produced by recombination (∆∆G) between recombinant (parental) and recombined (descendant) sequences. Negative values mean that the two sequences before recombination are, on average, more stable than the two sequences after recombination, and positive values mean the opposite. Values are shown as a function of the sequence identity between the parental sequences (intervals shown on the right). Results are based on a selection threshold of 0.95. The upper plots refer to recombination events occurring at all breakpoint positions (mean) and the lower plots refer to recombination events with the breakpoint located only in the middle of the sequences.

Figure S25. Folding free energy of parent and descendant protein sequences involved in real recombination events. Folding free energy (∆G) of protein sequences involved in real recombination events detected in 4 illustrative datasets from viruses (Table S1). Each column refers to a recombination event (note that Dataset 1 presented 3 recombination events) with the detected breakpoints shown on the x-axis. The folding stability of the PDB representative for the studied proteins of every dataset is shown with a dashed line.

Figure S26. Influence of sequence identity between parental sequences on the folding free energy variation caused by illustrative real recombination events.
The figure shows the folding free energy variation produced by recombination (∆∆G) between recombinant (parental) and recombined (descendant) sequences in real recombination events detected in 4 illustrative datasets from viruses (Table S1).

Figure S27. Folding free energy of protein sequences simulated upon a phylogenetic tree under empirical and structurally constrained substitution models for diverse protein families. Folding free energy (∆G) of protein sequences simulated without recombination under the best-fitting empirical substitution model (Table 1) (squares) and the structurally constrained substitution (SCS) model (circles) at different times (internal and tip nodes of the phylogenetic tree). The root (time to root = 0) corresponds to the extant PDB protein structure chosen as a representative structure of the protein family. Error bars correspond to the 95% confidence interval (CI) of the mean from 100 computer simulations. Compared with the ∆G of the extant PDB protein structure, the empirical substitution model generates unrealistically unstable protein sequences. Results for the protein family DDL are shown above, while results for the other protein families (TPIS, DNAK, TRPA and TRXB) are shown below.

Figure S28. Folding free energy of TPIS, DNAK, TRPA and TRXB proteins simulated upon coalescent trees with diverse combinations of population substitution and recombination rates. Folding free energy (∆G) of proteins simulated upon coalescent trees previously simulated under a variety of combinations of population substitution rate (θ) and population recombination rate (ρ), with the protein sequences evolving under the best-fitting empirical substitution model (Table 1). The dashed line corresponds to the ∆G of the extant protein structure of the protein family (Table 1). Error bars represent the 95% confidence interval of the mean of 100 computer simulations. Results for the protein family DDL are presented in Figure 6.

Figure S29. Variation of protein folding stability between parental (recombinant) and descendant (recombined) proteins as a function of the sequence identity between the parental proteins for the DDL protein family. Every point represents a recombination event that was simulated under a particular combination of substitution and recombination rates.

Figure S30. Variation of protein folding stability between parental (recombinant) and descendant (recombined) proteins as a function of the sequence identity between the parental proteins for the DNAK protein family. Every point represents a recombination event that was simulated under a particular combination of substitution and recombination rates.

Figure S31. Variation of protein folding stability between parental (recombinant) and descendant (recombined) proteins as a function of the sequence identity between the parental proteins for the TPIS protein family. Every point represents a recombination event that was simulated under a particular combination of substitution and recombination rates.

Figure S32. Variation of protein folding stability between parental (recombinant) and descendant (recombined) proteins as a function of the sequence identity between the parental proteins for the TRPA protein family. Every point represents a recombination event that was simulated under a particular combination of substitution and recombination rates.

Figure S33. Variation of protein folding stability between parental (recombinant) and descendant (recombined) proteins as a function of the sequence identity between the parental proteins for the TRXB protein family. Every point represents a recombination event that was simulated under a particular combination of substitution and recombination rates.

Table S1. Illustrative examples of real data and results from their recombination analyses.
For each dataset, the table shows the protein and organism, the PopSet code, the representative protein (PDB structure), the sequence length (number of amino acids), the sample size (number of sequences), the sequence identity at the amino acid and nucleotide levels, the recombination tests indicating the presence of a recombination event (including the P value), the GenBank codes of the sequences that recombined (parents), the sequence identity at the amino acid and nucleotide levels between the parent sequences, the recombination breakpoints, and the reference of the study dataset. Note that the first dataset presented 3 recombination events. (Jazayeri et al., 2004)
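The acceptance rule in the caption of Figure S16 and the ∆∆G sign convention in the captions of Figures S21 to S26 can be made concrete with a small Python sketch. It rests on two assumptions that the captions do not spell out: that a more negative ∆G means a more stable protein, so a tested protein is accepted when ∆Gs is at most t∆Gr, and that ∆∆G is the mean parental ∆G minus the mean descendant ∆G, which yields negative values when the parents are more stable on average. The function names are hypothetical, and the identity calculation simply compares aligned positions without any special treatment of gaps.

```python
from typing import Sequence


def is_accepted(dg_tested: float, dg_reference: float, threshold: float = 0.95) -> bool:
    """Acceptance test, assuming the criterion dG_s <= t * dG_r.

    Assumption: dG is negative for stable proteins, so the tested protein
    must retain at least a fraction `threshold` of the reference stability.
    """
    return dg_tested <= threshold * dg_reference


def recombination_ddg(dg_parents: Sequence[float], dg_descendants: Sequence[float]) -> float:
    """Mean parental dG minus mean descendant dG (assumed definition of ddG).

    Negative when the parents are, on average, more stable than the descendants.
    """
    mean_parents = sum(dg_parents) / len(dg_parents)
    mean_descendants = sum(dg_descendants) / len(dg_descendants)
    return mean_parents - mean_descendants


def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)
```

With a reference ∆G of -10.0 and the threshold of 0.95 used in the captions, a tested protein would need a ∆G of -9.5 or lower to be accepted under these assumptions.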