Substrate specificity of 2-deoxy-D-ribose 5-phosphate aldolase (DERA) assessed by different protein engineering and machine learning methods

Abstract
In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in its close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied in the third mutagenesis round to predict new beneficial mutant combinations. This led to the clearest (two- to threefold) improvement in acetaldehyde (C2) addition capability, with concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as its non-phosphorylated equivalent (C3). The Ec DERA variants were also tested in the aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the variants improved in the acetaldehyde addition reaction also turned out to have improved activity on formaldehyde.

Key points
• DERA aldolases are promiscuous enzymes.
• The synthetic utility of DERA aldolase was improved by protein engineering approaches.
• Machine learning methods aid the protein engineering of DERA.

Supplementary Information The online version of this article (10.1007/s00253-020-10960-x) contains supplementary material, which is available to authorized users.


SUPPLEMENTARY DATA
Text S1 Detailed description of the development of machine learning (ML) models for DERA mutant screening.

Fig S1
Setting up an LC assay for measuring the sequential acetaldehyde addition reaction by DERA.
UPLC chromatogram was monitored at 217 nm for the product formation. Samples at time points 0 and 20 h containing either Ec DERA wild-type or BSA (as a negative control) are shown. This method was used to measure the acetaldehyde addition reaction of all the variants in this work.

Fig S4
The trained substitution matrix weights against the three substrate specificities of DERA variants (columns DRP, DR, C2); a higher value indicates more predictive information towards the specificity. Only 11 substitution models (rows) had non-zero weights. For DR specificity prediction, amino acid interactions were the most important feature, while contact energies were the most informative feature for acetaldehyde specificity prediction.

Fig S5
The 5-fold cross-validation test performance of the ML model on the characterisation data from the first two mutagenesis rounds (altogether 131 Ec DERA mutants). The method achieves a high correlation for DR specificity and a moderate correlation for DRP and acetaldehyde.
Table S2 X-ray data collection and refinement statistics for the Ec DERA variants. Numbers in the brackets are for the highest resolution shells.

Fig S6
The empirical accuracy of the 18 ML-guided Ec DERA variants from the 3rd mutagenesis round, which were predicted to have low activity on DRP and DR and maximal activity on acetaldehyde. Because the DRP and DR specificities were removed, the predicted performance is high. The 18 Ec DERA variants estimated to have a high acetaldehyde specificity had a correlation of 0.956 with the observed specificities.

Fig S7
UPLC chromatogram at 360 nm for detection of the 2,4-DNPH-derivatised products of the DERA-catalysed addition of formaldehyde and acetaldehyde (blue line) and the Kp PDOR-catalysed oxidation of 1,3-propanediol (red line). The reaction products of both the DERA activity and the PDOR activity can be seen at around 0.6 min. Both DERA substrates, formaldehyde and acetaldehyde, are still present after 20 h of incubation and elute at around 1 min and 1.5 min, respectively.

Fig S8
600 MHz 1H NMR spectrum of the product 3-hydroxypropanal (A) and its hydrated form (B) from the DERA reaction of formaldehyde with acetaldehyde in 50 mM Na-phosphate buffer, pH 6.8, at 22°C. The assignments of the product signals to the 1H atoms are indicated by the corresponding numbers.

Fig S9
The best Ec DERA variants in the formaldehyde and acetaldehyde aldol addition reaction.
Activities in four different assays, i.e. 2-deoxyribose 5-phosphate (DRP) and 2-deoxyribose (DR) cleavage, acetaldehyde addition, and formaldehyde + acetaldehyde addition, are shown as relative activities compared to the wild-type Ec DERA (WT).

Table S3
The residue substitution matrices used in the model from the AAindex database.

Text S1 Detailed description of the development of machine learning (ML) models for DERA mutant screening.
Machine learning (ML) models were trained to automatically predict substrate specificities of DERA mutants based on Gaussian processes.

Gaussian processes
We used a Gaussian process (GP) prior for the function f that predicts the substrate specificity y(x) ∈ ℝ of a DERA protein variant x. Each specificity was modelled as a separate function. Gaussian processes are a family of non-parametric, non-linear Bayesian models (Rasmussen and Williams 2006) defined through a kernel function k(x, x'). The key property of Gaussian processes is that they encode functions that predict similar specificity values f(x), f(x') for protein variants x, x' that are similar, where the similarity is encoded by the kernel k(x, x'). The key part of GP modelling is then to infer a kernel that measures the effects of mutations on the specificities.
Moreover, we assumed an additive noise model in which an observation y was assumed to follow the true latent function f(x) corrupted by zero-mean Gaussian noise with variance σ²,

y = f(x) + ε,  ε ~ N(0, σ²),

and we assumed a Gaussian process prior for the latent function f. A dataset of noisy specificity values y = (y_i)_{i=1}^n ∈ ℝⁿ corresponding to the protein mutants x_1, …, x_n was collected, which then followed a Gaussian likelihood p(y | f, σ²).
We were interested in modelling the specificity of a new protein variant x*. A Gaussian process defines a joint Gaussian distribution over the observed values y of the variants x and the unknown function value f(x*) of the unseen variant x*,

(y, f(x*)) ~ N( 0, [ K + σ²I, k* ; k*ᵀ, k(x*, x*) ] ),

where K ∈ ℝⁿˣⁿ is the kernel matrix with elements K_ij = k(x_i, x_j), k* ∈ ℝⁿ is a kernel vector with elements k(x_i, x*) for all i = 1, …, n, and k(x*, x*) is the self-kernel. Conditioning gives the posterior distribution of the specificity prediction,

f(x*) | y ~ N( μ(x*), σ²(x*) ),

where the prediction mean and variance are

μ(x*) = k*ᵀ (K + σ²I)⁻¹ y,
σ²(x*) = k(x*, x*) − k*ᵀ (K + σ²I)⁻¹ k*.

Hence, in GP regression the specificity predictions μ(x*) ± σ(x*) come with uncertainty estimates in the form of Gaussian standard deviations. See Figure 1 for an illustration.
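As a concrete illustration of the prediction equations above, the following sketch implements plain GP regression with NumPy. The RBF kernel and the toy inputs are illustrative assumptions for this sketch only; the actual model in this work uses the structure-based graph kernel described in the next section.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Squared-exponential kernel: a stand-in for the graph kernel used in the paper.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, kernel, noise_var=0.1):
    # GP regression posterior:
    #   mu(x*)  = k*^T (K + s2 I)^-1 y
    #   var(x*) = k(x*, x*) - k*^T (K + s2 I)^-1 k*
    K = kernel(X, X)
    k_star = kernel(X, X_star)
    A = K + noise_var * np.eye(len(X))
    mu = k_star.T @ np.linalg.solve(A, y)
    var = np.diag(kernel(X_star, X_star)) - np.einsum(
        "ij,ij->j", k_star, np.linalg.solve(A, k_star))
    return mu, var
```

Predictions are then reported as mu ± sqrt(var), i.e. with Gaussian uncertainty bars, exactly as in the text above.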

Graph kernel
Next, we considered how to compute the similarity function k(x, x') between DERA variants. The 3D structural information of the protein variants was encoded as a contact map based on the PDB structure 1JCL, and their similarity was measured within the formalism of graph kernels (Vishwanathan et al. 2008).
Two residues were considered to be in contact if their closest atoms were within 5 Å of each other in the PDB structure, as illustrated in Figure 1. All variants of the same protein have the same length and differ only at the mutated positions. We also assumed all variants to share the wild-type protein contact map.
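The closest-atom contact criterion can be sketched as follows; the coordinate and residue-index arrays are illustrative assumptions, since the actual map was derived from PDB entry 1JCL.

```python
import numpy as np

def contact_map(atom_coords, atom_resid, n_res, cutoff=5.0):
    # Two residues are in contact if their closest atoms are within
    # `cutoff` Angstroms. atom_coords: (n_atoms, 3) array of coordinates;
    # atom_resid: residue index (0..n_res-1) of each atom.
    cmap = np.zeros((n_res, n_res), dtype=bool)
    d = np.linalg.norm(atom_coords[:, None, :] - atom_coords[None, :, :], axis=-1)
    for i in range(n_res):
        for j in range(i + 1, n_res):
            pair = np.outer(atom_resid == i, atom_resid == j)
            if d[pair].min() <= cutoff:         # closest-atom distance
                cmap[i, j] = cmap[j, i] = True
    return cmap
```

The resulting boolean matrix defines, for every position, the neighbourhood used by the kernel below.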
The kernel iterated over all positions i and compared, for each of them, the residues of the two variants through a substitution matrix S ∈ ℝ²⁰ˣ²⁰. Furthermore, the similarity of the residues at each position was multiplied by the average similarity of the neighbourhood, defined as a product of neighbourhood residue substitution and neighbourhood contact substitution matrices C ∈ ℝ⁴⁰⁰ˣ⁴⁰⁰. Hence, the kernel defined the similarity of two protein variants as the average position-and-neighbourhood similarity over all positions.
This weighted decomposition kernel (WDK) allowed us to compare the effects of multiple simultaneous mutations.
However, as the wild-type Ec DERA protein structure was used for all protein variants, changes that the mutations may cause to the protein structure were not taken into consideration. This may have caused problems if mutations that alter the protein structure significantly were introduced, especially if many of them were introduced simultaneously. On the other hand, substitution matrices that have their basis in sequence comparisons take these effects into account to some extent, as such mutations are usually highly destabilising and do not occur often in nature. In the next section, we discuss how we utilise different substitution matrices with multiple kernel learning.
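The position-plus-neighbourhood comparison can be sketched as below. For brevity this sketch uses only a 20×20 residue substitution matrix and omits the 400×400 contact substitution matrix; the integer-encoded sequences and the neighbour dictionary are illustrative assumptions.

```python
import numpy as np

def wdk_kernel(seq_a, seq_b, neighbors, S):
    # Sketch of a weighted decomposition kernel: per-position residue
    # similarity, weighted by the average substitution similarity of the
    # position's contact-map neighbourhood (shared wild-type contact map).
    n = len(seq_a)
    total = 0.0
    for i in range(n):
        pos_sim = S[seq_a[i], seq_b[i]]
        nbrs = neighbors.get(i, [])
        nbr_sim = np.mean([S[seq_a[j], seq_b[j]] for j in nbrs]) if nbrs else 1.0
        total += pos_sim * nbr_sim
    return total / n          # average over all positions
```

With an identity substitution matrix the kernel returns 1 for identical sequences, and a mutation lowers the score both at its own position and in every neighbourhood containing it, which is how simultaneous mutations interact in the comparison.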

Substitution matrices and multiple kernel learning (MKL)
The BLOSUM substitution models have been a common choice for protein models (Giguère et al. 2013), while mixtures of substitution models were proposed by Cichonska et al. (2017). BLOSUM matrices score amino acid substitutions by their appearances throughout evolution, as they compare the frequencies of different mutations in similar blocks of sequences (Henikoff and Henikoff 1992).
However, there are also other ways to score amino acid substitutions, such as chemical similarity and neighbourhood selectivity (Tomii and Kanehisa 1996). When the stability effects of mutations are evaluated, the frequency of an amino acid substitution in nature may not be the most important factor.
To take into account different measures of similarity between amino acids, we employed all 92 amino acid substitution matrices with no missing entries gathered from the AAindex2 database (genome.jp/aaindex) (Kawashima et al. 2008) (see Table S3). We also gathered all 43 valid contact potential matrices from the AAindex3 database (see Table S4). We ensured the positive definiteness of the substitution matrices by making them symmetric by averaging, truncating any negative eigenvalues, and scaling them to the range [0, 1]. These substitution matrices were used for computing the residue and contact matrices. Finally, multiple kernel learning (MKL) was used to find an optimal combination of the base kernels of the form

k(x, x') = Σ_m Σ_l w_m v_l k_{m,l}(x, x'),

where w_m ≥ 0 and v_l ≥ 0 are the substitution-specific weights. A zero weight indicates that the corresponding substitution matrix was not used. We observed empirically that the optimal kernel weights w_m tended to be sparse (see Fig S4).
The selected substitution matrices are listed in Fig S4. These matrices have different bases, and through multiple kernel learning (MKL) our model learns which of them are important for inferring the specificity effects that mutations cause in different proteins. Fig S4 illustrates this by showing the average weights of the base kernel matrices obtained via multiple kernel learning.
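The positive-definiteness preprocessing and the weighted combination of base kernels can be sketched as follows; the order of operations and the entrywise min-max rescaling are one reading of the description above, not a verbatim reproduction of the original code.

```python
import numpy as np

def make_psd_substitution_matrix(M):
    # Symmetrise by averaging, truncate negative eigenvalues,
    # then rescale entries to [0, 1] (entrywise min-max; an assumption
    # about how "scaling to [0, 1]" was implemented).
    M = 0.5 * (M + M.T)
    w, V = np.linalg.eigh(M)
    M = (V * np.clip(w, 0.0, None)) @ V.T   # V diag(max(w, 0)) V^T
    lo, hi = M.min(), M.max()
    return (M - lo) / (hi - lo) if hi > lo else M

def combine_kernels(base_kernels, weights):
    # MKL combination: non-negatively weighted sum of base kernel matrices.
    # A zero weight drops the corresponding substitution matrix entirely.
    return sum(w * K for w, K in zip(weights, base_kernels))
```

In the full model each base kernel is a WDK built from one residue substitution matrix and one contact potential matrix, and the weights are learned jointly with the other parameters (next section).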

Parameter Inference
The complete model has three parameter sets θ = (σ², w, v), for a total of 136 parameter values to infer, where the noise variance σ² ∈ ℝ₊ represents the observation uncertainty, and the residue substitution and contact potential weights w ∈ ℝ⁹² and v ∈ ℝ⁴³ parameterise the optimal kernel model. In a Gaussian process regression model these can be tractably optimised by the marginal log likelihood

log p(y | x, θ) = log ∫ p(y | f, σ²) p(f | x, w, v) df ∝ −½ yᵀ (K + σ²I)⁻¹ y − ½ log |K + σ²I|,

which automatically balances model fit (the quadratic term) and model complexity (the determinant term) to avoid overfitting (Rasmussen and Williams 2006). The parameters can be optimised by maximising the above marginal log likelihood using gradient ascent, since the marginal likelihood can be differentiated analytically. We utilised a limited-memory projected quasi-Newton algorithm (minConf-TMP), described by Schmidt et al. (2009).
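The marginal log likelihood above (up to its constant −(n/2) log 2π term, dropped in the proportionality) can be evaluated stably with a Cholesky factorisation, as in this sketch:

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var):
    # log p(y | theta) up to the constant -(n/2) log(2*pi):
    #   -1/2 y^T (K + s2 I)^-1 y - 1/2 log|K + s2 I|
    A = K + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(A)                  # A = L L^T, numerically stable
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                # quadratic (model-fit) term
    complexity = -np.log(np.diag(L)).sum()     # -1/2 log|A| via Cholesky
    return data_fit + complexity
```

This scalar objective is what the projected quasi-Newton optimiser maximises with respect to (σ², w, v), with the projection enforcing the non-negativity constraints on the kernel weights.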

Validation
The accuracy of our predictions was evaluated using the same metrics that have been used by many others: the correlation between the predicted and experimentally measured specificity values, and the root mean square error. Marginal likelihood maximisation was used to infer the model parameters.
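The two validation metrics can be computed as, for example:

```python
import numpy as np

def correlation_and_rmse(y_true, y_pred):
    # Pearson correlation and root mean square error between the
    # experimentally measured and predicted specificity values.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return r, rmse
```

Note that the correlation is invariant to a constant offset or positive rescaling of the predictions, whereas the RMSE is not, which is why both are reported.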


Table S3
The residue substitution matrices used in the model from AAindex database.
List of 92 residue matrices from AAindex ver 9.1 (www.genome.jp/aaindex)
ALTS910101 The PAM-120 matrix (Altschul, 1991)
AZAE970101 The single residue substitution matrix from interchanges of
AZAE970102 The substitution matrix derived from spatially conserved motifs
Quasichemical energy of transfer of amino acids from water to the protein
MIYS990107 Quasichemical energy of interactions in an average buried environment
MOOG990101 Quasichemical potential derived from interfacial regions of protein-protein
SIMK990101 Distance-dependent statistical potential (contacts within 0-5 Angstroms)
SIMK990102 Distance-dependent statistical potential (contacts within 5-7.5 Angstroms)
SIMK990103 Distance-dependent statistical potential (contacts within 7.5-10 Angstroms)
SIMK990104 Distance-dependent statistical potential (contacts within 10-12 Angstroms)
SIMK990105 Distance-dependent statistical potential (contacts longer than 12 Angstroms)
SKOJ000101 Statistical quasichemical potential with the partially composition-corrected
SKOJ000102 Statistical quasichemical potential with the composition-corrected pair scale
SKOJ970101 Statistical potential derived by the quasichemical approximation
TANS760101 Statistical contact potential derived from 25 x-ray protein structures
TANS760102 Number of contacts between side chains derived from 25 x-ray protein structures
THOP960101 Mixed quasichemical and optimization-based protein contact potential
TOBD000101 Optimization-derived potential obtained for small set of decoys
TOBD000102 Optimization-derived potential obtained for large set of decoys
VENM980101 Statistical potential derived by the maximization of the perceptron criterion
ZHAC000101 Environment-dependent residue contact energies (rows = helix, cols = helix)
ZHAC000102 Environment-dependent residue contact energies (rows = helix, cols = strand)
ZHAC000103 Environment-dependent residue contact energies (rows = helix, cols = coil)
ZHAC000104 Environment-dependent residue contact energies (rows = strand, cols = strand)
ZHAC000105 Environment-dependent residue contact energies (rows = strand, cols = coil)
ZHAC000106 Environment-dependent residue contact energies (rows = coil, cols = coil)