Protein crystal quality oriented disulfide bond engineering

A disulfide bond that formed between the thiol groups of two spatially close cysteine residues is essential for protein folding, stability, and function (Creighton et al., 1995) (Fass, 2012). Driven by conformational entropy, native disulfide bonds stabilize the conformation of protein molecules (Dill, 1990), while removal of native disulfides usually causes reduced stability of the target protein (Liu and Cowburn, 2016). Previous studies showed that proper introduction of disulfide bonds could stabilize the flexible region of target proteins and reduce the conformational entropy by locking the protein into single desired conformation (Matsumura et al., 1989; Craig and Dombkowski, 2013). Entropy is one of the essential features for protein crystallization (Shaw et al., 2007). Properly engineered disulfide bonds have been shown to decrease protein’s entropy, thus frequently used as strategy for high-resolution structure determination. G protein-coupled receptors (GPCRs) are a family of membrane proteins that include seven membrane-spanning α-helices (7TM) connected by three loops on each side. Most GPCR crystals were obtained by replacing the N-termini or the 2nd or 3rd intracellular loops with fusion proteins. Introducing a disulfide bond into fusion protein T4 lysozyme (T4L) had not only stabilized the fusion partner itself but also improved the crystal quality of the GPCR-T4L fusion protein (Thorsen et al., 2014). Moreover, disulfide bond has also been applied directly in the extracellular portion of lysophosphatidic acid receptor 1 (LPA1) to solve its highresolution structure (Chrencik et al., 2015). To precisely predict sites for disulfide bonds, efforts have been made to analyze the key features and several computational methods have been developed (Ceroni et al., 2006; Tsai et al., 2007). Several software predicted native disulfide bonds with high accuracy (∼85%), but predictions of engineered disulfide bonds were not experimentally validated (Ferre and Clote, 2005). Here we developed a comprehensive disulfide bond prediction algorithm that not just increased the success rate of predictions but also improved the quality of crystallized target proteins. New parameters were incorporated in the algorithm, including chemical environment of predicted sites, overall stabilities and conformational entropy changes, the geometric deviations with pre-existing native disulfide bonds in solved high-resolution protein structures. All those parameters were combined into a weighted scoring algorithm where machine learning and data mining of the structures deposited in Protein Data Bank (PDB) were used to train and optimize the weighting scheme. We applied our method on two proteins which were previously determined to high resolution and frequently used as fusion partners for GPCR crystallization, cytochrome b562 (BRIL) and Flavodoxin, and verified our prediction by solving the crystal structures of the wild type (WT) proteins and mutants. Furthermore, our algorithm was applied to a previously unsolved GPCR and we successfully solved its highresolution structure. We analyzed the features of native disulfide bonds from experimental data set and incorporated the output into our algorithm (Supplementary). These features include: i) the distances between each pair of C, O, N, Cα, Cβ, Sγ atoms and the dihedral angles between each plane of C/Cα/Cβ, Cα/Cβ/N, and C/Cα/N (Fig. S1), ii) the five χ angles (χ , χ, χ, χ, and χ) (Fig. S2), iii) the local environment preference of disulfide bonds (Fig. S3). A schematic of the approach is briefly shown in Fig. 1A. For any two given residues, three functions were required for evaluating the possibility of forming disulfide bond between them: 1) PGeom, the geometrical probability; 2) PRMSD, the RMSD between predicted disulfide-linked cysteines and the geometrically closest naturally occurring ones in solved structures; 3) PΔS, the conformational entropy change induced by the engineered disulfide bond. The details of the calculation are described in Supplementary. Using the PRMSD, PΔS, and PGeom calculated from known disulfide bonds as variables, the prediction model was trained and optimized by implementing the Support Vector Machine method (SVM) (Lin and Chang 2011). In order to test our prediction program, two proteins, BRIL and Flavodoxin, were used as examples for disulfide bond engineering. For each target protein, every pair of residues was treated as potential disulfide bond candidate. The PRMSD, PΔS, and PGeom were calculated for each pair of candidate residues. Those pairs with either PΔS = 0 or PGeom = 0 were removed from candidate list. Then SVM prediction was performed and the final results were sorted by PGeom. The program generated lists of potential pairs of


Supplementary Figure 5 | Prediction of disulfide bonds on unknown structure.
A, Thermo-SEC characterization of the disulfide I317C-G361C construct. The purified proteins (apo) of I317C-G361C and the control (no mutation) were both kept at 37°C for 5 min and then loaded to the size exclusion chromatography. B, Crystal structure of GLP-1R and the electron densities around the engineered disulfide bonds.

Data sets
We chose 4,722 non-redundant protein structures from PDB with sequence identity <90% that contained at least one disulfide bond. Those structures were obtained using X-ray crystallography and the resolution of those experimental diffraction data is better than 2.5 Å. We then extracted the list of those native disulfide bonds from PDB header files of these non-redundant protein structures. Finally, 18,696 native disulfide bonds were collected from PDB, in which 241 disulfide bonds were from membrane proteins. The data were divided into training (90%) and validation (10%) samples.

Performance measure of the algorithm
The performance of connection prediction was evaluated by inputting the test set into our model and introducing artificial disulfides into BRIL and Flavodoxin. We also took into account the DSE (Yi and Khosla, 2016), and the probability of the neighbor preference to evaluate the output results of our models. The neighbor residue preference and the solvent accessibilities were calculated using Visual Molecular Dynamics (VMD) (Humphrey et al., 1996). To calculate the value of DSE, the form of empirical formula is (Bryan Schmidt, 2006;Schmidt et al., 2006): / = 8.37× 1 + cos 3 ℎ 1 + 8.37× 1 + cos 3 ℎ 5 + 4.18× 1 + cos 3 ℎ 2 + 4.18× 1 + cos 3 ℎ 4 + 14.64× 1 + cos 2 ℎ 3 + 2.51× 1 + cos 3 ℎ 3 The neighbor preference was calculated by where N is the order of neighbor residue; P is the probability of occurrence for possible amino acid characterized by statistics analysis; (!"#$%&) is the probability of the neighbor preference of potential mutant sites.

Prediction algorithm
The P Geom is defined as the following: where the P Cβ and P O are distance distribution probabilities between paired C β and carbonyl O of disulfide bonding cysteines, respectively, while P C/Cα/Cβ , P C/Cα/Cβ and P Cα/Cβ/N represent the dihedral angle distribution probabilities between paired corresponding planes. PΔS is defined as the following: where n is the number of residues closed by loop resulted by the new disulfide bond and R is the universal gas constant (Pace et al., 1988).
The clustering of geometrical difference of disulfide bonds from the PDB was detected by MMTSB (Multiscale Modeling Tools for Structural Biology) and assessed by RMSD (Feig et al., 2004). The formula for calculating P RMSD is： where !"#,! is the object value, !"#$%,! is the value from the statistics analysis and n represent the number of structure of disulfide bonds from our data set. By using hierarchical clustering algorithm based on RMSD, six clusters from the dataset were generated. The RMSD between those six cluster centers and the target positions are calculated.

Molecular Dynamics Simulations
To investigate the changes of the mutated BRIL and Flavodoxin structures under room temperature 303K, we performed all atom molecular dynamics simulations for each system. The crystal structures were used as the starting conformation, and the simulation systems were constructed using Charmm-GUI Webserver (Jo et al., 2008;Lee et al., 2016). Constant pressure and temperature (NPT) MD simulations were performed with integration time step of 2 femtoseconds. Charmm36 force field (Mallajosyula et al., 2012) was used for the modeling of systems, and simulation trajectories were generated by using Gromacs 5.1.2 (Berendsen et al., 1995;Lindahl et al., 2001). The rectangular water box was added to ensure that the distance from protein to edge of box is 20 Å. We used TIP3P as the water model to solvate protein systems. Sodium chloride ions were added neutralize the systems and excess ions were used to obtain ion concentration of 150 mM NaCl. The temperature is controlled by Nosé-Hoover thermostat at 303K (Nosé, 1984;Hoover, 1985). The disulfide bonds for mutants were connected manually to generate the appropriate models. The simulation trajectories are equivalent to 200 nano seconds for each system.

Plasmid construction
The full-length BRIL and Flavodoxin genes were amplified from pfastbac plasmid (synthetized by GENEWIZ). The resultant gene fragments were ligated into pMCSG7 encoding His-TEV-BRIL and His-TEV-Flavodoxin. Mutations were designed based on this plasmid pMCSG7-BRIL and pMCSG7-Flavodoxin by using a QuikChange TM site-directed mutagenesis kit following the manufacturer's instructions (Stratagene, La Jolla, CA, USA). All the plasmids were confirmed by sequencing. The construct, expression, purification and crystallization of GLP-1R were described elsewhere (Song et al., 2017).

Protein Purification
Escherichia coli strain BL21-DE3 containing plasmid of protein with His-tag and mutations were grown while shaking (250 rpm) at 37°C in 1L of Luria-Bertani medium supplemented with ampicillin (100 µg/mL) until the optical density at 600 nm reached 1.2-1.5, at which time 1 mM isopropy-β-D-thiogalactoside was added. Then the TEV protease was added into the protein solution with a molar rate of 1:50, and placed at 4°C overnight.
After detected by protein gel electrophoresis, the protein and TEV protease mixture was added into a new Ni column and the protein was collected in the flow through, then the protein was concentrated to less than 1 ml before loaded to the gel filtration column Superdex 75 10/300GL column (GE Healthcare). The purified homogenous BRIL and Flavodoxin proteins were collected and concentrated to 15-25 mg/ml in buffer 50 mM Tris-HCl (pH 8.0) and 20 mM NaCl.

Crystallization and data collection
The WT BRIL and its mutants were crystallized by the hanging drop diffusion technique. The protein, at a concentration of 15-25 mg/ml, was stored in 50 mM Tris-HCl (pH 8.0) and 20 mM NaCl. The well solution contains 3.2 M ammonium sulfate, 0.1 M bicine (pH 9.0) N-octanoylsucrose (Reagent 17 of the Detergent Screen, Hampton Research) was added to the crystallization drops to a final concentration of 2.44 mM (Chu et al., 2002). The final volume of crystallization drops was 4 µl, containing protein solution, well solution, and additive solution at a ratio of 5:4:1 (by vol.). The WT Flavodoxin and its mutants were also crystallized by the hanging drop diffusion technique. Crystals of Flavodoxin were grown in a buffer containing 3.1 M or 3.2 M (NH 4 ) 2 SO 4 and 0.1 M Tris-HCl pH 7.5 or pH 7.0 (Ross A. Reynolds, 2001).
High diffraction crystals of Flavodoxin appeared after 3-4 weeks with concentrations between 15 and 20 mg/ml. Crystals were frozen in liquid nitrogen prior to diffraction testing and data collection.

Data processing and structure determination
Native diffraction data were collected at a wavelength of 0.979 Å at beamline BL17U1 at SSRF. The dataset was indexed, integrated, and scaled using the HKL2000 software package (Otwinowski and Minor, 1997). The structure was solved by molecular replacement method (McCoy et al., 2007) using wild-type BRIL (PDB:1M6T) and Flavodoxin (PDB: 1J8Q) as search models respectively. Refinement was carried out using PHENIX Refine (Adams et al., 2010). The refinement parameters were summaried in Supplementary Table 1 and 2.

Mass Spectrum experiment
The protein samples were diluted with PBS to adjust to final disulfide concentration of 40 µM in 1 ml volume. After addition of 50 µl 4 mM DTDP reagent, the sample was immediately mixed and the mix sample reacted in room temperature for 2 h.
Then the sample was injected into 6320 TOF LC/MS system (Agilent Technologies, USA). The data were analyzed by using Agilent MassHunter Qualitative analysis.