Determining structural ensembles of flexible multi-domain proteins using small-angle X-ray scattering and molecular dynamics simulations

A multi-domain protein consists of two or more well-folded domains connected by flexible linkers, which may lead to large scale inter-domain motions related to the protein function. Therefore, it is appropriate to represent a flexible multi-domain protein by an ensemble of structures containing multiple conformational states (Bernadó and Blackledge, 2010). Although X-ray crystallography is currently the most popular technique for structure determination, it would be rather challenging to solve structure of such a flexible multidomain protein since it is hard to obtain high-quality crystals if the protein does not stabilize in a dominant conformation. Solution nuclear magnetic resonance (NMR) is able to investigate structure and dynamics of flexible multi-domain proteins with modest sizes, but it may not be easy to obtain enough long-range NMR restraints in order to determine orientations between the domains. Small-angle X-ray scattering (SAXS) has made substantial progress over the past decades (Graewert and Svergun, 2013), which can provide valuable structural information, such as the sizes and shapes, of proteins. SAXS is particularly useful in characterizing the flexibility of a multi-domain protein because the scattering profile contains the information of multiple conformations of the protein and their relative population in solution. Although the resolution of SAXS is inherently low since the complex protein structure is reduced to a onedimensional orientationally averaged profile, it can serve as a complementary experiment to these high-resolution techniques. For example, structures of individual domains can be solved by NMR, and SAXS provides a restraint between the domains (Grishaev et al., 2005). Computer simulations play important roles in combining the high and low-resolution structural data to investigate flexible multi-domain proteins (Bernadó and Blackledge, 2010; Yang, 2014). In such applications, many groups have used a screening-after-sampling strategy (Bernadó et al., 2007; Pelikan et al., 2009; Różycki et al., 2011; Schneidman-Duhovny et al., 2012; Wen et al., 2014; Yang et al., 2010), that is, a structure pool of the protein with a large number of different conformations is generated by simulations beforehand, and then an ensemble of structures is screened out of the pool to best reproduce the SAXS profile. Among various computational methods, molecular dynamics (MD) simulation is widely used in studying protein dynamics (Dror et al., 2012). With the increasing computer power and development of advanced algorithms, nowadays MD simulations can reach a time scale of microsecond (Arkhipov et al., 2013) to millisecond (Ma and Schulten, 2015). In this work, we show the potential of combining experimental SAXSdata and extensiveMDsimulations to determine structural ensembles of multi-domain proteins. The formin binding protein 21 (FBP21) is a structural component of the mammalian spliceosomal A/B complex and has functionality in pre-mRNA splicing (Bedford et al., 1998). NMR structure of its tandem WW domains (denoted as FBP21-WWs) has been solved that consists of 75 amino acid residues (Huang et al., 2009). The two WW domains, called WW1 (residues 6–32) and WW2 (residues 47–73), respectively, are connected by a flexible linker, which leads to high mobility between them and enables their cooperative interactions with different ligands. By observing 20 structural models in the NMR ensemble (PDB entry 2JXW), the internal structures of both WW1 and WW2 are converged, but their relative orientations are various (Fig. 1A). The N relaxation experiments also suggest that there exists significant inter-domain dynamics in FBP21-WWs. Therefore, it is important to explore relative domain orientations in the protein, which is however un-determined due to the lack of NMR restraints between the domains. We collected SAXS data of FBP21-WWs. The one-dimensional SAXS profile (red curve in Fig. 1B) shows a high signal-noise ratio with q up to 0.5 Å. The peak of pair distances distribution function (PDDF) is located at around 16.3 Å, but the maximal distance Dmax can be longer than 62.0 Å (Fig. 1C). This tailing shape of PDDF suggests that the average conformation of FBP21-WWs in solution is not

Multiple MD simulations of FBP21-WWs a Initial structure of the MD simulation taken from the NMR ensemble; b Simulation time for each set of MD run; c&d The two 100 ns MD simulations of the same model were run independently starting from different initial atomic velocities.

Protein cloning, expression and purification
The gene encoding the tandem WW domains of human FBP21 (residues 122-196) was amplified by PCR from the human brain cDNA library, and cloned into the NdeI/XhoI sites of

SAXS data
The SAXS data of FBP21-WWs was collected at the beamline 12ID-B of the Advanced Photon Sources (APS) at Argonne National Laboratory, with a X-ray wavelength of 1.033 Å.
Data were acquired from three concentrations (1.0, 3.0, and 5.0 mg/ml), which were then processed by the ATSAS package (Konarev et al., 2006;Petoukhov et al., 2012). After subtracting buffer scattering, the data curves from different concentrations were scaled and merged using PRIMUS (Konarev et al., 2003). The R g of the protein was derived by Guinier analysis using AutoRg. GNOM (Semenyuk and Svergun, 1991) was employed for calculating the PDDF.

MD simulations
Each MD simulation was performed using the GROMACS-4.5.5 package  and the CHARMM27 force field (MacKerell et al., 1998). The protein was firstly put in a rhombic dodecahedron box with periodic boundary condition. The minimum distance between the solute and the box boundary was 1.2 nm. The box was then filled with TIP3P water molecules (Jorgensen et al., 1983). The system was energy-minimized by the steepest descent method, until the maximum force on any atom was smaller than 1000 kJ mol -1 nm -1 .
After adding seven Na + ions to neutralize the charges on the protein, the system was energy-minimized again using the steepest descent and then the conjugate gradient method, until the maximum force on any atom was smaller than 50 kJ mol -1 nm -1 . The simulation was conducted by using the Verlet integration scheme with a 2 fs time step (Hockney et al., 1974).
Before the production run, a 100 ps equilibration simulation with positional restraint was performed, using a force constant of 1000 kJ mol -1 nm -2 . The initial atomic velocities were generated according to a Maxwell distribution at 310 K. The simulation was performed under the constant NPT condition. The protein, solvent and ions were coupled separately to a temperature bath of 310K using the velocity rescaling algorithm (Bussi et al., 2007), with a relaxation time of 0.1 ps. The pressure was kept to 1 bar with a relaxation time of 0.5 ps and the compressibility of 4.5×10 -5 bar -1 (Berendsen et al., 1984). The P-LINCS algorithm (Hess, 2008) was used to constrain all covalent bonds. The twin-range cutoff distances for the van der Waals interactions were set to be 0.9 and 1.4 nm, respectively, and the neighbor list was updated every 10 fs. The long-range electrostatic interactions were treated by the PME algorithm, with a tolerance of 10 -5 and an interpolation order of 4 (Essmann et al., 1995).
The MD trajectories were analyzed by these tools in the GROMACS package. All the structures were created by VMD (Humphrey et al., 1996).

Principal component analysis
PCA on the 6 µs trajectory of FBP21-WWs consisted of following steps.
(1) Snapshots were taken from the trajectory every 200 ps, so 30,000 conformations in total were used for PCA.
Only 54 C α atoms in the two WW domains were considered.
(2) All the conformations were  (Berendsen and Hayward, 2000;Kitao and Go, 1999). These modes are also termed essential modes, and the subspace spanned by the essential modes is called essential subspace (Amadei et al., 1993).

Ensemble optimization method
EOM (Bernadó et al., 2007) where K is the number of data points, and 4 sin q     is the momentum transfer where 2θ is the scattering angle and λ is the wavelength. σ(q) are standard deviations, and μ is a scaling factor. I(q) is the average of SAXS profiles from the ensemble where N is the number of conformations in the ensemble that would be much smaller than those in the trajectory. I n (q) is the theoretical SAXS profile of a single structure n calculated by CRYSOL (Svergun et al., 1995). In EOM, χ (Eqn. 3) is minimized by using the genetic algorithm, and then the optimal ensemble is picked.