CHDOCK: a hierarchical docking approach for modeling C n symmetric homo-oligomeric complexes

Protein–protein interactions are crucial in many biological processes. Therefore, determining the complex structure between proteins is valuable for understanding the molecular mechanism and developing drugs. Many proteins like ion channels are formed by symmetric homo-oligomers. In this study, we have proposed a hierarchical docking algorithm to predict the structure of C n symmetric protein complexes, which is referred to as CHDOCK. The symmetric binding modes were ﬁrst constructed by an FFT-based docking algorithm and then optimized by our iterative scoring function for protein–protein interactions. When tested on a symmetric protein docking benchmark of 212 homo-oligomeric complexes with C n symmetry, CHDOCK obtained a signiﬁcantly better performance in binding mode predictions than three state-of-the-art symmetric docking methods, M-ZDOCK, SAM, and SymmDock. When the top 10 predictions were considered, CHDOCK achieved a success rate of 44.81% and 72.17% for unbound docking and bound docking, respectively in comparison to those of 36.79% and 65.09% for M-ZDOCK, 31.60% and 54.25% for SAM, and 30.66% and 31.60% for SymmDock. CHDOCK is computationally efﬁcient and can normally complete a symmetric docking calculation within 30 min. The CHDOCK can be freely accessed by a web server at http://huanglab.phys.hust.edu. cn/hsymdock/.


INTRODUCTION
Protein-protein interactions are crucial in many biological processes like signal transduction, intracellular trafficking, and immune recognition. Among all proteinprotein interactions, a significant portion is formed by symmetric homo-oligomers (Andre et al. 2008;Goodsell and Olson 2000;Poupon and Janin 2010). According to the Protein Data Bank (PDB) (Berman et al. 2000), more than one third of the proteins have some types of symmetry. For example, many transmembrane proteins like ion channels are formed by symmetric homooligomer assemblies. The symmetry of homo-oligomeric proteins is thought to be associated with many potential benefits like greater stability, reduced aggregation, and robustness to errors in synthesis (Andre et al. 2008;Goodsell and Olson 2000). The interface between symmetric homo-oligomers is often the targeting site for regulating the biological processes (Petsalaki and Russell 2008). Therefore, determining the complex structure of symmetric proteins is important (Lensink et al. 2016(Lensink et al. , 2018. Theoretically, one can use a general protein-protein docking approach to predict the complex structure of symmetric homo-oligomers by docking one monomer against the other (Comeau et al. 2004;de Vries et al. 2010de Vries et al. , 2015Torchala et al. 2013;Tovchigrechko and Vakser 2006). However, such a general docking strategy is not efficient for symmetric homooligomers. On one hand, the general protein-protein docking approach treats two interacting partners as different proteins and therefore often don't generate the complex structures with strict symmetry; On the other hand, general protein-protein docking normally don't consider the symmetry restraints during the docking process, and therefore is not computationally efficient. Therefore, specialized protein-protein docking algorithms are needed for predicting the complex structure of symmetric protein homo-oligomers.
One important symmetry in proteins is cyclic symmetry (C n ), for which the oligomeric structure can be constructed by n consecutive rotations of 360°/n around a single rotational axis of one subunit (Andre et al. 2008). Despite the importance of symmetric protein homo-oligomers, only a few algorithms have been developed for symmetric protein docking. Wolfson et al. developed a fast docking algorithm for cyclically symmetric complexes through local feature matching, which is referred to as SymmDock (Schneidman-Duhovny et al. 2005). SymmDock constructs the symmetric homooligomer complexes by restricting the search to symmetric cyclic transformations. The Weng group developed an FFT-based algorithm for symmetric protein-protein docking by restricting the search space with cyclic symmetry (M-ZDOCK) (Pierce et al. 2005). Based on the symmetric protein complexes in the PDB, several web servers that use template-based methods like ROBETTA (DiMaio et al. 2011), SWISS-MODEL (Biasini et al. 2014, and GalaxyGemini (Lee et al. 2013) have also been proposed to predict the homo-oligomeric structure. In addition, Ritchie and Grudinin presented a fast docking algorithm, which is named SAM, for predicting the symmetrical models of protein complexes with arbitrary point group symmetry through a spherical polar FFT-based algorithm (Ritchie and Grudinin 2016). Very recently, the Seok group has developed a combination modeling approach, GalaxyHomomer, for homo-oligomer structure prediction from a monomer sequence or structure by template-based modeling if homologous complexes are available in the PDB or ab initio docking (Baek et al. 2017).
However, despite the significant progress in the development of symmetric docking algorithms, there is still much room in improving the docking accuracy. Recently, we have developed a new pairwise shapebased scoring approach to consider long-range interactions (LSC) of protein atoms by an exponential form in FFT-based protein-protein docking. Tested on general protein-protein complexes, our LSC approach showed a significant advantage over the traditional grid-based method (Yan and Huang 2018). Extending the LSC approach to symmetric complexes, we have here developed a fast ab initio docking approach for the symmetric docking of homo-oligomers with C n symmetry by an FFT-based search algorithm with LSC, which is referred to as CHDOCK.

Comparison with other programs
We have tested our symmetric docking algorithm CHDOCK on the bound and unbound structures of our symmetric protein docking benchmark of 212 C n targets (Yan and Huang 2019). Table 1 lists the success rates of CHDOCK in binding mode predictions for bound and unbound docking on the 212 cases with C n symmetry when the top 1, 10, and 100 predictions are considered. The corresponding results are also shown in Fig. 1 Similar trends can also be observed in the results for unbound docking, though the performance differences among different algorithms are not as much as those for bound docking due to the impact of conformational changes in the unbound structures. Namely, CHDOCK also performed significantly better than the other three methods for unbound docking and obtained a success rate of 30.66%, 44.81%, and 68.40% for top 1, 10, and 100 predictions, respectively, in comparison to those of 26.42%, 36.79%, and 66.51% for M-ZDOCK, 19.34%, 31.60%, and 63.68% for SAM, and 11.79%, 30.66%, and 58.49% for SymmDock (Table 1 and Fig. 1).
Besides the success rate of docking, we have also compared the average root mean square deviation (RMSD) of 'hit(s)' (i.e., successful binding mode predictions) for both bound docking and unbound docking with the other three programs when the top 1, 10, 100 predictions were considered. The results are listed in Table 2 and the corresponding results are shown in Fig. 2. From Table 2 and Fig. 2, we can see that CHDOCK also performed much better and obtained more accurate binding modes than the other three programs for both bound docking and unbound docking. For bound docking, CHDOCK obtained an average RMSD of 1.10, 1.51 and 2.11 Å for top 1, 10 and 100 predictions, respectively, in comparison to those of 2.27, 2.80 and 3.46 Å for the second-best method M-ZDOCK. As for unbound docking, similar results can also be observed. CHDOCK obtained an average RMSD of 2.54, 3.12 and 4.07 Å for top 1, 10, 100 predictions, respectively, while M-ZDOCK obtained a higher RMSD of 3.26, 3.86 and 5.07 Å. Interestingly, one can also note that among the four docking programs, if a method performs better in the success rate of binding mode prediction, it also performs better in the average RMSD of 'hits'. That means, the performance comes from both the number and the quality of successful predictions.

Performance of scoring function
To investigate the performance of our scoring function, we also tested our pure FFT-based docking, named CHDOCK_lite, on the benchmark, which only uses the    Fig. 3. It can be seen from the figure that CHDOCK performed much better than CHDOCK_lite. With the help of our scoring function ITScorePP (Huang and Zou 2008), the success rate of bound docking for top 1 prediction increased from 21.70% to 55.19% and for unbound docking, the success rate increased from 11.32% to 30.66%. The great improvement of CHDOCK compared to CHDOCK_lite demonstrates the important role of our scoring function.

Discussions
CHDOCK and M-ZDOCK are both the three-dimensional (3D) FFT-based docking algorithms and adopt the similar sampling strategy. However, the difference between CHDOCK and M-ZDOCK is that CHDOCK adopts a better shape complementarity score LSC (Yan and Huang 2018) and a more powerful scoring function ITScorePP (Huang and Zou 2008). In our previous study on hetero protein complexes (Yan and Huang 2018), LSC has shown its better performance than PSC (Chen and Weng 2003) used in M-ZDOCK. ITScorePP also showed a better performance in scoring decoys and finding the near native structures (Huang and Zou 2008). Therefore, the better performance of CHDOCK than M-ZDOCK would be attributed to both the shape complementarity score LSC and our scoring function ITScorePP. Although CHDOCK has achieved better performance than the other three docking programs, the success rate for top 1 prediction is still not high, especially for unbound docking. There are much room to improve the existing Examples of the docking model Figure 4 shows the top binding modes predicted by our CHDOCK for both bound and unbound docking on three example targets. It can be seen from the figure that the predicted complexes overlap well with the experimental native structures, and give a ligand RMSD of 0.42 and 4.03 Å for C 2 symmetric target 1MSC, 0.92 and 3.38 Å for C 4 symmetric target 1OVO, and 0.95 and 1.20 Å for C 6 symmetric target 1KQ1, respectively. The good consistency between the predicted and native structures in both bound and unbound docking demonstrates the reliability of our CHDOCK.

CONCLUSION
We have developed a hierarchical docking algorithm for predicting the complex structures of homo-oligomers with C n symmetry, which referred to as CHDOCK. The C n symmetric binding modes were first generated by an FFT-based docking algorithm, in which a shape complementarity scoring function was used to consider long-range interactions. Then, the binding modes with best shape complementarity were optimized with our iterative scoring function for protein-protein interactions. Our symmetric docking algorithm CHDOCK was evaluated on a diverse benchmark of 212 C n symmetric protein complexes from the PDB, and was compared with three state-of-the-art symmetric docking approaches including M-ZDOCK, SAM, and SymmDock. It shows that CHDOCK achieved a significantly better performance than the other three docking methods in both the number and the quality of successful predictions for bound docking and unbound docking. The results demonstrate the strong predictive power of our hierarchical docking algorithm CHDOCK in modeling C n symmetric protein complexes.

FFT-based translational search
The putative symmetrical complexes were constructed from a monomer or subunit in 3D translational space by a modified version of our general FFT-based docking algorithm (Yan et al. 2017;Yan and Huang 2018). Specifically, we first made two copies of the subunit or Fig. 4 Comparisons between the top predicted binding modes and native structures for three targets 1MSC (C2 symmetry) (A), 1OVO (C4 symmetry) (B) and 1KQ1 (C6 symmetry) (C). The native structure is colored in pink and the predicted structure is colored by chains. For each column, the upper and lower ones are for bound docking and unbound docking, respectively monomer. One was called ''receptor'' subunit and the other ''ligand'' subunit. For docking with C n symmetry, the receptor subunit was fixed and the ligand subunit was rotated by an angle of 360°/n around the z-axis. To perform an FFT-based search, both the receptor and ligand subunits needed to be mapped onto a 3D grid of N Â N Â N grid points (Chen and Weng 2003;Katchalski-Katzir et al. 1992). The grid points within the VDW radius of any protein atoms were considered inside the molecule, and the others were considered as outside the protein. Here, the VDW radii for standard protein atoms were taken from the study by Li and Nussinov (1998). Then, the inside-protein grid points were divided into three parts: surface layer, nearsurface layer, and core region. It is defined that a grid point belonged to the surface layer if any of its neighboring grid points is outside the protein. Similarly, a grid point belonged to the near-surface layer if any of its neighbors is in the surface layer. All the other grid points except the surface and near-surface layers inside the protein were defined as the core region. According to the above definitions, one can see that the nearsurface layer and core region were normally occupied by the protein atoms, and the surface layer separated the inside protein from the outside space. Then, each grid point for the receptor (R) and ligand (L) subunits was assigned a complex value as: where J 2 ¼ À1, l, m, and n are the indices of the 3D grid (l; m; n ¼ 1; . . .; N), and r is the distance between the grid points of (i, j, k) and (l, m, n). Here, i 2 l À 3; l þ 3 ½ , j 2 m À 3; m þ 3 ½ and k 2 n À 3; n þ 3 ½ for the surface layer, and i 2 l À 1; l þ 1 ½ , j 2 m À 1; m þ 1 ½ and k 2 n À 1; n þ 1 ½ for the near-surface layer, respectively.
And also, the grid point (i, j, k) should belong to nearsurface layer or protein core. With the above mapping of the proteins on the grid, the shape complementarity score between two neighboring subunits of a symmetric complex around the zaxis can be generally expressed by the following equation (Chen and Weng 2003;Katchalski-Katzir et al. 1992): where o and p are the number of grid points by which the ligand (L) is shifted with respect to the receptor (R) in the x-y plane, respectively. There is no shift in the z-axis because the rotational axis is parallel to the z-axis, which reduces the sampling space in one translational dimension. The correlation of Eq. 3 can be calculated by an FFT-based algorithm. A higher correlation score means a better shape complementarity between two grids for a relative translation of (o, p) (Katchalski-Katzir et al. 1992).

Rotational sampling strategy
To perform a global sampling approach for putative binding modes, one needs to search the six-dimensional (i.e., 3 translational ? 3 rotational) space. The exhaustive search in 3D translational space can be performed by an FFT-based approach, as described in the previous section. The exhaustive search in the rotational space will be conducted in the space of Euler angles by taking into the C n symmetry restriction account. Specifically, the monomer subunit is rotated by an interval of Euler angles (/=0, Dh, Dw) in the rotational space, where the angular definition is based on the so-called ''x-axis convention''. Namely, / is the first rotation about the zaxis, h 2 0; p=2 ½ is the second rotation about the former x-axis (now x 0 ), and w [ (0,2p] is the third rotation about the former z-axis (now z 0 ). It is unnecessary to sample the / angles as the rotational axis is z-axis. In addition, h only needs to be sampled within 0; p=2 ½ instead of 0; p ½ because of the rotational symmetry. All these reduce the sampling space in the rotational space.
Then, for each rotation of the monomer subunit, an FFT-based algorithm was used to calculate the shape complementarity between the grids of the receptor and the ligand in the translational space. During the docking calculation, an angle interval of 10°was used for rotational sampling, and a spacing of 1.2 Å was adopted in discretizing proteins onto grids for FFT-based translational search. Evenly distributed Euler angles were used for the rotational search, resulting in a total of 360 orientations in the rotational space. For each rotation, up to the top 100 translations with best shape complementarities were kept and optimized by our scoring function ITScorePP (Huang and Zou 2008). The binding mode that corresponds to the best energy score in an FFT-based translational search was kept for each rotation of the ligand subunit, yielding a total of 360 ligand binding modes for a docking run. Our FFT-based docking algorithm is computationally efficient and on average can complete a full docking calculation in 30 min on a 2.6 GHZ Intel CPU core.

Scoring function
All the binding modes generated from the initial stage were evaluated by ITScorePP (Huang and Zou 2008) and minimized according to their binding scores by a SIMPLEX optimization method. The binding energy score is a summation of the binding scores over all the interfaces between the subunits of the predicted complex. The final ranked binding modes were clustered with an RMSD cutoff of 5 Å, where the RMSD was calculated using the backbone atoms. If two binding modes have a ligand RMSD of \5 Å, the one with the better score is kept.

Benchmark
Based on the protein complexes in the PDB, we have also constructed a non-redundant benchmark for our symmetric protein-protein docking. Briefly, all the homo-oligomeric protein complexes with C n symmetry were collected from the crystal structures with resolution better than 2.5 Å. The symmetry type of a complex was determined by its biological unit. The symmetric homo-oligomer complexes were then clustered according to their SCOP (version 1.75) family IDs (Lo Conte et al. 2000). For the complexes belonging to the same family, the one with the best resolution was selected as the representative, corresponding to a bound case of our benchmark, in which each subunit was called the bound structure of the complex. For the bound structure in each bound case, the unbound structure was identified by searching against the PDB database for the asymmetric structures using the BLASTP (proteinprotein BLAST) algorithm of the BLAST package (Camacho et al. 2009). If an asymmetric structure had [95% sequence identity with the bound structure and covered [95% of the sequence alignment, the asymmetric structure was regarded as a candidate of the unbound structure. If there were multiple unbound structures for a subunit protein, the one with the high resolution was selected as the representative. This yielded a total of 212 homo-oligomeric protein complexes with C n symmetry (http://huanglab.phys.hust. edu.cn/SDBenchmark/) (Yan and Huang 2019). All the structures in the benchmark have their original coordinates without any random rotation.

Evaluation criteria
The quality of a predicted binding mode was measured by the ligand RMSD (L RMSD ). Here, the RMSD was calculated based on the backbone atoms of the ligand subunit after optimal superimposition of the receptor subunit and the native structure. The docking performance was evaluated by the success rate, i.e., the fraction of the targets with at least one hit in the test set when a certain number of top predictions were considered. Here, a hit is a prediction with a ligand RMSD of \10 Å (Huang 2014).