Coevolution signals capture the specific packing of secondary structures in protein architecture

Decoding protein structural characteristics and formation is essential to the understanding of its function. Despite the diversity and enormity of protein universe, the dissection of known protein structures has identified regular secondary structure elements (SSEs), namely, α-helix and β-strand, that constitute the basic structural elements of protein universe (Levitt and Chothia, 1976). As a remarkable characteristics of the overall topology of a polypeptide, the specific spatial arrangement of these SSEs is a hallmark of the protein tertiary conformation (Murzin et al., 1995). Unraveling the rules that govern the specific packing of the SSEs is critical to understanding the folding and formation of the tertiary structure of a protein and thus its function. Previously , numerous studies have attempted to identify the factors that determine the stability and specificity of SSE associations through analysis of super-secondary structure model systems like β-sheets and coiled-coils. Using designed β-hairpins or other β-sheet model systems, researchers have identified many factors that can significantly change the folding rate, stability or strand registry of β-sheets, such as charge interactions (Smith and Regan, 1995), beta propensity (Phillips et al., 2005) and β-sheet surface complementarity (Liang et al., 2008). The coiled-coil motif is another model system for understanding of molecular mechanisms that govern the super-secondary structure formation. Through NMR study of mutated GCN4p (Matousek et al., 2007) and design of stable and specific coiled-coils (Woolfson, 2005; Oakley and Kim, 1998), researchers have suggested the importance of elec-trostatic and polar interactions in determining the coiled-coil packing stability and specificity. In addition, Kennan and Ryan showed that changing the length of residue side chains could dramatically influence the stability of coiled-coils (Ryan and Kennan, 2007). Despite the extensive attention on the factors that determine the specific packing of SSEs in certain forms, most of the above discoveries were made in simple soluble model systems but not natural proteins. Thus so far a global view of the effect of these factors on the protein topology still remains unclear. In this study, we would like to provide an evolutionary perspective into the roles of physiochemical properties in spatial arrangement of SSEs in protein universe. Based on a large ortholog group dataset, we systematically analyzed the coevolution signals of a variety of physiochemical properties for association of SSEs in protein evolution. To analyze the extent of coevolution of two SSEs with regard to different physiochemical properties during protein evolution, we designed an analysis framework, …

identity above 90%, only one of them was considered. UniqueProt (Mika and Rost, 1 2003) with the HSSP threshold of 70 was used to remove orthologs of more than 2 ~90% sequence identity to the target sequence in each ortholog group. After the above 3 filtering process, 562 groups with more than 50 ortholog sequences were obtained for 4 coevolution analysis. 5 β-sheet dataset were taken from Cheng and Baldi dataset (Cheng and Baldi, 6 2005a) . For each protein in the dataset, its homolog sequences were obtained by 7 using PSI- BLAST (Altschul et al., 2010) to search the NCBI non-redundant database 8 (NR). The NR database was filtered by the "coils" program to exclude coiled-coil 9 proteins (Lupas et al., 1991). In running PSI-BLAST, E-values were set to 10E-10 in 10 the three iterations of PSI-BLAST, and the E-value of 0.001 was used to generate 11 position-specific profile. The profile was then used to blast against the original 12 non-redundant database without filtering coiled-coil to generate homolog sequences 13 using E-value cutoff 0.001. The homolog sequences with identity lower than 30% or 14 larger than 90% to the query and those with aligned region less than 80% of the query 15 were discarded. The remaining homolog sequences were further filtered by 16 UniqueProt with HSSP value of 70 and aligned using MUSCLE (Edgar, 2004). 17 18

Secondary structures of ortholog sequence 19
For each of the 562 ortholog groups extracted from OMA dataset (see above), the 20 structures of ortholog sequences were constructed using homology modeling based on 21 the PDB-target of known structure (defined above) as template. Each ortholog 22 sequence was aligned to the template using the sequence-structure alignment tool 23 FUGUE (Shi et al., 2001). Homology-based structure modeling was performed using 24 MODELLER (Eswar et al., 2001), and the best generated model was used as the 25 predicted structure for the homolog sequence. Then the predicted structure was 26 analyzed with DSSP (Kabsch and Sander, 1983) to produce secondary structure 27 information. For simplicity, only three secondary states (helix, strand and loop) were 28 considered. However, the sharp turns in protein structures were identified to optimize 1 the results of DSSP, because sharp turns often have geometries that are 2 conformationally similar to a turn of ideal helical models and are easily identified as 3 helix state by DSSP. Finally, A fragment of ≥ 7 amino acids with helix state was 4 assigned as an α-helix SSE, a fragment of ≥ 4 amino acids with strand state was 5 assigned as a β-strand SSE and a fragment of ≥ 3 amino acids with loop state was 6 assigned as loop SSE. 7 For the homolog sequences obtained for the Baldi dataset, their β-strands were 8 assigned as the aligned region to the β-strand of the proteins in the Baldi dataset in 9 multiple sequence alignment. 10 11

Definition of SSE pairs in contact 12
A crucial step in our analysis is the identification of SSE pairs in contact. A pair 13 of SSEs was considered as the candidates if two or more residues on each side were in 14 contact. The residues were considered to be in contact if their Cα−Cα distance was 15 below 8 Å. Only SSE pairs with a residue contact ratio (number of residues in contact 16 divided by the total number of residues in the pair) ≥ 20% were selected for this study. 17 In this work, totally 29014 SSE pairs in contact were derive d the investigated OMA 18 dataset, which included 2360 helix-helix packing, 3120 strand-strand packing, 4819 19 loop-loop packing, 3222 helix-strand packing, 8429 helix-loop packing and 7064 20 strand-loop packing. 21 22

Calculation of coevolution scores of different physicoche mical properties 23 between SSEs in contact 24
For an SSE in the PDB-target sequence (SSE target ), its longest aligned SSE in a 25 homolog sequence is called a homolog SSE (SSE homolog ). The homolog SSEs 26 (SSE homolog ) of all its homolog sequences form a homolog SSE group. When 27 analyzing the evolution of a physiochemical property in a homolog SSE group, a 28 distance vector representing the distances of all SSE homolog to SSE target with regard to a 1 physiochemical property was computed. Six general physiochemical properties are 2 considered in our analysis: volume, hydrophobicity, charge, betapropensity, 3 amphiphilicity and polarity, which are described briefly in the following: 4 The average volume of a residue represents the average actual space that the 5 residue occupies in protein structures, which was taken from (Pontius et al., 1996) . 6 The polarity index of a residue measures the extent of separation of electric charge in 7 the residue, which was taken from (Grantham, 1974). The hydrophilicity scale of a 8 residue represents the tendency of the residue to interact with water, which was taken 9 from (Kuhn et al., 1995). The amphiphilicity index of a residue denotes the preference 10 of the residue to be in the interface between polar and non-polar environments, which 11 was taken from (Mitaku et al., 2002). The beta propensity of a residue indicates its 12 preference to appear in β-strands, which was taken from Lifson's work (Lifson and 13 Sander, 1979). The isoelectric point of amino acid characterizes the charge of a 14 residue, which was taken from (Zimmerman et al., 1968). 15 The distance vector contains a vector of property difference between the 16 SSE homolog and SSE target , (A i,target ) , which is the average difference between aligned 17 residues measured by a particular physiochemical property in all non-gap positions of 18 The coevolution signals of packing SSEs is only meaningful with the correct 2 construction of a null model, which is used as background signal for evaluating the 3 significance and robustness of the observed signals. According to our methodology, 4 the coevolution score between two SSEs in contact was calculated as the correlation 5 coefficient between the two distance vectors (vector A and vector B in Fig. S1). The 6 distance vector was consisted of differed values of the average physicochemical 7 property between the homolog SSEs and target SSEs, and the correlation coefficient 8 between the distance vectors was used to measure the strength of coordinate changes. 9 Could the coordinated changes still exist even in random SSE associations? 10 Therefore, we constructed the null model by randomly shuffling one of the distance 11 vector and then calculated the distribution of correlation coefficient between the 12 shuffled distance vectors as the null distribution. pairing probability using various sequence and evolutionary features as input. Then 20 the residue interaction probability was used to define a pseudoenergy between any 21 two strands, which is the highest sum of residue pairing probabilities among all 22 possible alignments of the two strands. Finally the pseudoenergy was used in a greedy 23 selection algorithm to predict β-strand pairing by searching for the topology that has 24 the highest sum of pseudoenergy between all predicted pairing strands. 25 We integrated coevolution score and pseudoenergy from Betapro by a weighted 26 sum of Betapro pseudoenergy and contact-vs-noncontact likelihood ratio of 27 coevolution score measured by each property. For each property, the contact-vs-28 noncontact likelihood ratio of each pair of SSEs was calculated as follows: 1 Coevolution scores of all pairs of strands for each property were binned with 2 binning size 0.1(See Fig. S2 for the details of distribution for contact SSE pairs and 3 non-contact SSE pairs in each bin). For each bin B, its contact-vs-noncontact 4 likelihood ratio C is: 5 Where N contact,B is the number of contacting pairs in bin B, N contact is the number 8 of all contact pairs. Thus, P(B |contact) is the frequency of contact pairs in bin B. 9 Similarly, P(B |noncontact) represents the frequency of noncontact pairs in bin B. 10 After integration, the new pseudoenergy function of a strand pair can be 11 expressed as: 12 Where betapro E is the pseudoenergy calculated by Betapro. i C is the likelihood ratio of 14 coevolution feature i. i W is the weighting parameter of coevolution feature i. The new 15 pseudoenergy E is used to predict β-sheet pairing using Kruskal's algorithm to find 16 maximum spanning tree (MST) of strands. The tree must satisfy the same three 17 constraints as in Betapro: i) a β-residue can be partner with at most two other 18 residues; ii) two strands pairing with the same side of the third strand cannot overlap, 19 and iii) all strands must have at least one and at most three strand partners. 20 21

Predicting β-sheets by cross validation 22
The proteins in Baldi dataset (Cheng and Baldi, 2005b) that have over 200 23 homolog sequences were considered in prediction. There are 286 proteins having 24 more than 200 homolog sequences. This dataset was randomly spited into 10 folds. In 25 the 10-fold cross validation, 9 folds are used for training and 1 fold for testing. 26 The predictive power on test data is described by precision: P=TP/(TP+FP), recall: 27 R=TP/(TP+FN) and F1=2PR/(P+R). The adjustable parameters ( i W ) in the 1 pseudoenergy function were optimized using genetic algorithm by maximizing the 2 prediction F1 on the training data. In this study, for a deep understanding of the molecular mechanisms governing 7 packing of secondary structure elements (SSEs) in protein structure formation, we 8 carried out a systematic analysis of coevolution of six general physiochemical 9 properties between SSEs within known protein structures. The systematic analyses 10 have revealed significantly strong coevolution signals of charge, volume, 11 amphiphilicity and betapropensity between associated SSEs, suggesting the general 12 roles of these properties in constraining the specific packing of SSEs in protein 13 structure. Early studies found that the volume of hydrophobic cores of proteins are largely 24 constant (Gerstein et al., 1994) , and the compensatory coevolution of residue volume 25 has been hypothesized to play an important role in maintaining the compactness of 26 protein hydrophobic core. Surprisingly, the observed coevolution signal of volume 27 property in our study present positively correlation, rather than compensatory 28 coevolution. This is consistent with a previous study (Williams and Lovell, 2009), 1 which showed that compensatory mutation of volume is not as frequent as previously 2 thought. In their study, they did not observe many compensatory changes of volume 3 in neighboring residues; instead, they found positive correlation between residue 4 volume change and the contact distance change for neighboring residues. Another 5 study also observed a positive correlation between distance of two β-strands in the 6 formation of β-sheet and the sum of the volume of residues at their interface based on 7 a large-scale analysis of β-sheets with known structures (Nagarajaram et al., 1999), 8 suggesting that the volume change of neighboring β-strands could result in the change 9 of the distance between them. Therefore, the observed positively correlated 10 coevolution of volume property between associated SSEs may reflect a spatial 11 flexibility in packing of these SSEs in protein structures. 12 Unlike the charge and volume properties, the coevolution signals of other 13 physiochemical properties differs significantly in packing of different secondary 14 structure types. This phenomenon actually reflects the heterogeneity of SSE packing, 15 and the packing heterogeneity has been explored from many perspectives, such as 16 geometric orientation (Chothia et al., 1977;Chothia and Janin, 1981;Chothia et al., 17 1981), packing density (Kurochkina and Privalov, 1998)  conformation of the isolated secondary structure (Chothia, 1984). Therefore, the 28 diverse properties such as residue composition and packing interfaces of secondary 1 structure types, combining with the constraints of thermal stability, may result in the 2 different coevolution signals underlying different types of associated SSEs. 3 Previously, many studies have analyzed the coevolution of residue pairs that are 4 in contact in protein structures (Caporaso et al., 2008;Dunn et al., 2008;Yip et al., 5 2008). It has been shown that the constraints of residue interaction can lead to 6 coevolution of structurally neighboring residues (Altschuh et al., 1988;Gobel et al., 7 1994). However, the strength of coevolution at residue level is not so evident. For 8 example, in analysis of coevolution of residues between contact β-strands, Gregoret 9 and Gutfreund did not observe any stronger conservation nor coevolution between 10 pairing residues on neighboring strands than between non-contact residues on the 11 same strand (Mandel-Gutfreund et al., 2001). This result coincides with Parisien et 12 al's finding that the contact propensities between inter-strand residues are relatively 13 trivial in the prediction of β-sheet topology (Parisien and Major, 2007). The 14 coevolution of physiochemical properties at residue level has also been investigated. Particularly, it was found that residues tend to co-evolve in groups or between 23 neighboring residues that are not necessarily in direct contact (Hamilton et al., 2004;24 Dutheil and Galtier, 2007;Madaoui and Guerois, 2008;Xu and Tillier, 2010). 25 Therefore, the previous analyses could underestimate the coevolution signal between 26 tightly packed secondary structures. In our analysis, we considered the whole 27 secondary structures as an entity, which can take into account the group of residues 28 that co-evolve. Indeed, we observed strong signals of coevolution of SSEs in contact 1 with regard to a variety of physiochemical properties during protein structure 2 evolution. This demonstrates the advantage of our work in analysis of coevolution 3 within protein structures. Although analysis of whole secondary structure showed 4 significant coevolutionary signal within protein structure, there is much room for 5 further improvement. First, a secondary structure can interact with more than one 6 secondary structure in a protein structure, which could weaken the signals of 7 coevolution due to ignoring the contributions from other interacting se condary 8 structures. Second, some secondary structures make contact with each other by using 9 only a small portion of residues, hence calculating the coevolution using whole 10 sequence of secondary structures can bring in noises imposed by residues that are not 11 related to packing. Third, the method we used here can be further improved by 12 correcting for the phylogenetic noise and using distance matrix between all pairs of 13 homolog SSEs to enlarge data set. 14 15 16

AMINO-ACID CHANGES IN HOMOLOGOUS PROTEIN FAMILIES. Protein 19
Engineering 2, 193-199. 20 Altschul Chothia, C., Levitt, M., and Richardson, D. (1981). Helix to helix packing in proteins. 4 Journal of molecular biology 145, 215-250. 5 DeLong, E.R., DeLong, D.M., and C larke-Pearson, D.L. (1988). Comparing the areas 6 under two or more correlated receiver operating characteristic curves: a nonparametric 7 approach. Biometrics, 837-845. 8 Dunn, S.D., Wahl, L.M., and Gloor, G.B. (2008). Mutual information without the 9 influence of phylogeny or entropy dramatically improves residue contact prediction. 10 Bioinformatics 24, 333-340.   Two secondary structure elements SSE A and SSE B are highlighted in both the 5 structure and multiple sequence alignment. The multiple sequence alignment of the 6 two SSEs is used to generate distance vector against target SSEs of known protein 7 structure. For example, in SSE A, distances between ortholog SSE(i) and target 8 SSE(target) are calculated as the average difference between residues in all non-gap 9 positions measured by particular physiochemical property. Then the Spearman's rank correlation coefficient of the two distance vectors is used as the measure of 1 coevolution between SSE A and SSE B. curves between the two models: the solid line represents the ROC curve for the model 7 which integrates coevolution signals with the pseudo-energy function developed by 8 Betapro, and the dash line represent the ROC curve for the original BetaPro predictor. 9 The AUC (Area Under The Curve) for the integrated model (BetaPro + coevolution 10 information) and the original model (BetaPro) is 0.882 and 0.864 respectively, which 11 shows significant difference according to Delong's test (DeLong et al., 1988)