Abstract
Knowledge of protein flexibility is vital for deciphering the corresponding functional mechanisms. This knowledge would help, for instance, in improving computational drug design and refinement in homology-based modeling. We propose a new predictor of residue flexibility, expressed as B-factors, from protein chains; it uses local (in the chain) predicted or native relative solvent accessibility (RSA) and custom-derived amino acid (AA) alphabets. Our predictor is implemented as a two-stage linear regression model whose inputs are an RSA-based space in a local sequence window in the first stage and a reduced AA pair-based space in the second stage. The method is easy to comprehend because of its explicit linear form in both stages. Particle swarm optimization was used to find an optimal reduced AA alphabet that simplifies the input space and improves the prediction performance. The average correlation coefficients between the native and predicted B-factors, measured on a large benchmark dataset, improve from 0.65 to 0.67 when using the native RSA values and from 0.55 to 0.57 when using the predicted RSA values. Blind tests performed on two independent datasets show consistent, modest improvements of 0.02 in the average correlation coefficients for both native and predicted RSA-based predictions.
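The two-stage linear form described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the published model: the three-group alphabet `GROUPS`, the window half-width, the choice to pass the stage-one prediction into stage two alongside one-hot reduced-alphabet pair features, and the synthetic data. The particle swarm search for the alphabet is not shown.

```python
import numpy as np

# Hypothetical 3-group reduced alphabet (an assumption, not the optimized grouping).
GROUPS = {aa: g for g, block in enumerate(["AVLIMFWC", "STNQGPYH", "DEKR"]) for aa in block}

def window_features(rsa, half_width=2):
    """Stack the RSA values of a sliding window centred on each residue."""
    rsa = np.asarray(rsa, dtype=float)
    padded = np.pad(rsa, half_width, mode="edge")  # repeat the end values at the termini
    return np.column_stack([padded[i:i + len(rsa)] for i in range(2 * half_width + 1)])

def pair_features(seq, groups):
    """One-hot encode the reduced-alphabet pair (residue i, residue i+1)."""
    k = max(groups.values()) + 1
    ids = [groups[a] for a in seq]
    X = np.zeros((len(seq), k * k))
    for i in range(len(seq)):
        j = min(i + 1, len(seq) - 1)  # the last residue is paired with itself
        X[i, ids[i] * k + ids[j]] = 1.0
    return X

def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(X, coef):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic chain: B-factors depend on RSA and on the residue group, plus noise.
rng = np.random.default_rng(0)
n = 300
seq = "".join(rng.choice(list(GROUPS), size=n))
rsa = rng.random(n)
group_ids = np.array([GROUPS[a] for a in seq])
bfactor = 1.5 * rsa + 0.3 * group_ids + rng.normal(0.0, 0.3, n)

# Stage 1: linear regression on the local RSA window.
X1 = window_features(rsa)
y1 = predict_linear(X1, fit_linear(X1, bfactor))

# Stage 2: linear regression on the stage-1 output plus reduced AA pair features.
X2 = np.column_stack([y1, pair_features(seq, GROUPS)])
y2 = predict_linear(X2, fit_linear(X2, bfactor))

r1, r2 = pearson(bfactor, y1), pearson(bfactor, y2)
print(f"stage 1 r = {r1:.3f}, stage 2 r = {r2:.3f}")
```

The correlations printed here are in-sample values on toy data and say nothing about the reported benchmark results; a real evaluation would fit on training chains and score held-out chains.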
Acknowledgments
This work was supported by the National Natural Science Foundation of China (grant no. 61170099) and the Zhejiang Provincial Natural Science Foundation of China (grant no. Y1110840) to H.Z., and by a Discovery grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada to L.K.
Conflict of interest
The authors declare that they have no competing financial interests.
Cite this article
Zhang, H., Kurgan, L. Improved prediction of residue flexibility by embedding optimized amino acid grouping into RSA-based linear models. Amino Acids 46, 2665–2680 (2014). https://doi.org/10.1007/s00726-014-1817-9
DOI: https://doi.org/10.1007/s00726-014-1817-9