Abstract
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein’s propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for ~720 unique proteins that resulted in X-ray structures. The correlation of the protein’s iso-electric point (pI) and grand average hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein’s propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have also created applications that analyze query sequences, report optimal boundary information for them, and visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor.
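As an illustration of a property that can be derived from sequence alone, the sketch below computes GRAVY as the mean per-residue hydropathy. It assumes the Kyte-Doolittle hydropathy scale, which is the conventional basis for GRAVY but is not named in the abstract; the function name `gravy` and the example sequence are hypothetical and are not taken from the authors' tools.

```python
# Minimal sketch: GRAVY (grand average hydropathy) of a protein sequence.
# Assumption: the Kyte-Doolittle hydropathy scale is used; the abstract
# does not state which scale the authors applied.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean hydropathy over all residues of the sequence.

    Raises ValueError for an empty sequence or for any character that is
    not one of the 20 standard amino acid one-letter codes.
    """
    residues = sequence.strip().upper()
    if not residues:
        raise ValueError("sequence is empty")
    unknown = sorted(set(residues) - set(KYTE_DOOLITTLE))
    if unknown:
        raise ValueError(f"unsupported residue codes: {unknown}")
    total = sum(KYTE_DOOLITTLE[aa] for aa in residues)
    return total / len(residues)


if __name__ == "__main__":
    # Hypothetical example sequence, used only to show the call.
    print(f"GRAVY = {gravy('MKTAYIAKQR'):.3f}")
```

Positive values indicate a more hydrophobic sequence on average and negative values a more hydrophilic one. A value like this, paired with the pI, would form one point in the two-property space the abstract describes.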
Abbreviations
- GRAVY: Grand average hydropathy
- MCSG: Midwest Center for Structural Genomics
- pI: Iso-electric point
- PSI: Protein Structure Initiative
- SVM: Support vector machine
Acknowledgments
This work was supported by the National Institutes of Health grant GM074942 and by the U.S. Department of Energy, Office of Biological and Environmental Research, under contract DE-AC02-06CH11357.
Additional information
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Cite this article
Babnigg, G., Joachimiak, A. Predicting protein crystallization propensity from protein sequence. J Struct Funct Genomics 11, 71–80 (2010). https://doi.org/10.1007/s10969-010-9080-0
DOI: https://doi.org/10.1007/s10969-010-9080-0