Predicting protein crystallization propensity from protein sequence

Journal of Structural and Functional Genomics

Abstract

The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein’s propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for ~720 unique proteins that resulted in X-ray structures. The correlation of the protein’s iso-electric point (pI) and grand average hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein’s propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have also created applications that analyze query sequences, provide optimal boundary information for them, and visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor.
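As an illustration of a sequence-derived property of the kind described above, the following minimal sketch computes GRAVY as the mean per-residue hydropathy of a sequence. It assumes the conventional Kyte-Doolittle hydropathy scale; the abstract does not state which scale or implementation the authors used, and the function name `gravy` and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values per residue (assumed scale; the abstract
# does not specify which scale the authors used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average hydropathy (mean hydropathy per residue)."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")
    # Reject anything outside the 20 standard amino acids rather than guess.
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical query sequence, used only to show the call.
    print(gravy("MKTAYIAKQR"))
```

Positive values indicate a more hydrophobic sequence on average and negative values a more hydrophilic one. A value like this would be one input feature among several, not a crystallizability prediction on its own.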

Abbreviations

GRAVY: Grand average hydropathy

MCSG: Midwest Center for Structural Genomics

pI: Iso-electric point

PSI: Protein Structure Initiative

SVM: Support vector machine

Acknowledgments

This work was supported by the National Institutes of Health grant GM074942 and by the U.S. Department of Energy, Office of Biological and Environmental Research, under contract DE-AC02-06CH11357.

Author information

Corresponding authors

Correspondence to György Babnigg or Andrzej Joachimiak.

Additional information

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(DOCX 257 kb)

About this article

Cite this article

Babnigg, G., Joachimiak, A. Predicting protein crystallization propensity from protein sequence. J Struct Funct Genomics 11, 71–80 (2010). https://doi.org/10.1007/s10969-010-9080-0
