The Inverse Protein Folding Problem: Protein Design and Structure Prediction in the Genomic Era

  • Marcel Schmidt am Busch
  • Anne Lopes
  • David Mignon
  • Thomas Gaillard
  • Thomas Simonson


Millions of proteins are being identified every year by high-throughput genome sequencing projects. Many others can potentially be created by protein engineering and design methods. Here, we review a method for computational protein design (CPD) that starts from a known protein and its 3D structure and seeks to modify it by mutating some or all of the amino acid sidechains. The mutations are selected to provide stability and possibly other properties, such as ligand binding. For each set of candidate mutations, the 3D structure is modeled under an assumption of small, localized perturbations; in particular, we assume that the backbone conformation does not change significantly. As in other CPD implementations, the structure is modeled using a classical molecular mechanics approach along with a simple, implicit description of the solvent. Some of the calculations have been distributed to volunteers on the Internet through our Proteins@Home volunteer computing project. We describe the method and selected results, which show that the designed sequences share important properties of natural proteins.
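
To make the fixed-backbone idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each designable position can take one of a few (amino acid type, rotamer) states and that the energy splits into one-body and pairwise terms. The energy tables here are random placeholders rather than molecular mechanics and implicit solvent energies, and the Metropolis Monte Carlo search is an assumed choice, since the abstract does not state which search algorithm is used. All names (`make_toy_problem`, `monte_carlo_design`, `kT`) are hypothetical.

```python
import math
import random


def make_toy_problem(n_positions=4, seed=1):
    """Build placeholder energy tables (random numbers, NOT a real force field)."""
    rng = random.Random(seed)
    types = ["ALA", "LEU", "SER"]   # hypothetical allowed amino acid types
    rotamers = [0, 1]               # hypothetical rotamer indices per type
    # Each position may adopt any (type, rotamer) state; the backbone is fixed.
    choices = [[(t, r) for t in types for r in rotamers]
               for _ in range(n_positions)]
    self_e = [{c: rng.uniform(-1.0, 1.0) for c in choices[i]}
              for i in range(n_positions)]
    pair_e = {(i, j): {(a, b): rng.uniform(-1.0, 1.0)
                       for a in choices[i] for b in choices[j]}
              for i in range(n_positions)
              for j in range(i + 1, n_positions)}
    return choices, self_e, pair_e


def total_energy(state, self_e, pair_e):
    """Sum the one-body terms and all pairwise terms for a full assignment."""
    n = len(state)
    energy = sum(self_e[i][state[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][(state[i], state[j])]
    return energy


def monte_carlo_design(choices, self_e, pair_e, steps=2000, kT=0.6, seed=0):
    """Metropolis search over sequences and rotamers on a fixed backbone."""
    rng = random.Random(seed)
    state = [rng.choice(c) for c in choices]
    energy = total_energy(state, self_e, pair_e)
    best_state, best_energy = list(state), energy
    for _ in range(steps):
        i = rng.randrange(len(choices))
        old = state[i]
        state[i] = rng.choice(choices[i])      # mutate type and/or rotamer
        new_energy = total_energy(state, self_e, pair_e)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            energy = new_energy                # accept the move
            if energy < best_energy:
                best_state, best_energy = list(state), energy
        else:
            state[i] = old                     # reject the move
    return best_state, best_energy


if __name__ == "__main__":
    choices, self_e, pair_e = make_toy_problem()
    best_state, best_energy = monte_carlo_design(choices, self_e, pair_e)
    print("designed sequence:", [aa for aa, _ in best_state])
    print("toy energy:", round(best_energy, 3))
```

In a real calculation the tables would be precomputed from the 3D structure and a rotamer library; recomputing the full energy at every step, as done here for clarity, would be replaced by an incremental update.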


Structure Prediction; Amino Acid Type; Empirical Correction; Compute Sequence; Rotamer Library
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



We thank the many volunteers who have participated in the Proteins@Home project and contributed computer cycles to this work. See for a complete list of participants. We thank the BOINC development community for testing the alpha version of Proteins@Home.



Copyright information

© Springer Science+Business Media Dordrecht 2012

Authors and Affiliations

  • Marcel Schmidt am Busch (1, 2)
  • Anne Lopes (1)
  • David Mignon (1)
  • Thomas Gaillard (1)
  • Thomas Simonson (1)

  1. Laboratoire de Biochimie (UMR CNRS 7654), Department of Biology, Ecole Polytechnique, Palaiseau, France
  2. Institut fuer theoretische Physik, Johannes Kepler Universitaet Linz, Linz, Austria