Journal of Statistical Physics, Volume 162, Issue 5, pp 1267–1293

On the Entropy of Protein Families

  • John P. Barton
  • Arup K. Chakraborty
  • Simona Cocco (corresponding author)
  • Hugo Jacquin
  • Rémi Monasson


Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families using different approaches, including the Hidden Markov Models used for protein databases and inferred statistical models that reproduce the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the mutation of one particular amino acid at a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, provides another illustration of the concept of cost, here arising from the competition between different folds. Finally, we stress the relevance of the entropy to directed evolution experiments.
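To make the notions of family entropy and entropic cost concrete, the following is a minimal sketch under a strong simplifying assumption: sites are treated as independent, so the entropy of the family is the sum of the single-site Shannon entropies of the alignment columns. This is not the Hidden Markov Model or 1- and 2-point model described above, which account for more structure. The function names (`site_entropies`, `independent_entropy`, `entropic_cost`) and the toy alignment are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (in nats) of each column of a list of equal-length sequences."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        s = -sum((c / n_seqs) * math.log(c / n_seqs) for c in counts.values())
        entropies.append(s)
    return entropies


def independent_entropy(alignment):
    """Entropy of the family under the (assumed) independent-site model."""
    return sum(site_entropies(alignment))


def entropic_cost(alignment, site, residue):
    """Loss in entropy when `site` is constrained to carry `residue`.

    Assumption: the constrained entropy is re-estimated with the same
    independent-site formula on the sub-alignment that satisfies the constraint.
    """
    constrained = [seq for seq in alignment if seq[site] == residue]
    if not constrained:
        raise ValueError("no sequence in the alignment satisfies the constraint")
    return independent_entropy(alignment) - independent_entropy(constrained)


# Toy alignment (hypothetical data): two sites, each with two equiprobable residues.
toy_alignment = ["AC", "AD", "GC", "GD"]
total_entropy = independent_entropy(toy_alignment)   # 2 * ln 2
cost_fix_A = entropic_cost(toy_alignment, 0, "A")    # ln 2
```

In this toy case, fixing the first site removes exactly that site's entropy, ln 2, because the two columns are uncorrelated. For real alignments, correlations between sites and finite sampling would make this independent-site estimate only a rough first approximation.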


Keywords: Statistical inference, Entropy, Fitness landscape, Genomics, Hidden Markov models, Covariation, HIV virus



S.C., H.J. and R.M. were partly funded by the Agence Nationale de la Recherche Coevstat project (ANR-13-BS04-0012-01).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • John P. Barton (1, 2, 3, 6)
  • Arup K. Chakraborty (1, 2, 3, 4, 5, 6)
  • Simona Cocco (7), corresponding author
  • Hugo Jacquin (7)
  • Rémi Monasson (8)

  1. Ragon Institute of MGH, MIT & Harvard, Cambridge, USA
  2. Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, USA
  3. Department of Physics, Massachusetts Institute of Technology, Cambridge, USA
  4. Department of Chemistry, Massachusetts Institute of Technology, Cambridge, USA
  5. Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, USA
  6. Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, USA
  7. Laboratoire de Physique Statistique de l’ENS, UMR 8550, associé au CNRS et à l’Université P&M. Curie, Paris, France
  8. Laboratoire de Physique Théorique de l’ENS, UMR 8549, associé au CNRS et à l’Université P&M. Curie, Paris, France
