On the Entropy of Protein Families

Abstract

Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, and transport. The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including the Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multiple sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the mutation of one particular amino acid at a specific site, and relate this notion to the escape probability of HIV. The case of lattice proteins, for which the entropy can be computed exactly, provides another illustration of the concept of cost, here arising from the competition between different folds. We also stress the relevance of the entropy to directed evolution experiments.
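
As a rough illustration of the simplest entropy estimate and of the entropic cost described in the abstract, the sketch below computes the entropy of an independent-site model (1-point statistics only) from a toy alignment, together with the entropy lost when one site is fixed to a single amino acid. The toy alignment, the function names, the 21-symbol alphabet, and the optional pseudocount are all assumptions made for illustration. The models discussed in the paper also reproduce 2-point statistics, which this sketch ignores, so it should be read as an upper-bound-style approximation rather than the authors' method.

```python
import math
from collections import Counter

# Assumed alphabet: 20 amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def site_frequencies(column, pseudocount=0.0):
    """Frequency of each alphabet symbol in one alignment column."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(ALPHABET)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in ALPHABET}

def site_entropy(column, pseudocount=0.0):
    """Shannon entropy (in nats) of one alignment column."""
    freqs = site_frequencies(column, pseudocount)
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)

def independent_entropy(alignment, pseudocount=0.0):
    """Entropy of the independent-site model: the sum of site entropies."""
    columns = zip(*alignment)  # iterate over alignment columns
    return sum(site_entropy(col, pseudocount) for col in columns)

def entropic_cost_of_fixing(alignment, site, pseudocount=0.0):
    """Entropy lost when `site` is constrained to a single amino acid.

    In the independent-site model this is simply the entropy of that site;
    with pairwise couplings the cost would also depend on the other sites.
    """
    column = [seq[site] for seq in alignment]
    return site_entropy(column, pseudocount)

# Hypothetical toy alignment: 4 sequences of length 3.
toy_alignment = ["ACD", "ACE", "ACD", "GCE"]

if __name__ == "__main__":
    print("independent-site entropy:", independent_entropy(toy_alignment))
    print("cost of fixing site 2:", entropic_cost_of_fixing(toy_alignment, 2))
```

Symbols outside the assumed alphabet are silently ignored here, so a real alignment would need to be cleaned first or the alphabet extended.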

Acknowledgments

S.C., H.J. and R.M. were partly funded by the Agence Nationale de la Recherche Coevstat project (ANR-13-BS04-0012-01).

Author information

Corresponding author

Correspondence to Simona Cocco.

About this article

Cite this article

Barton, J.P., Chakraborty, A.K., Cocco, S. et al. On the Entropy of Protein Families. J Stat Phys 162, 1267–1293 (2016). https://doi.org/10.1007/s10955-015-1441-4

  • DOI: https://doi.org/10.1007/s10955-015-1441-4

Keywords

  • Statistical inference
  • Entropy
  • Fitness landscape
  • Genomics
  • Hidden Markov models
  • Covariation
  • HIV virus