Information Content of Sets of Biological Sequences Revisited

  • Alessandra Carbone
  • Stefan Engelen
Part of the Natural Computing Series book series (NCS)


To analyze the information contained in a pool of amino acid sequences, a first approach is to align the sequences, estimate the probability that each amino acid occurs in each column of the alignment, and combine these values through an “entropy” function whose minimum corresponds to the absence of information, that is, to the case where every amino acid is equally likely to occur. An alternative is to construct a distance tree of the sequences (derived from the alignment) based on sequence similarity, and to interpret the tree topology so as to model the evolutionary property of residue conservation. We introduce the concept of “evolutionary content” of a tree of sequences, and demonstrate to what extent the more classical notion of “information content” of sequences approximates the new measure and in what manner tree topology contributes sharper information for the detection of protein binding sites.
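The first approach can be illustrated with a minimal sketch. It assumes one common Shannon-based form, log2(20) minus the column entropy, which is zero exactly when all 20 amino acids are equally likely; this is not necessarily the exact function used in the chapter. The names `column_information`, `alignment_information` and `toy_alignment` are hypothetical, gaps are simply ignored, and no pseudocounts or sequence weighting are applied.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_information(column):
    """Information content (in bits) of one alignment column.

    Assumed form: log2(20) - H(column), where H is the Shannon entropy
    of the observed residue frequencies. Gap characters are ignored.
    """
    residues = [c for c in column if c in AMINO_ACIDS]
    if not residues:
        return 0.0  # a column of gaps carries no information here
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(len(AMINO_ACIDS)) - entropy


def alignment_information(sequences):
    """Per-column information content for equal-length aligned sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_information(col) for col in zip(*sequences)]


# Toy alignment (made up for illustration, not data from the chapter).
toy_alignment = ["MKVA", "MRLA", "MKIG", "MHV-"]
scores = alignment_information(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: {score:.3f} bits")
```

Under this assumed form, a fully conserved column scores log2(20) bits and a column in which all 20 residues appear equally often scores 0, matching the statement that the minimum corresponds to absence of information. The tree-based "evolutionary content" is not defined in the abstract, so it is not sketched here.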


Keywords: Information Content; Entropic Function; Distance Tree; Protein Interface; Residue Position





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  1. Génomique Analytique, Université Pierre et Marie Curie, Paris, France
