Application of information theory to feature selection in protein docking

  • Original Paper
  • Journal of Molecular Modeling

Abstract

In the era of structural genomics, the prediction of protein interactions using docking algorithms is an important goal. The success of this method critically relies on the identification of good docking solutions among a vast excess of false solutions. We have adapted the concept of mutual information (MI) from information theory to achieve a fast and quantitative screening of different structural features with respect to their ability to discriminate between physiological and nonphysiological protein interfaces. The strategy includes the discretization of each structural feature into distinct value ranges to optimize its mutual information. We have selected 11 structural features and two datasets to demonstrate that the MI is dimensionless and can be directly compared for diverse structural features and between datasets of different sizes. Conversion of the MI values into a simple scoring function revealed that those features with a higher MI are actually more powerful for the identification of good docking solutions. Thus, an MI-based approach allows the rapid screening of structural features with respect to their information content and should therefore be helpful for the design of improved scoring functions in future. In addition, the concept presented here may also be adapted to related areas that require feature selection for biomolecules or organic ligands.
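The screening idea can be illustrated with a minimal sketch: a continuous feature is discretized into value ranges, and the MI (in bits) between the resulting bins and the binary class label (physiological vs. nonphysiological) is estimated from counts. The function names, the toy data, and the single-threshold search are assumptions made for illustration only; the abstract does not specify how the value ranges are optimized.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate MI in bits between two equal-length sequences of discrete labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to avoid extra divisions
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi


def discretize(values, edges):
    """Map each value to a bin index; edges are ascending cut points."""
    return [sum(v >= e for e in edges) for v in values]


def best_single_cut(values, labels):
    """Hypothetical optimizer: try midpoints between sorted values, keep the best."""
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return max(
        candidates,
        key=lambda cut: mutual_information(discretize(values, [cut]), labels),
    )


# Toy data (invented): one feature value per interface, label 1 = physiological.
feature = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cut = best_single_cut(feature, labels)
mi_best = mutual_information(discretize(feature, [cut]), labels)
mi_coarse = mutual_information(discretize(feature, [0.25, 0.75]), labels)
print(f"best cut = {cut:.2f}, MI = {mi_best:.3f} bits; other binning: {mi_coarse:.3f} bits")
```

Because the MI is expressed in bits regardless of the feature's original units, values obtained for different features can be compared directly, which is the property the abstract relies on.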

References

  1. Lensink MF, Mendez R, Wodak SJ (2007) Docking and scoring protein complexes: CAPRI 3rd edition. Proteins 69:704–718

  2. Lensink MF, Wodak SJ (2010) Docking and scoring protein interactions: CAPRI 2009. Proteins 78:3073–3084

  3. Janin J (2010) Protein–protein docking tested in blind predictions: the CAPRI experiment. Mol Biosyst 6:2351–2362

  4. Janin J (2010) The targets of CAPRI rounds 13–19. Proteins 78:3067–3072

  5. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci USA 89:2195–2199

  6. Walls PH, Sternberg MJ (1992) New algorithm to model protein–protein recognition based on surface complementarity. Applications to antibody–antigen docking. J Mol Biol 228:277–297

  7. Jones S, Thornton JM (1996) Principles of protein–protein interactions. Proc Natl Acad Sci USA 93:13–20

  8. Meyer M, Wilson P, Schomburg D (1996) Hydrogen bonding and molecular surface shape complementarity as a basis for protein docking. J Mol Biol 264:199–210

  9. Ausiello G, Cesareni G, Helmer-Citterich M (1997) Escher: a new docking procedure applied to the reconstruction of protein tertiary structure. Proteins 28:556–567

  10. Vakser IA, Aflalo C (1994) Hydrophobic docking: a proposed enhancement to molecular recognition techniques. Proteins 20:320–329

  11. Gabb HA, Jackson RM, Sternberg MJ (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol 272:106–120

  12. Robert CH, Janin J (1998) A soft, mean-field potential derived from crystal contacts for predicting protein–protein interactions. J Mol Biol 283:1037–1047

  13. Moont G, Gabb HA, Sternberg MJ (1999) Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins 35:364–373

  14. Zhang C, Liu S, Zhou H, Zhou Y (2004) An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 13:400–411

  15. Pons C, Talavera D, de la Cruz X, Orozco M, Fernandez-Recio J (2011) Scoring by intermolecular pairwise propensities of exposed residues (SIPPER): a new efficient potential for protein–protein docking. J Chem Inf Model 51:370–377

  16. Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, Hoboken

  17. Douguet D, Chen HC, Tovchigrechko A, Vakser IA (2006) Dockground resource for studying protein–protein interfaces. Bioinformatics 22:2612–2618

  18. Gao Y, Douguet D, Tovchigrechko A, Vakser IA (2007) Dockground system of databases for protein recognition studies: unbound structures for docking. Proteins 69:845–851

  19. Liu S, Gao Y, Vakser IA (2008) Dockground protein–protein docking decoy set. Bioinformatics 24:2634–2635

  20. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637

  21. Fiorucci S, Zacharias M (2010) Prediction of protein–protein interaction sites using electrostatic desolvation profiles. Biophys J 98:1921–1930

  22. Aloy P, Russell RB (2002) Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci USA 99:5896–5901

  23. Ansari S, Helms V (2005) Statistical analysis of predominantly transient protein–protein interfaces. Proteins 61:344–355

  24. Melo F, Feytmans E (1997) Novel knowledge-based mean force potential at atomic level. J Mol Biol 267:207–222

  25. Melo F, Sanchez R, Sali A (2002) Statistical potentials for fold assessment. Protein Sci 11:430–448

  26. Launay G, Mendez R, Wodak S, Simonson T (2007) Recognizing protein–protein interfaces with empirical potentials and reduced amino acid alphabets. BMC Bioinforma 8:270

  27. Fiorucci S, Zacharias M (2010) Binding site prediction and improved scoring during flexible protein–protein docking with ATTRACT. Proteins 78:3131–3139

  28. Ohlson MB, Huang Z, Alto NM, Blanc MP, Dixon JE, Chai J, Miller SI (2008) Structure and function of Salmonella SifA indicate that its interactions with SKIP, SseJ, and RhoA family GTPases induce endosomal tubulation. Cell Host Microbe 4:434–446

  29. Diacovich L, Dumont A, Lafitte D, Soprano E, Guilhon AA, Bignon C, Gorvel JP, Bourne Y, Meresse S (2009) Interaction between the SifA virulence factor and its host target SKIP is essential for Salmonella pathogenesis. J Biol Chem 284:33151–33160

  30. Perkins JR, Diboun I, Dessailly BH, Lees JG, Orengo C (2010) Transient protein–protein interactions: structural, functional, and network properties. Structure 18:1233–1243

  31. Dey S, Pal A, Chakrabarti P, Janin J (2010) The subunit interfaces of weakly associated homodimeric proteins. J Mol Biol 398:146–160

  32. Gatenby RA, Frieden BR (2007) Information theory in living systems, methods, applications, and challenges. Bull Math Biol 69:635–657

  33. Kauffman C, Karypis G (2008) An analysis of information content present in protein–DNA interactions. Pac Symp Biocomput:477–488

  34. Sterner B, Singh R, Berger B (2007) Predicting and annotating catalytic residues: an information theoretic approach. J Comput Biol 14:1058–1073

  35. Magliery TJ, Regan L (2005) Sequence variation in ligand binding sites in proteins. BMC Bioinforma 6:240

  36. Kulharia M, Goody RS, Jackson RM (2008) Information theory-based scoring function for the structure-based prediction of protein–ligand binding affinity. J Chem Inf Model 48:1990–1998

  37. Wassermann AM, Nisius B, Vogt M, Bajorath J (2010) Identification of descriptors capturing compound class-specific features by mutual information analysis. J Chem Inf Model 50:1935–1940

  38. Cline MS, Karplus K, Lathrop RH, Smith TF, Rogers RG Jr, Haussler D (2002) Information-theoretic dissection of pairwise contact potentials. Proteins 49:7–14

  39. Shackelford G, Karplus K (2007) Contact prediction using mutual information and neural nets. Proteins 69(Suppl 8):159–164

  40. Miller CS, Eisenberg D (2008) Using inferred residue contacts to distinguish between correct and incorrect protein models. Bioinformatics 24:1575–1582

  41. Solis AD, Rackovsky S (2008) Information and discrimination in pairwise contact potentials. Proteins 71:1071–1087

Acknowledgments

The authors thank Kristin Kassler and Dr. Christophe Jardin for critically reading the manuscript. The project was funded within the DFG (Deutsche Forschungsgemeinschaft) priority program (SPP 1395) by grants to JH and HS.

Author information

Correspondence to Heinrich Sticht.

Appendix

Proof that I(X; y_j) is always non-negative

We will show that

$$ I(X; y_j) = \sum_{i=1}^{M_x} \Pr(x_i, y_j) \log_2 \frac{\Pr(x_i, y_j)}{\Pr(x_i)\Pr(y_j)} $$

is always greater than or equal to 0. For this purpose, the log sum inequality is quoted first [16]:

$$ \sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \geqslant \left( \sum_{i=1}^{n} a_i \right) \log_2 \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \quad \text{for any } a_i, b_i \geqslant 0,\; i = 1, \ldots, n. $$

With a_i = Pr(x_i, y_j), b_i = Pr(x_i) Pr(y_j) and n = M_x, it follows that

$$ \begin{aligned} I(X; y_j) &= \sum_{i=1}^{M_x} \Pr(x_i, y_j) \log_2 \frac{\Pr(x_i, y_j)}{\Pr(x_i)\Pr(y_j)} \\ &\geqslant \left( \sum_{i=1}^{M_x} \Pr(x_i, y_j) \right) \underbrace{\log_2 \frac{\overbrace{\sum\nolimits_{i=1}^{M_x} \Pr(x_i, y_j)}^{\Pr(y_j)}}{\underbrace{\sum\nolimits_{i=1}^{M_x} \Pr(x_i)\Pr(y_j)}_{\Pr(y_j)}}}_{\log_2 1 = 0} = 0 \\ \Rightarrow\; I(X; y_j) &\geqslant 0. \end{aligned} $$
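The bound can also be checked numerically. The following sketch, which is not part of the original proof, draws random joint probability tables and evaluates I(X; y_j) for every column j. The helper names, the table sizes, and the tolerance for floating-point rounding are assumptions chosen for illustration, and terms with Pr(x_i, y_j) = 0 are skipped under the usual convention 0 log 0 = 0.

```python
import math
import random


def partial_mi(joint, j):
    """I(X; y_j) in bits for column j of a joint table with joint[i][j] = Pr(x_i, y_j)."""
    p_y = sum(row[j] for row in joint)  # marginal Pr(y_j)
    total = 0.0
    for row in joint:
        p_xy = row[j]
        if p_xy > 0.0:  # convention: 0 * log 0 = 0
            p_x = sum(row)  # marginal Pr(x_i)
            total += p_xy * math.log2(p_xy / (p_x * p_y))
    return total


def random_joint(m_x, m_y, rng):
    """Random joint distribution over m_x values of X and m_y values of Y."""
    weights = [[rng.random() for _ in range(m_y)] for _ in range(m_x)]
    norm = sum(sum(row) for row in weights)
    return [[w / norm for w in row] for row in weights]


rng = random.Random(0)  # fixed seed so the check is reproducible
min_seen = float("inf")
for _ in range(1000):
    table = random_joint(4, 3, rng)
    for j in range(3):
        min_seen = min(min_seen, partial_mi(table, j))

# Rounding may push an exact zero slightly below 0, hence the small tolerance.
print("smallest I(X; y_j) observed:", min_seen)
```

Individual summands can be negative whenever Pr(x_i, y_j) < Pr(x_i) Pr(y_j); it is only the sum over i that the log sum inequality bounds from below.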

Cite this article

Othersen, O.G., Stefani, A.G., Huber, J.B. et al. Application of information theory to feature selection in protein docking. J Mol Model 18, 1285–1297 (2012). https://doi.org/10.1007/s00894-011-1157-6
