Application of information theory to feature selection in protein docking

Othersen, Olaf G.; Stefani, Arno G.; Huber, Johannes B.; Sticht, Heinrich

doi:10.1007/s00894-011-1157-6

Original Paper
Published: 12 July 2011

Volume 18, pages 1285–1297, (2012)

Journal of Molecular Modeling

Olaf G. Othersen¹, Arno G. Stefani², Johannes B. Huber² & Heinrich Sticht¹

Abstract

In the era of structural genomics, the prediction of protein interactions using docking algorithms is an important goal. The success of this method critically relies on the identification of good docking solutions among a vast excess of false solutions. We have adapted the concept of mutual information (MI) from information theory to achieve a fast and quantitative screening of different structural features with respect to their ability to discriminate between physiological and nonphysiological protein interfaces. The strategy includes the discretization of each structural feature into distinct value ranges to optimize its mutual information. We have selected 11 structural features and two datasets to demonstrate that the MI is dimensionless and can be directly compared for diverse structural features and between datasets of different sizes. Conversion of the MI values into a simple scoring function revealed that those features with a higher MI are actually more powerful for the identification of good docking solutions. Thus, an MI-based approach allows the rapid screening of structural features with respect to their information content and should therefore be helpful for the design of improved scoring functions in future. In addition, the concept presented here may also be adapted to related areas that require feature selection for biomolecules or organic ligands.
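The abstract gives no procedural detail, so the following is only a minimal sketch of the general idea: a continuous structural feature is discretized into value ranges, and the MI between the resulting range index and a class label (physiological vs. nonphysiological) is estimated from relative frequencies. The function names, the single-threshold exhaustive search, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(bins, labels):
    """Estimate I(X;Y) in bits from two equally long sequences of discrete symbols."""
    n = len(bins)
    joint = Counter(zip(bins, labels))
    p_x = Counter(bins)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Only observed pairs are visited, so p_xy > 0 and the log is defined.
        mi += p_xy * math.log2(p_xy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi


def discretize(values, thresholds):
    """Map each value to the index of its value range (number of thresholds it exceeds)."""
    return [sum(v > t for t in thresholds) for v in values]


def best_single_threshold(values, labels):
    """Hypothetical helper: try every midpoint between neighbouring distinct
    feature values and keep the cut that yields the highest MI.
    Returns (None, 0.0) if no cut carries any information."""
    candidates = sorted(set(values))
    best_t, best_mi = None, 0.0
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        mi = mutual_information(discretize(values, [t]), labels)
        if mi > best_mi:
            best_t, best_mi = t, mi
    return best_t, best_mi


# Toy data (invented): one feature value per docking solution,
# label 1 = physiological interface, 0 = nonphysiological interface.
feature = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
label = [0, 0, 0, 1, 1, 1]
print(*best_single_threshold(feature, label))  # -> 5.0 1.0
```

Because the result is expressed in bits regardless of the feature's physical unit, values obtained this way for different features can be ranked against each other, which is the property the abstract refers to as the MI being dimensionless.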

References

1. Lensink MF, Mendez R, Wodak SJ (2007) Docking and scoring protein complexes: CAPRI 3rd edition. Proteins 69:704–718
2. Lensink MF, Wodak SJ (2010) Docking and scoring protein interactions: CAPRI 2009. Proteins 78:3073–3084
3. Janin J (2010) Protein–protein docking tested in blind predictions: the CAPRI experiment. Mol Biosyst 6:2351–2362
4. Janin J (2010) The targets of CAPRI rounds 13–19. Proteins 78:3067–3072
5. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci USA 89:2195–2199
6. Walls PH, Sternberg MJ (1992) New algorithm to model protein–protein recognition based on surface complementarity. Applications to antibody–antigen docking. J Mol Biol 228:277–297
7. Jones S, Thornton JM (1996) Principles of protein–protein interactions. Proc Natl Acad Sci USA 93:13–20
8. Meyer M, Wilson P, Schomburg D (1996) Hydrogen bonding and molecular surface shape complementarity as a basis for protein docking. J Mol Biol 264:199–210
9. Ausiello G, Cesareni G, Helmer-Citterich M (1997) Escher: a new docking procedure applied to the reconstruction of protein tertiary structure. Proteins 28:556–567
10. Vakser IA, Aflalo C (1994) Hydrophobic docking: a proposed enhancement to molecular recognition techniques. Proteins 20:320–329
11. Gabb HA, Jackson RM, Sternberg MJ (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol 272:106–120
12. Robert CH, Janin J (1998) A soft, mean-field potential derived from crystal contacts for predicting protein–protein interactions. J Mol Biol 283:1037–1047
13. Moont G, Gabb HA, Sternberg MJ (1999) Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins 35:364–373
14. Zhang C, Liu S, Zhou H, Zhou Y (2004) An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 13:400–411
15. Pons C, Talavera D, de la Cruz X, Orozco M, Fernandez-Recio J (2011) Scoring by intermolecular pairwise propensities of exposed residues (SIPPER): a new efficient potential for protein–protein docking. J Chem Inf Model 51:370–377
16. Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, Hoboken
17. Douguet D, Chen HC, Tovchigrechko A, Vakser IA (2006) Dockground resource for studying protein–protein interfaces. Bioinformatics 22:2612–2618
18. Gao Y, Douguet D, Tovchigrechko A, Vakser IA (2007) Dockground system of databases for protein recognition studies: unbound structures for docking. Proteins 69:845–851
19. Liu S, Gao Y, Vakser IA (2008) Dockground protein–protein docking decoy set. Bioinformatics 24:2634–2635
20. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637
21. Fiorucci S, Zacharias M (2010) Prediction of protein–protein interaction sites using electrostatic desolvation profiles. Biophys J 98:1921–1930
22. Aloy P, Russell RB (2002) Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci USA 99:5896–5901
23. Ansari S, Helms V (2005) Statistical analysis of predominantly transient protein–protein interfaces. Proteins 61:344–355
24. Melo F, Feytmans E (1997) Novel knowledge-based mean force potential at atomic level. J Mol Biol 267:207–222
25. Melo F, Sanchez R, Sali A (2002) Statistical potentials for fold assessment. Protein Sci 11:430–448
26. Launay G, Mendez R, Wodak S, Simonson T (2007) Recognizing protein–protein interfaces with empirical potentials and reduced amino acid alphabets. BMC Bioinforma 8:270
27. Fiorucci S, Zacharias M (2010) Binding site prediction and improved scoring during flexible protein–protein docking with ATTRACT. Proteins 78:3131–3139
28. Ohlson MB, Huang Z, Alto NM, Blanc MP, Dixon JE, Chai J, Miller SI (2008) Structure and function of Salmonella SifA indicate that its interactions with SKIP, SseJ, and RhoA family GTPases induce endosomal tubulation. Cell Host Microbe 4:434–446
29. Diacovich L, Dumont A, Lafitte D, Soprano E, Guilhon AA, Bignon C, Gorvel JP, Bourne Y, Meresse S (2009) Interaction between the SifA virulence factor and its host target SKIP is essential for Salmonella pathogenesis. J Biol Chem 284:33151–33160
30. Perkins JR, Diboun I, Dessailly BH, Lees JG, Orengo C (2010) Transient protein–protein interactions: structural, functional, and network properties. Structure 18:1233–1243
31. Dey S, Pal A, Chakrabarti P, Janin J (2010) The subunit interfaces of weakly associated homodimeric proteins. J Mol Biol 398:146–160
32. Gatenby RA, Frieden BR (2007) Information theory in living systems, methods, applications, and challenges. Bull Math Biol 69:635–657
33. Kauffman C, Karypis G (2008) An analysis of information content present in protein–DNA interactions. Pac Symp Biocomput:477–488
34. Sterner B, Singh R, Berger B (2007) Predicting and annotating catalytic residues: an information theoretic approach. J Comput Biol 14:1058–1073
35. Magliery TJ, Regan L (2005) Sequence variation in ligand binding sites in proteins. BMC Bioinforma 6:240
36. Kulharia M, Goody RS, Jackson RM (2008) Information theory-based scoring function for the structure-based prediction of protein–ligand binding affinity. J Chem Inf Model 48:1990–1998
37. Wassermann AM, Nisius B, Vogt M, Bajorath J (2010) Identification of descriptors capturing compound class-specific features by mutual information analysis. J Chem Inf Model 50:1935–1940
38. Cline MS, Karplus K, Lathrop RH, Smith TF, Rogers RG Jr, Haussler D (2002) Information-theoretic dissection of pairwise contact potentials. Proteins 49:7–14
39. Shackelford G, Karplus K (2007) Contact prediction using mutual information and neural nets. Proteins 69(Suppl 8):159–164
40. Miller CS, Eisenberg D (2008) Using inferred residue contacts to distinguish between correct and incorrect protein models. Bioinformatics 24:1575–1582
41. Solis AD, Rackovsky S (2008) Information and discrimination in pairwise contact potentials. Proteins 71:1071–1087

Acknowledgments

The authors thank Kristin Kassler and Dr. Christophe Jardin for critically reading the manuscript. The project was funded within the DFG (Deutsche Forschungsgemeinschaft) priority program (SPP 1395) by grants to JH and HS.

Author information

Authors and Affiliations

Bioinformatik, Institut für Biochemie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Fahrstr. 17, 91054, Erlangen, Germany
Olaf G. Othersen & Heinrich Sticht
Lehrstuhl für Informationsübertragung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstr. 7, 91058, Erlangen, Germany
Arno G. Stefani & Johannes B. Huber

Corresponding author

Correspondence to Heinrich Sticht.

Appendix

Proof that I(X; y_j) is always non-negative

We will show that

$$ I(X; y_j) = \sum_{i=1}^{M_x} \Pr(x_i, y_j) \log_2 \frac{\Pr(x_i, y_j)}{\Pr(x_i)\Pr(y_j)} $$

is always greater than or equal to 0. For this purpose, the log sum inequality is quoted first [16]:

$$ \sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \geqslant \left( \sum_{i=1}^{n} a_i \right) \log_2 \frac{\sum\nolimits_{i=1}^{n} a_i}{\sum\nolimits_{i=1}^{n} b_i} \quad \text{for any } a_i, b_i \geqslant 0,\; i = 1, \ldots, n. $$

With a_i = Pr(x_i, y_j), b_i = Pr(x_i) Pr(y_j) and n = M_x, it follows that

$$ \begin{aligned} I(X; y_j) &= \sum_{i=1}^{M_x} \Pr(x_i, y_j) \log_2 \frac{\Pr(x_i, y_j)}{\Pr(x_i)\Pr(y_j)} \\ &\geqslant \left( \sum_{i=1}^{M_x} \Pr(x_i, y_j) \right) \underbrace{\log_2 \frac{\overbrace{\sum\nolimits_{i=1}^{M_x} \Pr(x_i, y_j)}^{\Pr(y_j)}}{\underbrace{\sum\nolimits_{i=1}^{M_x} \Pr(x_i)\Pr(y_j)}_{\Pr(y_j)}}}_{\log_2 1 = 0} = 0 \\ \Rightarrow\; I(X; y_j) &\geqslant 0. \end{aligned} $$
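The result can also be checked numerically. The sketch below is not part of the original proof: it draws random joint probability tables Pr(x_i, y_j), evaluates I(X; y_j) for every column j, and records the smallest value seen. The table sizes, the number of trials, and the function names are arbitrary choices made for illustration.

```python
import math
import random


def partial_mi(joint, j):
    """I(X; y_j) in bits, where joint[i][j] = Pr(x_i, y_j)."""
    p_y = sum(row[j] for row in joint)      # Pr(y_j): sum over all x_i
    total = 0.0
    for row in joint:
        p_xy = row[j]
        if p_xy > 0.0:                      # convention: 0 * log 0 = 0
            p_x = sum(row)                  # Pr(x_i): sum over all y_j
            total += p_xy * math.log2(p_xy / (p_x * p_y))
    return total


def random_joint(m_x, m_y, rng):
    """Random m_x-by-m_y table of non-negative entries normalised to sum to 1."""
    weights = [[rng.random() for _ in range(m_y)] for _ in range(m_x)]
    norm = sum(map(sum, weights))
    return [[w / norm for w in row] for row in weights]


rng = random.Random(0)
worst = min(
    partial_mi(joint, j)
    for joint in (random_joint(4, 3, rng) for _ in range(1000))
    for j in range(3)
)
# Up to floating-point rounding, `worst` should never fall below zero.
print(worst >= -1e-12)
```

Individual summands can be negative. In the table [[0.4, 0.1], [0.1, 0.4]] with j = 0, the second term is 0.1 · log₂(0.1 / 0.25) < 0. This is why the bound has to be established for the sum as a whole through the log sum inequality rather than term by term.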

About this article

Cite this article

Othersen, O.G., Stefani, A.G., Huber, J.B. et al. Application of information theory to feature selection in protein docking. J Mol Model 18, 1285–1297 (2012). https://doi.org/10.1007/s00894-011-1157-6

Received: 20 May 2011
Accepted: 21 June 2011
Published: 12 July 2011
Issue Date: April 2012
DOI: https://doi.org/10.1007/s00894-011-1157-6
