Protein binding hot spots prediction from sequence only by a new ensemble learning method
Hot spots are interfacial core areas of binding proteins and have been used as targets in drug design. Locating hot spot areas experimentally is costly in both time and money. Recently, in silico computational methods have been widely used to predict hot spots from sequence or structure characteristics. Because the structure of a protein is not always solved, identifying hot spots from amino acid sequence alone is more useful in real-life applications. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers built with the IBk (instance-based k-nearest-neighbor) algorithm, in which instances are encoded by important properties extracted from a total of 544 properties in the AAindex1 (Amino Acid Index) database. The top-performing classifiers are then selected to form an ensemble by majority voting. The ensemble classifier outperforms the state-of-the-art computational methods, yielding an F1 score of 0.80 on the benchmark binding interface database (BID) test set.
Availability: http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm.
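The voting scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each ensemble member is a k-nearest-neighbor classifier that sees only one subset of feature columns (standing in for one AAindex1 property encoding), and that members vote with equal weight. The function names, the toy feature vectors, and the choice of k = 3 are all hypothetical, and the step that selects the top-performing classifiers is omitted.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by a majority vote among its k nearest training instances."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def ensemble_predict(train_X, train_y, x, feature_subsets, k=3):
    """Majority vote over k-NN members, each restricted to one feature subset."""
    votes = []
    for cols in feature_subsets:
        sub_train = [[row[c] for c in cols] for row in train_X]
        sub_x = [x[c] for c in cols]
        votes.append(knn_predict(sub_train, train_y, sub_x, k))
    return Counter(votes).most_common(1)[0][0]


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive (hot spot) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up for illustration): three features per residue,
# label 1 = hot spot, label 0 = non-hot spot.
train_X = [
    [0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.0, 0.3, 0.7],
    [0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [1.0, 0.7, 0.3],
]
train_y = [0, 0, 0, 1, 1, 1]

# One ensemble member per feature column; an odd member count avoids tied votes.
feature_subsets = [(0,), (1,), (2,)]

prediction = ensemble_predict(train_X, train_y, [0.85, 0.75, 0.15], feature_subsets)
```

An odd number of members and an odd k keep both votes free of ties in the two-class case. A fuller version would rank the members on held-out data and keep only the best ones before voting.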
Keywords: Hot spot residue, Ensemble system, IBk
This work was supported by the National Natural Science Foundation of China (Nos. 61672035, 61300058, 61472282, 61271098 and 61374181).
SH and PC conceived the study; SH participated in the experimental design; SH and PC carried it out and drafted the manuscript. All authors revised the manuscript critically. JL and PC approved the final manuscript.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no competing interests.
The authors declare that their manuscript complies with the Ethical Rules applicable for this journal.