A comparative study of family-specific protein–ligand complex affinity prediction based on random forest approach
Assessing the binding affinity between ligands and their target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have been proposed for fast prediction of binding affinity, with promising results. However, most of them were developed as all-purpose models that ignore the specific functions of different protein families, even though proteins from different functional families differ in structure and physicochemical features. In this study, we propose a random forest method to predict protein–ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were carried out separately for each protein family dataset; the results indicate that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. For comparison, two generic models covering diverse protein families were also built. The evaluation results show that models trained on family-specific datasets outperform those trained on the generic datasets. On the test sets, the Pearson correlation coefficients (Rp) are 0.740, 0.874 and 0.735, and the Spearman correlation coefficients (Rs) are 0.697, 0.853 and 0.723, for HIV-1 protease, trypsin and carbonic anhydrase respectively. Comparisons with other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way to predict affinity for a particular protein family.
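As an illustration of the two evaluation metrics named above, the following is a minimal sketch of how Rp and Rs can be computed from measured and predicted affinities. It is not the authors' code: the function names and the affinity values are hypothetical, and the Spearman coefficient is computed here as the Pearson coefficient of average ranks, which is the standard definition when ties are present.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (Rp) of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Rs): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured and predicted affinities (illustrative values only).
measured = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 6.9, 6.5, 8.0]
rp = pearson(measured, predicted)
rs = spearman(measured, predicted)
print(f"Rp = {rp:.3f}, Rs = {rs:.3f}")
```

Rp measures how linearly the predictions track the measured values, while Rs depends only on their ordering, which is why both are commonly reported together when ranking ligands matters as much as absolute affinity.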
Keywords: Protein–ligand binding affinity prediction; Family-specific model; Generic model; Random forest
This work was funded by the National Natural Science Foundation of China (No. 21175095, 21273154, 21375090).