Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection
Protein glycosylation is one of the most important and complex post-translational modifications, and it provides greater proteomic diversity than any other post-translational modification. Fast and reliable computational methods to identify glycosylation sites are in great demand. Two key issues, feature encoding and feature selection, can critically affect the accuracy of a computational method. We present a new method for predicting O-glycosylation sites that uses only amino acid sequence information. The method has two components: (1) on the basis of multi-scale theory, features based on the multi-scale composition of amino acids are extracted from training sequences with identified glycosylation sites; (2) a two-stage feature selection removes features that have adverse effects on the prediction, consisting of a preliminary filtering with Student’s t test in the first stage and, in the second stage, screening by iterative elimination using novel pairwise comparisons conducted in random subspaces with a support vector machine. The important features that are retained are used to build the prediction model. The method is evaluated with sequence-based tenfold cross-validation tests on balanced datasets. In our experiments, the method significantly outperforms those reported in the literature in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. The prediction accuracy for serine (S) and threonine (T) sites reached 95.7 % and 92.7 %, respectively, and the corresponding Matthews correlation coefficients are 0.914 and 0.873. The method is highly efficient, and it evaluates each feature in the presence of its interactions with the other features still included in the model.
Keywords: O-glycosylation; Multi-scale composition of amino acids; Paired comparison through random subspace screening; Support vector machine
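The feature encoding, the first-stage t test filter, and the reported evaluation measures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the multi-scale composition precisely, so the nested windows around a candidate site, the half-widths `(2, 5, 10)`, the threshold `2.0`, and all function names are assumptions made for illustration. The second-stage pairwise comparison in random subspaces with a support vector machine is not covered here.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def multiscale_composition(seq, site, half_widths=(2, 5, 10)):
    """Amino acid frequencies in nested windows centred on seq[site].

    Assumption: "multi-scale" is read as several window sizes around the
    candidate S/T residue; the paper may define the scales differently.
    """
    features = []
    for w in half_widths:
        window = seq[max(0, site - w): site + w + 1]  # truncated at the ends
        for aa in AMINO_ACIDS:
            features.append(window.count(aa) / len(window))
    return features

def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled = (ss_a + ss_b) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # constant feature: carries no evidence either way
    return (ma - mb) / sqrt(pooled * (1 / na + 1 / nb))

def t_test_filter(positives, negatives, threshold=2.0):
    """Indices of features whose |t| between the classes reaches threshold.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    kept = []
    for j in range(len(positives[0])):
        t = student_t([row[j] for row in positives],
                      [row[j] for row in negatives])
        if abs(t) >= threshold:
            kept.append(j)
    return kept

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

In this sketch each candidate site yields 20 frequencies per window size, so three scales give a 60-dimensional vector. `t_test_filter` keeps only the features whose class means differ clearly, and `classification_metrics` would be applied to the confusion counts from each cross-validation fold.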
This work was supported in part by the Research Foundation for the Doctoral Program of Higher Education of China (No. 20124320110002). Haiyan Wang’s work was partly supported by a grant from the Simons Foundation (#246077).