Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection

Original Article


Protein glycosylation is one of the most important and complex post-translational modifications and provides greater proteomic diversity than any other post-translational modification. Fast and reliable computational methods for identifying glycosylation sites are in great demand. Two key issues, feature encoding and feature selection, can critically affect the accuracy of a computational method. We present a new method for predicting O-glycosylation sites that uses only amino acid sequence information. The method includes the following components: (1) on the basis of multi-scale theory, features based on the multi-scale composition of amino acids are extracted from training sequences with identified glycosylation sites; (2) a two-stage feature selection removes features that have adverse effects on the prediction, consisting of a preliminary first-stage filtering with Student’s t test and a second-stage screening through iterative elimination using novel pairwise comparisons conducted in random subspaces with a support vector machine. The important features retained are used to build the prediction model. The method is evaluated with sequence-based tenfold cross-validation tests on balanced datasets. The results of our experiments show that our method significantly outperforms those reported in the literature in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. The prediction accuracy for serine (S) and threonine (T) sites reached 95.7 % and 92.7 %, respectively, and the corresponding Matthews correlation coefficients are 0.914 and 0.873. Because features are eliminated iteratively, each feature is evaluated together with its interactions with the remaining features still included in the model; the method also has the advantage of high efficiency.
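Two pieces of the pipeline described in the abstract are standard enough to sketch in a few lines: a per-feature filter based on Student's t statistic, and the four evaluation measures computed from confusion-matrix counts. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`student_t`, `t_filter`, `evaluation_metrics`), the use of an absolute t-statistic cutoff, and the cutoff value are all hypothetical; the abstract does not state how the t test was thresholded.

```python
import math
from statistics import mean, variance


def student_t(pos, neg):
    """Pooled-variance (Student's) two-sample t statistic for one feature."""
    n1, n2 = len(pos), len(neg)
    diff = mean(pos) - mean(neg)
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    if pooled == 0:
        # No within-class spread: either identical classes or perfect separation.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))


def t_filter(X_pos, X_neg, threshold):
    """Indices of features with |t| >= threshold (the cutoff is an assumption)."""
    keep = []
    for j in range(len(X_pos[0])):
        t = student_t([row[j] for row in X_pos], [row[j] for row in X_neg])
        if abs(t) >= threshold:
            keep.append(j)
    return keep


def evaluation_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy data: feature 0 has equal class means, feature 1 separates the classes.
X_pos = [[1.0, 5.0], [2.0, 5.1], [3.0, 4.9]]
X_neg = [[1.5, 1.0], [2.5, 1.1], [2.0, 0.9]]
selected = t_filter(X_pos, X_neg, threshold=2.0)
scores = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
```

On a balanced dataset such as the one used here, accuracy is the mean of sensitivity and specificity, which is why the Matthews correlation coefficient is reported alongside it as a measure that accounts for all four confusion-matrix cells.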


O-glycosylation · Multi-scale composition of amino acids · Paired comparison through random subspace screening · Support vector machine



This work was supported in part by the Research Foundation for the Doctoral Program of Higher Education of China (No. 20124320110002). Haiyan Wang’s work was partly supported by a grant from the Simons Foundation (#246077).

Supplementary material

11517_2015_1268_MOESM1_ESM.rar (185 kb)
Supplementary material 1 (RAR 184 kb)


Copyright information

© International Federation for Medical and Biological Engineering 2015

Authors and Affiliations

  • Yuan Chen (1)
  • Wei Zhou (2)
  • Haiyan Wang (3)
  • Zheming Yuan (1)
  1. Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, China
  2. Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China
  3. Department of Statistics, Kansas State University, Manhattan, USA
