Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection

  • Original Article
  • Medical & Biological Engineering & Computing

Abstract

Protein glycosylation is one of the most important and complex post-translational modifications, and it provides greater proteomic diversity than any other post-translational modification. Fast and reliable computational methods to identify glycosylation sites are in great demand. Two key issues, feature encoding and feature selection, can critically affect the accuracy of a computational method. We present a new method for predicting O-glycosylation sites using only amino acid sequence information. The method includes the following components: (1) on the basis of multi-scale theory, features based on the multi-scale composition of amino acids are extracted from training sequences with identified glycosylation sites; (2) a two-stage feature selection removes features that have adverse effects on prediction, comprising a first-stage preliminary filtering with Student’s t test and a second-stage screening through iterative elimination, using novel pairwise comparisons conducted in random subspaces with a support vector machine. The important features retained are used to build the prediction model. The method is evaluated with sequence-based tenfold cross-validation tests on balanced datasets. The results of our experiments show that our method significantly outperforms those reported in the literature in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. The prediction accuracy for serine (S) and threonine (T) sites reached 95.7% and 92.7%, respectively, and the Matthews correlation coefficient for S and T sites is 0.914 and 0.873, respectively. The method evaluates each feature in the context of its interactions with the remaining features still included in the model, and it has the advantage of high efficiency.
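The four evaluation measures named in the abstract can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical illustrations and are not taken from the article's data or code.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0),
    as is the case for a balanced dataset.
    """
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced test set of 100 positives and 100 negatives.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix in a single ratio, which is presumably why it is reported alongside accuracy here.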

Acknowledgments

This work was supported in part by the Research Foundation for the Doctoral Program of Higher Education of China (No. 20124320110002). Haiyan Wang’s work was partly supported by a grant from the Simons Foundation (#246077).

Author information

Corresponding author

Correspondence to Zheming Yuan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (RAR 184 kb)

About this article

Cite this article

Chen, Y., Zhou, W., Wang, H. et al. Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection. Med Biol Eng Comput 53, 535–544 (2015). https://doi.org/10.1007/s11517-015-1268-9
