GlyStruct: glycation prediction using structural properties of amino acid residues
Glycation is one of the post-translational modifications (PTMs) in which sugar molecules and residues in protein sequences become covalently bonded. It has recently become one of the clinically important PTMs because it is implicated in many chronic and age-related complications. Because glycation is a non-enzymatic reaction, it is very challenging to predict: the sequence motifs around glycation sites lack significant bias.
We developed GlyStruct, a classifier based on a support vector machine, to predict glycated and non-glycated lysine residues using structural properties of amino acid residues. The features used were secondary structure, accessible surface area and the local backbone torsion angles. For this work, a benchmark dataset containing 235 glycated and 303 non-glycated lysine residues was extracted. GlyStruct improved performance by approximately 10% compared with the benchmark method, Gly-PseAAC. For 10-fold cross-validation, GlyStruct achieved a sensitivity, specificity, accuracy and Matthews correlation coefficient of 0.7013, 0.7989, 0.7562 and 0.5065, respectively.
Glycation has recently emerged as one of the clinically important PTMs of proteins. The development of computational tools to predict glycation has therefore become necessary, as such tools could help medical professionals administer drugs and manage patients more effectively. The proposed predictor classifies glycated and non-glycated lysine residues with promising, consistent results across various cross-validation schemes and outperforms other state-of-the-art methods.
Keywords: Post-translational modification; Lysine glycation; Protein sequences; Amino acids; Prediction; Support vector machine
AGEs: Advanced glycation end-products
ASA: Accessible surface area
AUC: Area under the curve
CKSAAP: Composition of k-spaced amino acid pairs
CPLM: Compendium of Protein Lysine Modifications
MCC: Matthews correlation coefficient
PLMD: Protein Lysine Modification Database
PSAAP: Position-specific amino acid propensity
PTM: Post-translational modification
RBF: Radial basis function
SVM: Support vector machine
Post-translational modifications (PTMs) of proteins occur when covalent alterations to protein backbones and side chains increase proteome complexity. PTMs are generally mediated by enzymatic activity at selected sites along amino acid side chains after translation of the protein by the ribosome is complete [1, 2]. These modifications provide important insight into various cellular functions and biological processes of proteins, such as cellular dynamics and elasticity [3, 4]. There are many important PTMs with significant biological impact, such as acetylation, carbonylation, glycosylation, glycation, methylation, nitrosylation, phosphorylation, sumoylation, succinylation, and ubiquitylation, to name a few [5, 6, 7, 8, 9, 10].
Lately, glycation has emerged as being of significant clinical relevance because of its correlation with increased blood glucose concentration [11, 12] and its role in metabolic morbidity detection. This biochemistry involves a complex multi-step site modification process between reducing sugars and amino groups located in lysine (K) and arginine (R) residues, or in the N-terminal position, to form an Amadori adduct [14, 15]. The Amadori adduct reacts further to form advanced glycation end-products (AGEs). With aging, AGEs accumulate and alter the structure, function and turnover of tissue proteins. If untreated, AGEs can lead to chronic complications of diabetes mellitus and neurodegenerative changes such as Alzheimer’s disease and amyotrophic lateral sclerosis [16, 17, 18, 19, 20, 21, 22, 23, 24]. Moreover, correlations have been established between levels of AGEs and diabetes with its related complications [7, 20, 24, 25, 26] in aging Homo sapiens. Because glycation is a non-enzymatic reaction, it is very challenging to detect: its sequence motifs have greater entropy than those of other PTMs. Conversely, an enzymatic reaction is more specific and often has a more biased sequence motif [27, 28].
In clinical practice, PTMs are identified in wet labs by observing the modification with methods such as mass spectrometry and immunofluorescence, and are stored in online databanks such as dbPTM, CPLM and PLMD [1, 29, 30, 31]. Although PTMs are an important area for morbidity detection and genetics, clinical approaches are severely limited by the plethora of protein sequences held in data repositories, the high costs, and the time-consuming process of biochemical experimentation in wet labs. Hence, data scientists have been exhorted to actively pursue the development of computational tools that provide cost-effective solutions [3, 33, 34, 35]. This has led to an evolution of data mining in medicine, especially in the area of proteomics [36, 37, 38]. A concerted international effort has seen large datasets being actively developed to study and predict site-specific protein modification [31, 39].
While the clinical importance of glycation is obvious, few predictors have been proposed for this type of PTM. The earliest predictor, GlyNN, was developed using an artificial neural network and a dataset of only 89 glycated and 126 non-glycated lysine residues from a set of 20 proteins. The PreGly predictor by Liu et al., built on the same dataset as GlyNN, used the composition of k-spaced amino acid pairs (CKSAAP) to extract features from protein sequences. GlyNN achieved a sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of 0.7865, 0.8015, 0.795 and 0.58, respectively, while PreGly achieved 0.7106, 0.9585, 0.8551 and 0.7, respectively, for the same metrics. Gly-PseAAC, developed by Xu et al., used the recently updated dataset from the CPLM databank consisting of 223 glycated and 446 non-glycated residues. They considered features from the position-specific amino acid propensity (PSAAP) scheme. More recently, Zhao et al. proposed the Glypre predictor using a combination of features such as position conservation, amino acid index and CKSAAP. In addition, Islam et al. investigated an even larger set of features for their predictor iProtGly-SS, including propensity-based features, amino acid composition, physicochemical features and secondary structure motifs. The results obtained by Gly-PseAAC on the recent dataset are low, with sensitivity at 0.5748 and specificity at 0.7430. Furthermore, Glypre and iProtGly-SS reported performance on the two datasets from Johansen and Xu et al. but applied various filtering techniques to overcome the imbalance between negative and positive instances. Glypre excels on the dataset from Johansen, but on the larger dataset from Xu et al. it achieved a sensitivity of only 0.5747 alongside a high specificity of 0.9078. On the same new dataset, the iProtGly-SS predictor manages a higher sensitivity of 0.9238. However, its specificity reached a maximum of only 0.6009.
All comparisons are made for 10-fold cross-validation, since its results are generally higher. For clinical use, however, glycation requires robust prediction of both glycated and non-glycated lysines. There is therefore an opportunity to explore alternative methods that are more robust, and any slight improvement in prediction provides a valuable resource to the community.
To predict glycation sites with high accuracy and to address the shortcomings of previous studies, we introduce a new machine learning method called GlyStruct to predict glycation of lysines. To develop the GlyStruct predictor, we used structural information extracted from the predicted local structure of protein sequences as our input feature set and employed a Support Vector Machine (SVM) as the classifier [44, 45]. Our results demonstrate that GlyStruct predicts both glycated and non-glycated lysine residues better than the previously proposed methods found in the literature for this task. Using GlyStruct, we achieved 0.7013, 0.7989, 0.7562, and 0.5065 for sensitivity, specificity, accuracy and Matthews correlation coefficient, respectively, for 10-fold cross-validation.
Methods and materials
To build our predictor model, a benchmark dataset was curated from online databanks. Following the standard methodology in bioinformatics, the dataset was then formulated to make it suitable for training classifiers, and an appropriate cross-validation scheme was used to objectively evaluate the accuracy of the predictor.
This section describes the proposed method and benchmark dataset used in this study.
The dataset for glycation was obtained from the publicly available and widely used CPLM database (available at http://cplm.biocuckoo.org/), which was curated from comprehensive clinical and in vitro studies. The benchmark dataset we retrieved was filtered for redundant sequences with a threshold of 30% pairwise sequence identity. The final dataset consisted of 1753 lysine sites in total, found in 55 proteins. Among them, 235 lysines are glycated and 1518 are non-glycated sites. The primary sequences used to build GlyStruct are included in the supplement as Additional file 1.
Secondary structure features reveal intrinsic information about the characteristics of a protein sequence. In this study, we considered three attributes that describe the local structure of a protein, namely the secondary structure, the local backbone torsion angles, and the accessible surface area (ASA). These attributes were predicted using the SPIDER2 toolbox. SPIDER2 has demonstrated promising results compared with other methods found in the literature for predicting secondary structure [47, 48], backbone angles [49, 50], and accessible surface area [46, 49, 51] of amino acids. SPIDER2 predictions have been used in different studies with promising results [52, 53, 54]. The following describes the features integrated in this work:
Accessible Surface Area (ASA) provides an estimate of the surface area of a particular amino acid that is reachable by a solvent in the protein’s three-dimensional configuration [55, 56]. The predicted ASA values for individual amino acids hence provide essential information about how each one interacts locally with other amino acids to build the global protein structure.
Secondary structure provides insight into the local three-dimensional structure within a protein sequence, where each amino acid can be discriminated based on the three defined local backbone folding patterns of a polypeptide. These are the helix (ph), strand (pe) and coil (pc) motifs. Information from the secondary structure can contribute constructively to the general three-dimensional configuration of the polypeptide and the affinity for PTM of lysine residues [54, 57]. Given a protein sequence, SPIDER2 produces an L × 3 matrix containing the predicted secondary structure, which we call SSpre. L represents the length of the protein sequence, and the columns represent the transitional probabilities of each amino acid conforming to the three secondary structures.
Feature vector construction
Protein sequences are of varying lengths and cannot be used directly in classification. Classifiers require inputs of fixed length; therefore, we employed a widely used method of truncating the protein sequence into fixed-length peptide segments [54, 57, 62, 63, 64, 65, 66] proposed by Chou [67, 68].
The feature set Fi presented in Eq. (2) for each amino acid is an 8-dimensional vector. Concatenating these vectors over the whole segment (13 amino acids) produces a 104-dimensional vector (13 × 8 = 104). The appropriate class label (y = 1 or y = 0) for each lysine residue instance is used in developing the classifier.
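The windowing step can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the window is assumed to be centred on the lysine with six residues on each side, and positions that fall outside the protein are assumed to be zero-padded. The per-residue input stands in for the 8 predicted values per amino acid.

```python
from typing import List, Sequence, Tuple

FLANK = 6  # assumption: 6 residues on each side of the lysine, 13 in total


def window_features(per_residue: Sequence[Sequence[float]], center: int,
                    flank: int = FLANK) -> List[float]:
    """Concatenate per-residue features over a window centred on `center`.

    Positions outside the protein are filled with zeros (an assumption).
    """
    n_feat = len(per_residue[0])
    vector: List[float] = []
    for pos in range(center - flank, center + flank + 1):
        if 0 <= pos < len(per_residue):
            vector.extend(per_residue[pos])
        else:
            vector.extend([0.0] * n_feat)
    return vector


def lysine_vectors(sequence: str,
                   per_residue: Sequence[Sequence[float]]
                   ) -> List[Tuple[int, List[float]]]:
    """Return (position, feature vector) for every lysine (K) in the sequence."""
    return [(i, window_features(per_residue, i))
            for i, aa in enumerate(sequence) if aa == "K"]
```

With 8 values per residue and a 13-residue window, each lysine yields a vector of length 104, matching the dimensionality stated in the text.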
SVM works by establishing an optimal hyperplane between classes and extends to patterns that are not linearly separable by using kernel functions. If the dimensionality of feature vectors is very high, then dimensionality reduction techniques can be employed before SVM application [70, 71, 72, 73, 74, 75, 76, 77, 78, 79].
We designed our classifier using libsvm, a publicly available and widely used SVM tool that is also accessible on the WEKA platform. Tuning parameters were obtained using grid search, which gave C = 512 and γ = 0.03125. We used the polynomial kernel, $(x_i^{T} x_j + C_0)^d$, because it provided better results, with $C_0 = 0$ and polynomial degree $d = 3$.
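As an illustration of the kernel itself, here is a minimal pure-Python sketch. The function names are hypothetical. LIBSVM's documented polynomial kernel also scales the inner product by γ, so a `gamma` argument is included; whether γ was applied this way in the reported experiments is an assumption, and `gamma=1.0` reproduces the formula as written in the text.

```python
from typing import List, Sequence


def poly_kernel(x: Sequence[float], y: Sequence[float],
                gamma: float = 1.0, coef0: float = 0.0, degree: int = 3) -> float:
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def gram_matrix(samples: Sequence[Sequence[float]], **kernel_args) -> List[List[float]]:
    """Kernel value for every pair of samples (symmetric matrix)."""
    return [[poly_kernel(a, b, **kernel_args) for b in samples] for a in samples]
```

For example, `poly_kernel([1, 2], [3, 4])` cubes the inner product 11, while passing `gamma=0.03125` scales that inner product before it is raised to the third power.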
Results and discussion
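In these metrics, TP (true positive) denotes the number of correctly identified glycated instances from the test set, and FN (false negative) denotes the number of glycated sites that were incorrectly classified. TN (true negative) and FP (false positive) are the corresponding counts of correctly and incorrectly classified non-glycated sites.
The best performing predictor is the one that scores highest in the majority of the four metrics.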
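The four metrics follow their standard definitions in terms of the confusion-matrix counts. A minimal sketch, with a hypothetical function name and illustrative counts that are not taken from the study:

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of glycated sites recovered
    specificity = tn / (tn + fp)          # fraction of non-glycated sites recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Illustrative counts only (235 positives, 303 negatives in total).
example = classification_metrics(tp=165, fn=70, tn=242, fp=61)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which makes it informative when the two classes differ in size.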
The effectiveness of any classifier is measured using cross-validation methods. The three most widely used cross-validation schemes across the literature are independent dataset, k-fold and jackknife [83, 84]. Since the dataset for glycation in the curated protein sequences is limited, it was not practical to obtain additional data to run independent test validation.
The k-fold cross-validation procedure is carried out by first partitioning the total benchmark dataset into k roughly equal folds. Then one fold is held as a test set and the remaining k − 1 folds are used to train the classifier and a model is constructed. Using the constructed model and the test dataset that was held out, all prediction metrics are computed. This procedure is repeated k times as per the fold number chosen to obtain the average of the performance metrics.
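The procedure can be sketched in a few lines of Python. The names `k_fold_indices`, `cross_validate`, `train_fn` and `score_fn` are hypothetical placeholders for illustration, and the shuffling seed is an assumption; the study itself used libsvm for training.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples: Sequence, labels: Sequence, k: int,
                   train_fn: Callable, score_fn: Callable, seed: int = 0) -> float:
    """Hold out each fold once, train on the other k - 1, and average the scores."""
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        test_set = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in test_set]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [samples[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / k
```

Repeating the call with different seeds and averaging the results gives the kind of multi-run estimate that reduces the dependence on any single random partition.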
The jackknife process can be viewed as a special instance of k-fold in which k equals n, the number of samples, so that each sample is held out exactly once. While the jackknife method is recognized as the least arbitrary, since it outputs unique results on a given benchmark dataset, the k-fold method offers an advantage whereby all instances or observations in the dataset can be used in both the training and test phases.
The dataset for our study comprised 235 glycated and 1518 non-glycated lysine residues obtained from 55 protein sequences. This is highly imbalanced between the positive (glycated) and negative (non-glycated) sets, with a ratio of over 1:6. While this is a natural phenomenon in the biological sense, it creates a strong bias toward the negative (non-glycated) class if the dataset is used as is to train virtually any classifier. Therefore, we used a k-nearest neighbor (kNN) filter to resolve the imbalance in the dataset, similar to the approach taken by Jia et al. and López et al. The kNN cleaning treatment with a k value of 16 brought the number of negative samples down to 303. In other words, the cleaning treatment removed those negative samples (non-glycated sites) that were within the 16 neighbors of a positive sample (glycated site), leaving 235 positive samples and 303 negative samples.
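The cleaning step might be sketched as follows. This is one reading of the description, not the authors' implementation: a negative sample is dropped when any of its k nearest neighbours (by Euclidean distance, itself an assumption) is a positive sample. The function name `knn_clean` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def knn_clean(positives: Sequence[Vector], negatives: Sequence[Vector],
              k: int) -> List[Vector]:
    """Keep only negatives whose k nearest neighbours contain no positive."""
    pool = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    kept = []
    for idx, neg in enumerate(negatives):
        own_index = len(positives) + idx  # skip the sample itself
        neighbours = sorted(
            (math.dist(neg, vec), label)
            for j, (vec, label) in enumerate(pool) if j != own_index
        )
        if not any(label == 1 for _, label in neighbours[:k]):
            kept.append(neg)
    return kept
```

Larger values of k remove more negatives, so k acts as the knob that sets the final class ratio; the text reports that k = 16 reduced the negative set from 1518 to 303 samples.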
Comparison with benchmark prediction methods
Performance evaluation of GlyStruct compared with the other existing method
We compared our results with Gly-PseAAC, the state-of-the-art bioinformatics study on glycation and the only predictor with a webserver available for testing our dataset.
The dataset retrieved by the Gly-PseAAC authors from the CPLM database is larger than those of GlyNN and PreGly. It consisted of 223 positive and 446 negative samples filtered from 72 protein sequences with 40% pairwise sequence identity. Their dataset differs slightly (by approximately 5% for positive samples) from the GlyStruct dataset of 235 positive and 303 negative samples from 55 proteins, obtained after filtering with a threshold of 30% pairwise sequence identity. Therefore, to compare against the performance of Gly-PseAAC, we manually uploaded our dataset to the Gly-PseAAC webserver as a FASTA file. The performance results we obtained from the webserver are presented together with the GlyStruct performance in Table 1.
On our dataset, the sensitivity of the Gly-PseAAC method for 10-fold rose notably to 0.6845 from their reported value of 0.5748. We suspect that most of the protein sequences we tested on their webserver may have been used in training their model, primarily because of the limited datasets publicly available in databanks. In addition, the Gly-PseAAC server has been tuned to a threshold probability of 0.35, which allows more negative samples to be misclassified and leads to a very high fall-out, or false positive rate, averaging 32% over the three k-fold validation schemes. A high false positive rate may have a serious bearing on clinical significance in terms of morbidity detection. In contrast, the specificity of Gly-PseAAC for 10-fold fell to 0.6745 from the reported 0.8017, and MCC was also slightly lower on our dataset (0.3587 compared with their reported 0.38). The accuracy was also slightly lower (0.6784) than their reported result (0.6812). To show the significance of the results achieved by GlyStruct, a pairwise t-test was conducted. The p-values obtained were 0.025, 0.019 and 0.025 for 10-, 8- and 6-fold, respectively. These p-values are less than 0.05, which demonstrates that the improvement in performance by GlyStruct over Gly-PseAAC is significant. The significance of contribution and the false discovery rates were also tested for each feature used. All features were found to be significant contributors to the results obtained. The aforementioned test results are included in Additional file 1.
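The effect of the decision threshold on the false positive rate can be illustrated with a small sketch. The scores and labels below are invented for illustration and do not come from either predictor; the function name is hypothetical.

```python
from typing import Sequence


def false_positive_rate(scores: Sequence[float], labels: Sequence[int],
                        threshold: float) -> float:
    """Fraction of negative samples (label 0) scored at or above the threshold.

    This equals 1 - specificity.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    false_pos = sum(1 for s in negatives if s >= threshold)
    return false_pos / len(negatives)


# Hypothetical predicted probabilities and true labels.
scores = [0.90, 0.60, 0.40, 0.45, 0.30, 0.20, 0.36, 0.10]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

fpr_default = false_positive_rate(scores, labels, threshold=0.50)
fpr_lowered = false_positive_rate(scores, labels, threshold=0.35)
```

Lowering the threshold from 0.50 to 0.35 recovers the positive scored 0.40 but also admits the negatives scored 0.45 and 0.36, which is the sensitivity-for-specificity trade described above.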
The GlyNN webserver, one of the earliest bioinformatics tools for glycation, is still accessible online but restricts protein sequence length to between 34 and 4000 amino acids. Hence, the job we submitted was rejected because two protein sequences in our dataset, Q86XX4 (4008 amino acids) and P13191 (20 amino acids), violated the GlyNN server policies. This webserver was developed using a small, manually curated dataset consisting of 89 positive and 126 negative glycation sites from 20 peptides, which precedes the recent datasets. Moreover, the GlyNN authors did not use residue sites that had not been validated at the time of development to train the classifier. These sites, marked “U” to denote “unvalidated site”, have since been validated in the recent iteration of the CPLM databank.
Among other recent methods, the glycation webservers PreGly and iProtGly-SS were not functional when we attempted to test them. In addition, the published code for Glypre could not be executed in the absence of a guide. Both Glypre and iProtGly-SS employed Gly-PseAAC data for training their classifiers and used GlyNN data as a comparator dataset. Furthermore, the datasets published by Glypre and iProtGly-SS were in segmented format without protein name annotations, and so could not be used for testing the GlyStruct predictor. Therefore, pairwise comparison of performance with these state-of-the-art methods was not possible.
With the exception of GlyNN and PreGly, all other state-of-the-art methods, including GlyStruct, obtained their data from the CPLM database. However, the datasets differ significantly because of regular updates to the databanks, inconsistencies in the primary sequence identity threshold selected by various authors, and the filtering techniques applied to the negative instances of the dataset before training the classifier. Nonetheless, we made comparisons with the published results of those methods, which we could not verify through the standard means of webservers or code. The Glypre method published a high specificity of 0.9078 but recorded an average sensitivity of 0.5747, compared with 0.7013 achieved by GlyStruct. The accuracy and MCC for Glypre were reported to be marginally higher, at 0.7968 and 0.52 respectively, compared with 0.7562 and 0.51 respectively for GlyStruct. Furthermore, iProtGly-SS published a high sensitivity of 0.9238. However, it recorded a lower specificity of 0.6009, compared with 0.7989 for GlyStruct. All comparisons are made for 10-fold cross-validation, which tends to produce the best results.
Overall, our predictor GlyStruct, using only structural features of peptides and SVM as a classifier, produced consistent results (averaged over 50 runs of cross-validation for each fold) in all the metrics and for all folds. It performed better than the comparator method, Gly-PseAAC. Against each of the other state-of-the-art methods on a similar dataset, GlyStruct was better by over 10% in at least one metric.
The prime motivation for developing a prediction model for glycation is to provide clinical support for timely, cost-effective diagnosis of morbidity and cellular conditions. However, when predicting a PTM like glycation, we need to be mindful that, while high sensitivity is desirable for identifying the glycation process, a false positive prediction can lead to a potentially lethal situation. In such cases, the medical professional may administer medication that further lowers blood glucose concentration, causing an induced hypoglycemia that can be fatal if not managed well [86, 87]. The prediction model we developed has a low false positive rate (high specificity), which can be instrumental in avoiding induced hypoglycemia.
With glycation emerging as one of the clinically important post-translational modifications of proteins in recent times, a classification engine becomes necessary to predict both glycated and non-glycated lysine residues with high accuracy. The limited dataset and the lack of bias in the sequence motifs, attributed to the non-enzymatic nature of this PTM, make accurate prediction very challenging. The glycation predictor we proposed, GlyStruct, is based on the secondary structure properties of proteins, for which we considered the local backbone angles, the transitional probabilities of secondary structures and the accessible surface area, all obtained through the SPIDER2 prediction engine. The protein sequences were truncated into segments of 13 amino acids for each lysine site to produce feature vectors of size (104 × 1). Because of the highly unbalanced nature of the PTM dataset, k-nearest neighbor filtering was employed to balance the classes before training the SVM classifier. The predictor was developed using libsvm on the WEKA platform, and standard grid-search tuning was applied, which yielded better results than previous studies. The results we obtained show promising levels of robustness, with a relatively high sensitivity of 0.7059 for 8-fold validation and a specificity of over 0.79 in all folds. The latter demonstrates the ability of the predictor to reduce the false positive rate (falsely predicting glycation). For clinical success, higher values for both sensitivity and specificity are desirable for this PTM, since false positive predictions can be of more serious concern.
This research was in part supported by Faculty of Science Technology and Environment Research Committee, Grant Number FST14/F3205, The University of the South Pacific, Suva, Fiji Islands.
Publication of this article was funded by JSPS KAKENHI Grant Number 15F15385, and partly supported by JST CREST Grant Number JPMJCR1412, Japan.
Availability of data and materials
About this supplement
This article has been published as part of BMC Bioinformatics, Volume 19 Supplement 13, 2018: 17th International Conference on Bioinformatics (InCoB 2018): bioinformatics. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-13.
HR and AS conceived the idea and wrote the first manuscript. HR and AS performed analysis and experiments. AC and AD contributed in manuscript write-up. DS and TT provided computational resources. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 2. Voet D, Voet JG, Pratt CW. Fundamentals of biochemistry: life at the molecular level. 5th ed. New Jersey: Wiley; 2016.
- 6. Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2011;40(D1):D261–70.
- 13. Guedes S, Vitorino R, Domingues MRM, Amado F, Domingues P. Glycation and oxidation of histones H2B and H1: in vitro study and characterization by mass spectrometry. Anal Bioanal Chem. 2011;399(10):3529–39.
- 33. Yan X, Kuo-Chen C. Recent Progress in predicting posttranslational modification sites in proteins. Curr Top Med Chem. 2016;16(6):591–603.
- 37. Saini H, Raicar G, Sharma A, Lal S, Dehzangi A, Ananthanarayanan R, Lyons J, Biswas N, Paliwal KK. Protein structural class prediction via k-separated bigrams using position specific scoring matrix. J Adv Comput Intell. 2014;18(4):474–9.
- 39. dbPTM [dbptm.mbc.nctu.edu.tw/ Accessed: 20 Jan 2018].
- 40. Liu Y, Gu W, Zhang W, Wang J. Predict and analyze protein glycation sites with the mRMR and IFS methods. Biomed Res Int. 2015;2015:6.
- 43. Zhang Q, Monroe ME, Schepmoes AA, Clauss TR, Gritsenko MA, Meng D, Petyuk VA, Smith RD, Metz TO. Comprehensive identification of glycated peptides and their glycation motifs in plasma and erythrocytes of control and diabetic subjects. J Proteome Res. 2011;10(7):3076–88.
- 44. Ben-Hur A, Horn D, Siegelmann HT, Vapnik V. Support vector clustering. J Mach Learn Res. 2001;2:125–37.
- 45. Cortes C, Vapnik V. Support vector machine. Mach Learn. 1995;20(3):273–97.
- 46. Yang Y, Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, Sattar A, Zhou Y. SPIDER2: a package to predict secondary structure, accessible surface area, and Main-chain torsional angles by deep neural networks. In: Zhou Y, Kloczkowski A, Faraggi E, Yang Y, editors. Prediction of protein secondary structure. New York: Springer New York; 2017. p. 55–63.
- 49. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 50. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.
- 51. Salzberg SL. C4.5: programs for machine learning by J. Ross Quinlan. Morgan Kaufmann publishers, Inc., 1993. Mach Learn. 1994;16(3):235–40.
- 57. Dehzangi A, López Y, Lal SP, Taherzadeh G, Sattar A, Tsunoda T, Sharma A. Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams. PLoS One. 2018;13(2):e0191900.
- 61. Duda RO, Hart PE, Stork DG. Pattern classification. 2nd ed. New York: Wiley-Interscience; 2000.
- 80. Bishop C. Pattern recognition and machine learning. New York: Springer; 2006.
- 84. Alpaydin E. Introduction to machine learning. 3rd ed. Massachusetts: MIT Press; 2014.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.