Prediction of lysine ubiquitination with mRMR feature selection and analysis
- Cite this article as:
- Cai, Y., Huang, T., Hu, L. et al. Amino Acids (2012) 42: 1387. doi:10.1007/s00726-011-0835-0
Ubiquitination, one of the most important post-translational modifications of proteins, occurs when ubiquitin (a small 76-amino acid protein) is attached to a lysine on a target protein. It often commits the labeled protein to degradation and plays important roles in regulating many cellular processes implicated in a variety of diseases. Because ubiquitination is rapid and reversible, identifying ubiquitination sites with conventional experimental approaches is time-consuming and labor-intensive. To discover lysine-ubiquitination sites efficiently, we developed a sequence-based predictor of ubiquitination sites based on the nearest neighbor algorithm. We used the maximum relevance and minimum redundancy principle to identify the key features and the incremental feature selection procedure to optimize the prediction engine. PSSM conservation scores, amino acid factors and disorder scores of the surrounding sequence formed the optimized set of 456 features. The Matthews correlation coefficient (MCC) of our ubiquitination site predictor was 0.142 in a jackknife cross-validation test on a large benchmark dataset. In an independent test, the MCC of our method was 0.139, higher than that of the existing ubiquitination site predictors UbiPred and UbPred, whose MCCs on the same test set were 0.135 and 0.117, respectively. Our analysis shows that the conservation of amino acids at and around the lysine plays an important role in ubiquitination site prediction. Moreover, disorder and ubiquitination are strongly related. These findings might provide useful insights for studying the mechanisms of ubiquitination and modulating the ubiquitination pathway, potentially leading to therapeutic strategies in the future.
Keywords: Ubiquitination; Maximum relevance and minimum redundancy (mRMR); Incremental feature selection (IFS); Nearest neighbor algorithm (NNA)
In the post-genomic era, knowledge of post-translational modifications (PTMs) of proteins is crucial for understanding the dynamic proteome and various signaling pathways or networks in cells (Aguilar and Wendland 2003; Saghatelian and Cravatt 2005; Herrmann et al. 2007; Hicke and Dunn 2003; Welchman et al. 2005). One of the most important and universal post-translational modifications, protein ubiquitination is a rapid and reversible biochemical process in which an isopeptide bond forms between the carboxy group of the C-terminal double-glycine of a ubiquitin protein and the ε-amino group of a lysine residue of a substrate protein (Pickart 2001). Ubiquitination regulates a variety of biological processes, such as signal transduction, cell division/mitosis, apoptosis, and endocytosis (Sun and Chen 2004; Reinstein and Ciechanover 2006; Hoeller et al. 2006; Hicke 2001). Aberrations of the ubiquitin–proteasome system (UPS) are associated with numerous diseases, such as inflammatory diseases, neurodegenerative disorders, and cancers (Hoeller et al. 2006; Reinstein and Ciechanover 2006).
Identification of ubiquitination sites in proteins is one of the greatest challenges in gaining a full understanding of the regulatory roles of ubiquitination and the molecular mechanism of the ubiquitin system. Identifying potential ubiquitination sites with conventional experimental approaches, such as site-directed mutagenesis (Lin et al. 2005), antibodies against Ub (anti-Ub) (Gentry et al. 2005), and high-throughput mass spectrometry (MS) (Kirkpatrick et al. 2005), is time-consuming and labor-intensive. In silico algorithms therefore offer a convenient and efficient way to predict ubiquitination sites.
In this work, we developed a new computational method to predict lysine-ubiquitination. Specifically, we used a machine learning approach (Nearest Neighbor Algorithm) combined with feature selection (IFS based on mRMR, Peng et al. 2005a). Twenty-six parameters were used to describe each amino acid of the lysine site and its surrounding ones (from −10 to +10). The 26 parameters can be broken down into 3 categories: 20 position-specific scoring matrices (PSSM) conservation scores, 5 amino acid factors and 1 disorder score. The PSSM score quantifies the conservation status of each site in the protein sequence (Altschul et al. 1997). Amino acid factors were defined by Atchley et al. (2005) through multivariate statistical analyses on AAIndex (Kawashima and Kanehisa 2000) to produce five amino acid factors that reflected polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). Disorder score (Peng et al. 2006) represents the disorder status of each site in the protein sequence. Under physiological conditions, disordered regions in proteins do not have fixed three-dimensional structures, but they play various roles in signaling and regulation.
This study focuses on the computational identification of lysine (K) ubiquitination. The Matthews correlation coefficient (MCC) of lysine (K) ubiquitination site predictions was 0.142 on the training set, evaluated by jackknife cross-validation, and 0.139 on the independent test set. The following features distinguish our study from previous ubiquitination prediction models (Radivojac et al. 2010; Tung and Ho 2008): (1) a larger benchmark dataset was used, (2) the feature set was much smaller and more compact, (3) jackknife cross-validation and an independent test were used to evaluate the performance of our classifier effectively and objectively, (4) the applied prediction model, the nearest neighbor algorithm, was much simpler and faster than SVM (Tung and Ho 2008) or random forest (Radivojac et al. 2010), both of which could easily have introduced over-fitting problems, and (5) on the independent test our model performed better than two existing predictors, UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine site plays important roles in ubiquitination site prediction. It also shows that biochemical and physicochemical properties of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related.
Materials and methods
The ubiquitinated protein sequences we used for training come from SysPTM (Li et al. 2009). Peptides containing lysine (K) were extracted as our training samples. According to Tung and Ho (2008), the best window size for ubiquitination site prediction is 21. We therefore adopted their window size and represented each lysine-ubiquitination site with a peptide fragment of 21 residues: the lysine (K) plus 10 residues upstream and 10 residues downstream. The original dataset downloaded from SysPTM has 514 lysine-ubiquitination sites from 349 proteins. To avoid homology bias, we removed redundant sequences among the 349 proteins with the program cd-hit (Li and Godzik 2006), obtaining 273 distinct sequences with pairwise sequence identity below 0.6. We randomly selected 12 proteins to form the independent test set and used the remaining 261 proteins to construct the training set. Since the numbers of ubiquitinated and non-ubiquitinated lysine sites were highly imbalanced, we randomly selected three times as many negative samples (non-ubiquitinated lysine fragments) as positive ones (ubiquitinated lysine fragments) for the training set. In the independent test set, we retained all the positive and negative samples to keep it close to the real situation. There were 364 positive samples and 1,092 negative samples in the training set, and 14 positive samples and 267 negative samples in the independent test set. The benchmark dataset we used was larger than Tung's 157 ubiquitination sites (Tung and Ho 2008) or Radivojac's 272 ubiquitinated fragments (Radivojac et al. 2010). Both the positive and negative lysine samples for training and independent test can be found in Dataset S1.
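The fragment extraction and the 3:1 negative sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the padding character used near the sequence termini, and the random seed are all assumptions, since the text does not say how lysines closer than 10 residues to a terminus were handled.

```python
import random


def lysine_windows(sequence, flank=10, pad="X"):
    """Return (1-based position, fragment) pairs for every lysine.

    Each fragment has 2 * flank + 1 residues centred on the lysine.
    Padding with `pad` near the termini is an assumption of this sketch.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in `sequence` is index i + flank in `padded`,
            # so the slice below is centred on the lysine.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def subsample_negatives(positives, negatives, ratio=3, seed=0):
    """Randomly keep `ratio` negatives per positive (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, k)
```

With 364 positive fragments, a ratio of 3 gives the 1,092 negative training fragments reported in the text.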
PSSM conservation scores
Evolutionary conservation usually indicates an important biological function. If an amino acid at a particular site of a protein is conserved, it may be located in a functionally important region of the protein.
Position-specific iterated (PSI)-BLAST (Altschul et al. 1997) can measure residue conservation at a given location. Each residue can be represented by a 20-dimensional vector that gives the probabilities of conservation against mutation to each of the 20 amino acids. A position-specific scoring matrix (PSSM) (Ahmad and Sarai 2005) is a matrix in which each row is such a 20-dimensional vector, and the rows of the matrix correspond to the residues in the protein sequence. If a residue is conserved according to PSI-BLAST, it is more likely to be biologically important and may be ubiquitinated. In this study, we encoded the conservation status of each amino acid in the protein sequence with its PSSM conservation score. The program “blastpgp” (PSI-BLAST) downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast was used to calculate the PSSM conservation score with three iterations (−j 3) and an e-value threshold of 0.0001 for inclusion in the multipass model (−h 0.0001).
Amino acid factors
Atchley et al. (2005) performed multivariate statistical analyses on AAIndex (Kawashima and Kanehisa 2000), a database of various physicochemical and biochemical properties of amino acids, to produce five multidimensional patterns of attribute covariation reflecting polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). These five transformed scores (called “amino acid factors” here) have been used successfully on several difficult biological problems, such as the identification of deleterious non-synonymous SNPs (Huang et al. 2010b) and B cell epitope prediction (Rubinstein et al. 2009). Here, we used these five amino acid factors to encode each amino acid in the lysine fragment.
Under physiological conditions, disordered regions in proteins do not have fixed three-dimensional structures, but they play various roles in signaling and regulation through binding to multiple protein partners and through high-specificity, low-affinity interactions (Sickmeier et al. 2007). In this study, we encoded the disorder status of each amino acid in the protein sequence with the disorder score calculated by VSL2 (Peng et al. 2006). The VSL2 predictors can accurately identify disordered regions of any length, especially short disordered regions. The disorder scores of the lysine site and its surrounding amino acids formed the disorder features.
The feature space
The lysine (K) ubiquitination site was encoded by 20 PSSM conservation scores and 1 disorder score, in total 21 features. Each of its surrounding amino acids (10 residues upstream and 10 residues downstream) was encoded by 26 features, including 20 PSSM conservation scores, 5 amino acid factors, and 1 disorder score. Overall, each sample was represented by 26 × 20 + 21 = 541 features.
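The feature layout can be checked with a short sketch that enumerates one label per feature. The label format and the ordering of features within the vector are assumptions made for illustration; only the counts come from the text. The central residue is always lysine, so its amino acid factors would be identical in every sample, which is presumably why they are not included.

```python
N_PSSM, N_AAF, FLANK = 20, 5, 10


def feature_names(flank=FLANK):
    """Enumerate hypothetical labels for every feature of one lysine fragment."""
    names = []
    for offset in range(-flank, flank + 1):
        pos = f"{offset:+d}" if offset else "0"
        names += [f"PSSM{j}@{pos}" for j in range(1, N_PSSM + 1)]
        if offset != 0:
            # Amino acid factors are only used for the 20 flanking residues.
            names += [f"AAFactor{j}@{pos}" for j in range(1, N_AAF + 1)]
        names.append(f"Disorder@{pos}")
    return names


names = feature_names()
print(len(names))                                      # total number of features
print(sum(n.startswith("AAFactor") for n in names))    # amino acid factor features
```

The totals are 20 × 26 + 21 = 541 features overall, of which 5 × 20 = 100 are amino acid factor features.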
The feature index h indicates the importance of a feature in the mRMR ranking: the better a feature is, the smaller its index h.
Nearest neighbor algorithm
In NNA, a query vector is assigned the class of its nearest neighbor, that is, the training sample of known class that has the smallest distance to the query.
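A minimal sketch of this rule is given below. The distance measure is not stated in this section, so Euclidean distance is used here purely as an assumption, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor_predict(query, train_X, train_y, distance=euclidean):
    """Return the class label of the training sample closest to `query`."""
    nearest = min(range(len(train_X)), key=lambda i: distance(query, train_X[i]))
    return train_y[nearest]
```

Passing a different `distance` function swaps the metric without changing the decision rule.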
Jackknife cross-validation and independent test
Incremental feature selection (IFS)
After mRMR ranks the features by importance, it is still unknown how many of the top-ranked features in the list should be chosen. To identify the optimal number of features, incremental feature selection (IFS) (Huang et al. 2009, 2010a; Cai et al. 2010) was used.
For each of the N candidate feature sets (the top 1, top 2, ..., top N features of the mRMR list), an NNA predictor was constructed and tested by jackknife cross-validation on the training set. With the MCC of each jackknife cross-validation calculated, we obtain an IFS table listing the number of features and the corresponding performance. S_optimal is the feature set that achieves the highest MCC.
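The IFS loop can be sketched as below, assuming binary labels (1 for ubiquitinated, 0 for not), a nearest neighbor classifier with squared Euclidean distance (the metric is an assumption), and jackknife (leave-one-out) evaluation. All function names are hypothetical, and ties in MCC are resolved in favor of the smaller feature set, which is a design choice of this sketch rather than something stated in the text.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 if undefined)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def jackknife_predictions(X, y):
    """Predict each sample from its nearest neighbor among all other samples."""
    preds = []
    for i, query in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: sq_dist(query, X[j]))
        preds.append(y[nearest])
    return preds


def incremental_feature_selection(X, y, ranked):
    """Evaluate the top-1, top-2, ... features of `ranked`; return the best set."""
    table = []
    for n in range(1, len(ranked) + 1):
        cols = ranked[:n]
        Xn = [[row[c] for c in cols] for row in X]
        table.append((n, mcc(y, jackknife_predictions(Xn, y))))
    best_n, best_mcc = max(table, key=lambda r: r[1])  # first maximum wins
    return ranked[:best_n], best_mcc, table
```

For 541 features and 1,456 training samples this naive version is slow, since every jackknife run compares all pairs of samples; it is meant only to make the procedure concrete.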
Using the program “mRMR” (Peng et al. 2005b) downloaded from http://penglab.janelia.org/proj/mRMR, we obtained the ranked mRMR list of the 541 features. A smaller feature index indicates a more important role in discriminating positive samples from negative ones. The mRMR list was used in the IFS procedure for feature selection and analysis.
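To make the ranking criterion concrete, the following is a simplified sketch of a greedy maximum relevance, minimum redundancy ordering. It is not the mRMR program cited above: it assumes features that are already discretized, uses the difference between relevance and mean redundancy as the score, and all names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr_rank(columns, target):
    """Greedily order feature columns by relevance minus mean redundancy."""
    relevance = [mutual_information(col, target) for col in columns]
    remaining = list(range(len(columns)))
    ranked = []
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in ranked) / len(ranked)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In this scheme a feature that duplicates one already selected is pushed down the list even if it is highly relevant on its own, which is the behavior the minimum redundancy term is meant to produce.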
Based on the outputs of mRMR, we built 541 individual predictors for the 541 sub-feature sets to predict the lysine-ubiquitination sites. As described in the “Materials and methods”, we tested the predictors with one feature, two features, three features, etc., and obtained the IFS result which can be found in Table S1.
Independent test and comparison with other methods
We tested our model on an independent dataset containing 14 positive samples and 267 negative samples. The MCC of our method in the independent test was 0.139. We also ran two existing ubiquitination site predictors on the same independent set: UbiPred (Tung and Ho 2008) and UbPred (Radivojac et al. 2010). Their MCCs were 0.135 and 0.117, respectively. Our model thus performed better than both UbiPred and UbPred on an independent test set in which the positive and negative samples are highly imbalanced, as in the real situation.
The distribution of the optimized feature set
The number of amino acid factor features in the optimal feature set was 100, which means that all 5 × 20 = 100 amino acid factor features were selected and that all five amino acid factors were equally important.
Proteins are targeted for degradation by the covalent ligation to ubiquitin, a small 76-amino acid residue protein. Ubiquitination of target substrates is a highly collaborative process involving a three-step cascade mechanism between the ubiquitin-activating enzyme (E1), ubiquitin-conjugating enzymes (E2), and ubiquitin ligases (E3) (Hershko and Ciechanover 1998).
Within the selected physicochemical property parameters, we show that polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5) play similar roles in the selection of ubiquitination sites. The most pronounced feature of Ub sites is the abundance of charged and polar amino acids, especially negatively charged D and E, and the depletion of hydrophobic residues, such as L, I, F, and P, around Ub sites (Nonaka et al. 2005; Radivojac et al. 2010). These parameters are highly related to electrostatic charge and amino acid composition in the adjacent sequence. The known E3 enzymes can be separated into two protein families: HECT domain and RING E3s. The crystal structures of these complexes reveal extraordinary specificity of interaction by a small set of loops at the end of the UbcH7 β-sheet (a subset of secondary structure) (Zheng et al. 2000; Huang et al. 1999). From these results, it is easier to understand how the presence of a few divergent surface residues could modulate the catalytic properties of ubiquitination. The similar positions of the three substrate-binding domains support the view that RING E3s promote ubiquitin transfer by positioning the substrate such that the lysine is optimally placed relative to the E2 active site (Zheng et al. 2002; Schulman et al. 2000). The spacing between the destruction motif and the ubiquitin-acceptor lysine residue has been identified as a parameter that affects the rate of substrate ubiquitination, further supporting the positioning model (Wu et al. 2003). These structural analyses emphasize the importance of secondary structure and molecular size or volume to the ubiquitination process.
The relationship between ubiquitination and protein disorder is complex and remains unclear, but researchers have observed that the percentage of residues predicted as possible ubiquitination sites increases with increasing amounts of disorder (Edwards et al. 2009). A large proportion of disordered proteins are highly expressed in many tissues (Edwards et al. 2009). These proteins may have a higher chance of degradation, as they are likely to have a higher density of ubiquitination sites.
Although much knowledge about ubiquitination has accumulated to date, it is difficult to assume that all substrates carry a similar preexisting structure before they bind to the components of the ubiquitination machinery. Here, we examined sequence and structural preferences of all available ubiquitination sites and showed that they favor particular physicochemical properties. Regulated protein targeting and turnover through the ubiquitin–proteasome system underlies a host of critical physiological and pathological states in humans. The ability to modulate the individual steps in the ubiquitination pathway may offer therapeutic strategies in the future.
A novel sequence-based predictor was developed for identifying ubiquitination at lysine sites. With the IFS feature selection procedure based on mRMR analysis, the predictor achieved an MCC of 0.142 in a jackknife cross-validation test on the benchmark dataset. In the independent test, the MCC of our predictor was 0.139, higher than that of the existing ubiquitination site prediction tools UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine plays important roles in ubiquitination site prediction. It also shows that the electrostatic charge, molecular volume, secondary structure, codon diversity, and polarity of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related. Although the results reported here are quite encouraging, the present study is merely a preliminary one. Further investigation is needed to clarify the predicted relationship between conservation, disorder and ubiquitination.
The authors acknowledge Yvonne Poindexter at the Vanderbilt University Cancer Biostatistics Center for her editing. This work was supported by grants from National High-Tech R&D Program of China (863 Program) (2006AA02Z334, 2007DFA31040), China National Key Projects for Infectious Disease (2008ZX10002-021), National Basic Research Program of China (2006CB910700), National Natural Science Foundation of China (Grant No. 31070752) and Key Research Program (CAS) (KSCX2-YW-R-112).