Amino Acids, Volume 42, Issue 4, pp 1387–1395

Prediction of lysine ubiquitination with mRMR feature selection and analysis

Authors

    • Institute of Systems Biology, Shanghai University
    • Centre for Computational Systems Biology, Fudan University
  • Tao Huang
    • Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
    • Shanghai Center for Bioinformation Technology
  • Lele Hu
    • Institute of Systems Biology, Shanghai University
  • Xiaohe Shi
    • Singapore Bioimaging Consortium, Agency for Science, Technology and Research
  • Lu Xie
    • Shanghai Center for Bioinformation Technology
    • Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
    • Shanghai Center for Bioinformation Technology
Original Article

DOI: 10.1007/s00726-011-0835-0

Cite this article as:
Cai, Y., Huang, T., Hu, L. et al. Amino Acids (2012) 42: 1387. doi:10.1007/s00726-011-0835-0

Abstract

Ubiquitination, one of the most important post-translational modifications of proteins, occurs when ubiquitin (a small 76-amino acid protein) is attached to a lysine on a target protein. It often commits the labeled protein to degradation and plays important roles in regulating many cellular processes implicated in a variety of diseases. Since ubiquitination is rapid and reversible, it is time-consuming and labor-intensive to identify ubiquitination sites using conventional experimental approaches. To efficiently discover lysine-ubiquitination sites, a sequence-based predictor of ubiquitination sites was developed based on the nearest neighbor algorithm. We used the maximum relevance and minimum redundancy principle to identify the key features and the incremental feature selection procedure to optimize the prediction engine. PSSM conservation scores, amino acid factors and disorder scores of the surrounding sequence formed the optimized set of 456 features. The Matthews correlation coefficient (MCC) of our ubiquitination site predictor reached 0.142 in a jackknife cross-validation test on a large benchmark dataset. In an independent test, the MCC of our method was 0.139, higher than those of the existing ubiquitination site predictors UbiPred and UbPred, whose MCCs on the same test set were 0.135 and 0.117, respectively. Our analysis shows that the conservation of amino acids at and around the lysine plays an important role in ubiquitination site prediction. Moreover, disorder and ubiquitination are strongly related. These findings might provide useful insights for studying the mechanisms of ubiquitination and modulating the ubiquitination pathway, which may lead to therapeutic strategies in the future.

Keywords

Ubiquitination; Maximum relevance and minimum redundancy (mRMR); Incremental feature selection (IFS); Nearest neighbor algorithm (NNA)

Introduction

In the post-genomic era, knowledge of post-translational modifications (PTMs) of proteins is crucial for understanding the dynamic proteome and various signaling pathways or networks in cells (Aguilar and Wendland 2003; Saghatelian and Cravatt 2005; Herrmann et al. 2007; Hicke and Dunn 2003; Welchman et al. 2005). Protein ubiquitination, one of the most important and universal post-translational modifications, is a rapid and reversible biochemical process in which an isopeptide bond forms between the C-terminal double-glycine carboxy group of a ubiquitin protein and the ε-amino group of a lysine residue of a substrate protein (Pickart 2001). Ubiquitination regulates a variety of biological processes, such as signal transduction, cell division/mitosis, apoptosis, and endocytosis (Sun and Chen 2004; Reinstein and Ciechanover 2006; Hoeller et al. 2006; Hicke 2001). Aberrations in the ubiquitin–proteasome system (UPS) are associated with numerous diseases, such as inflammatory diseases, neurodegenerative disorders, and cancers (Hoeller et al. 2006; Reinstein and Ciechanover 2006).

Identification of ubiquitination sites in proteins is one of the greatest challenges in gaining a full understanding of the regulatory roles of ubiquitination and the molecular mechanism of the ubiquitin system. It is time-consuming and labor-intensive to identify potential ubiquitination sites with conventional experimental approaches, such as site-directed mutagenesis (Lin et al. 2005), anti-ubiquitin antibodies (anti-Ub) (Gentry et al. 2005), and high-throughput mass spectrometry (MS) (Kirkpatrick et al. 2005). In silico prediction of ubiquitination sites is therefore a convenient and efficient alternative.

In this work, we developed a new computational method to predict lysine-ubiquitination. Specifically, we used a machine learning approach (Nearest Neighbor Algorithm) combined with feature selection (IFS based on mRMR, Peng et al. 2005a). Twenty-six parameters were used to describe each amino acid of the lysine site and its surrounding ones (from −10 to +10). The 26 parameters can be broken down into 3 categories: 20 position-specific scoring matrices (PSSM) conservation scores, 5 amino acid factors and 1 disorder score. The PSSM score quantifies the conservation status of each site in the protein sequence (Altschul et al. 1997). Amino acid factors were defined by Atchley et al. (2005) through multivariate statistical analyses on AAIndex (Kawashima and Kanehisa 2000) to produce five amino acid factors that reflected polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). Disorder score (Peng et al. 2006) represents the disorder status of each site in the protein sequence. Under physiological conditions, disordered regions in proteins do not have fixed three-dimensional structures, but they play various roles in signaling and regulation.

This study focuses on the computational identification of lysine (K) ubiquitination. The Matthews correlation coefficient (MCC) of the lysine (K) ubiquitination site predictions was 0.142 on the training set, evaluated by jackknife cross-validation, and 0.139 on the independent test set. The following features distinguish our study from previous ubiquitination prediction models (Radivojac et al. 2010; Tung and Ho 2008): (1) a larger benchmark dataset was used, (2) the feature set was much smaller and more compact, (3) both jackknife cross-validation and an independent test were used to evaluate the performance of our classifier objectively, (4) the nearest neighbor algorithm used as the prediction model is much simpler and faster than SVM (Tung and Ho 2008) or random forest (Radivojac et al. 2010), both of which could easily introduce over-fitting, and (5) on the independent test our model performed better than two existing predictors, UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine site plays an important role in ubiquitination site prediction. It also shows that biochemical and physicochemical properties of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related.

Materials and methods

Dataset

The ubiquitinated protein sequences we used for training come from SysPTM (Li et al. 2009). Peptides containing lysine (K) were extracted as our training samples. According to Tung and Ho (2008), the best window size for ubiquitination site prediction is 21. We therefore adopted their window size and represented each lysine-ubiquitination site by a peptide fragment of 21 residues, with 10 residues upstream and 10 residues downstream of the lysine (K). The original dataset downloaded from SysPTM has 514 lysine-ubiquitination sites from 349 proteins. To avoid homology bias, we removed redundancy among the 349 protein sequences with the program cd-hit (Li and Godzik 2006) and obtained 273 distinct sequences with pairwise sequence identity lower than 0.6. We randomly selected 12 proteins to form the independent test set and used the remaining 261 proteins to construct the training set. Since the numbers of ubiquitinated and non-ubiquitinated lysine sites were highly imbalanced, we randomly selected three times as many negative samples (non-ubiquitinated lysine fragments) as positive ones (ubiquitinated lysine fragments) for the training set. In the independent test set, we retained all the positive and negative samples to keep it close to the real situation. There were 364 positive samples and 1,092 negative samples in the training set, and 14 positive samples and 267 negative samples in the independent test set. The benchmark dataset we used was larger than Tung's 157 ubiquitination sites (Tung and Ho 2008) or Radivojac's 272 ubiquitinated fragments (Radivojac et al. 2010). Both the positive and negative lysine samples for training and independent test can be found in Dataset S1.
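The fragment extraction and the 3:1 negative sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the padding character `X` used near the protein termini, and the fixed random seed are all assumptions, since the text does not say how lysines closer than 10 residues to a terminus were handled.

```python
import random


def lysine_windows(sequence, flank=10, pad="X"):
    """Return (1-based position, fragment) for every lysine in `sequence`.

    Each fragment has 2 * flank + 1 residues centred on the lysine.
    Padding with `pad` near the termini is an assumption of this sketch.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + flank of `padded`,
            # so the window starts at index i and spans 2 * flank + 1 residues.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def sample_negatives(negatives, n_positive, ratio=3, seed=0):
    """Randomly keep `ratio` negatives per positive (all of them if too few)."""
    k = min(len(negatives), ratio * n_positive)
    return random.Random(seed).sample(negatives, k)
```

With `flank=10` every fragment has 21 residues and the lysine sits at index 10 of the fragment, matching the window size adopted in the text.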

Feature construction

PSSM conservation scores

Evolutionary conservation usually indicates an important biological function. If an amino acid at a particular site of a protein is conserved, it may be located in a functionally important region of the protein.

Position-specific iterated (PSI)-BLAST (Altschul et al. 1997) can measure the residue conservation at a given position. Each residue can be represented by a 20-dimensional vector that gives the probabilities of conservation against mutations to the 20 different amino acids. A position-specific scoring matrix (PSSM) (Ahmad and Sarai 2005) is a matrix in which each row is such a 20-dimensional vector and the rows correspond to the residues in the protein sequence. If a residue is conserved according to PSI-BLAST, it is likely to be biologically important and, potentially, ubiquitinated. In this study, we encoded the conservation status of each amino acid in the protein sequence with the PSSM conservation score. The program “blastpgp” (PSI-BLAST) downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast was used to calculate the PSSM conservation score with three iterations (-j 3) and an e-value threshold of 0.0001 for inclusion in the multipass model (-h 0.0001).

Amino acid factors

Atchley et al. (2005) performed multivariate statistical analyses on AAIndex (Kawashima and Kanehisa 2000), a database of various physicochemical and biochemical properties of amino acids, to produce five multidimensional patterns of attribute covariation reflecting polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). These five transformed scores (called “amino acid factors” here) have been used successfully on several difficult biological problems, such as deleterious non-synonymous SNP identification (Huang et al. 2010b) and B cell epitope prediction (Rubinstein et al. 2009). Here, we used these five amino acid factors to encode each amino acid in the lysine fragment.

Disorder score

Under physiological conditions, disordered regions in proteins do not have fixed three-dimensional structures, but they play various roles in signaling and regulation through binding to multiple proteins and through high-specificity, low-affinity interactions (Sickmeier et al. 2007). In this study, we encoded the disorder status of each amino acid in the protein sequence with the disorder score calculated by VSL2 (Peng et al. 2006). The VSL2 predictors can accurately identify disordered regions of any length, especially short disordered regions. The disorder scores of the lysine site and its surrounding amino acids formed the disorder features.

The feature space

The lysine (K) ubiquitination site was encoded by 20 PSSM conservation scores and 1 disorder score, in total 21 features. Each of its surrounding amino acids (10 residues upstream and 10 residues downstream) was encoded by 26 features, including 20 PSSM conservation scores, 5 amino acid factors, and 1 disorder score. Overall, each sample was represented by 26 × 20 + 21 = 541 features.
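The arithmetic 26 × 20 + 21 = 541 can be checked with a short encoding sketch. The function name, the per-position array shapes and the order in which the features are concatenated are assumptions made for illustration; the text only fixes which feature types are present at each position.

```python
import numpy as np

FLANK = 10  # residues on each side of the central lysine


def encode_fragment(pssm, aa_factors, disorder):
    """Concatenate per-residue features of one 21-residue fragment.

    pssm:       array of shape (21, 20), PSSM conservation scores
    aa_factors: array of shape (21, 5), amino acid factors
    disorder:   array of shape (21,), disorder scores
    The concatenation order is an assumption of this sketch.
    """
    parts = []
    for pos in range(2 * FLANK + 1):
        parts.append(pssm[pos])
        if pos != FLANK:
            # The central lysine carries no amino acid factors in the
            # feature space described in the text (20 + 1 = 21 features).
            parts.append(aa_factors[pos])
        parts.append(disorder[pos:pos + 1])
    return np.concatenate(parts)


vector = encode_fragment(np.zeros((21, 20)), np.zeros((21, 5)), np.zeros(21))
print(vector.shape)  # (541,)
```

The 20 flanking residues contribute 26 features each (520 in total) and the central lysine contributes 21, giving the 541 features per sample.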

mRMR method

In this study, we applied the maximum relevance and minimum redundancy (mRMR) method (Peng et al. 2005b) to analyze the importance of different features. Each feature is ranked according to its relevance to the target variable, and the ranking process takes the redundancy among features into account at the same time. A “good” feature is defined as one that has the best trade-off between minimum redundancy with the other features and maximum relevance to the target variable. Mutual information (MI), which measures the mutual dependence of two variables, is used to quantify both relevance and redundancy in this method. MI is defined as follows:
$$ I(X,Y) = \iint p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\,dx\,dy $$
(1)
where X and Y are vectors, p(x, y) is the joint probability density, and p(x) and p(y) are the marginal probability densities. Given M data points (xi, yi), i = 1,…, M, drawn from the joint probability distribution, the joint and marginal densities can be estimated by the Gaussian kernel estimator as follows (Beirlant et al. 1997; Qiu et al. 2009):
$$ p(x,y) = \frac{1}{M}\sum\limits_{i = 1}^{M} \frac{1}{2\pi h^{2}} e^{ - \frac{1}{2h^{2}}\left( (x - x_{i})^{2} + (y - y_{i})^{2} \right)} $$
(2)
$$ p(x) = \frac{1}{M}\sum\limits_{i = 1}^{M} \frac{1}{\sqrt{2\pi h^{2}}} e^{ - \frac{1}{2h^{2}}(x - x_{i})^{2}} $$
(3)
$$ p(y) = \frac{1}{M}\sum\limits_{i = 1}^{M} \frac{1}{\sqrt{2\pi h^{2}}} e^{ - \frac{1}{2h^{2}}(y - y_{i})^{2}} $$
(4)
where h is a tuning parameter that controls the width of the kernels.
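A minimal numerical sketch of Eqs. 1–4 is given below. The kernel densities follow Eqs. 2–4 directly, but replacing the double integral of Eq. 1 by an average over the M sample points (a resubstitution estimate) is an assumption of this sketch, as are the function name and the default kernel width; the mRMR program used in this study may compute MI differently.

```python
import numpy as np


def kernel_mutual_information(x, y, h=0.5):
    """Estimate I(X, Y) for two 1-D samples with Gaussian kernels of width h."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared differences between all sample points.
    dx2 = (x[:, None] - x[None, :]) ** 2
    dy2 = (y[:, None] - y[None, :]) ** 2
    norm = np.sqrt(2.0 * np.pi * h ** 2)
    kx = np.exp(-dx2 / (2.0 * h ** 2)) / norm
    ky = np.exp(-dy2 / (2.0 * h ** 2)) / norm
    p_x = kx.mean(axis=1)          # Eq. 3 evaluated at each x_i
    p_y = ky.mean(axis=1)          # Eq. 4 evaluated at each y_i
    p_xy = (kx * ky).mean(axis=1)  # Eq. 2: the product kernel is 1/(2*pi*h^2) * exp(...)
    # Assumption: the integral in Eq. 1 is approximated by the sample average.
    return float(np.mean(np.log(p_xy / (p_x * p_y))))


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
print(kernel_mutual_information(a, a) > kernel_mutual_information(a, b))  # True
```

The estimate depends on h: a very small h makes every point look isolated, while a very large h smooths away any dependence.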
Let Ω denote the whole feature set, while Ωs denotes the already-selected feature set which contains m features and Ωt denotes the to-be-selected feature set which contains n features. Relevance D of the feature f in Ωt with the target c can be calculated by:
$$ D = I(f,c). $$
(5)
And redundancy R of the feature f in Ωt with all the features in Ωs can be calculated by:
$$ R = \frac{1}{m}\sum\limits_{f_{i} \in \Omega_{\text{s}}} I(f,f_{i}). $$
(6)
To obtain the feature fj in Ωt with maximum relevance and minimum redundancy, Eqs. 5 and 6 are combined with the mRMR function:
$$ \max\limits_{f_{j} \in \Omega_{\text{t}}} \left[ I(f_{j},c) - \frac{1}{m}\sum\limits_{f_{i} \in \Omega_{\text{s}}} I(f_{j},f_{i}) \right] \quad (j = 1,2,\ldots,n). $$
(7)
For a feature set with N(=m + n) features, the feature evaluation will continue N rounds. After these evaluations, we will get a feature set S by mRMR method:
$$ S = \left\{ f'_{1}, f'_{2}, \ldots, f'_{h}, \ldots, f'_{N} \right\}. $$
(8)

The feature index h indicates the importance of a feature: the better a feature is, the smaller its index h. This index is unrelated to the kernel width h in Eqs. 2–4.
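The greedy ranking of Eqs. 5–8 can be sketched from precomputed mutual information values. The function name and the input layout are hypothetical. Choosing the most relevant feature in the first round, when Ωs is still empty and the redundancy term of Eq. 7 is undefined, is the usual convention and is an assumption here.

```python
import numpy as np


def mrmr_rank(relevance, redundancy):
    """Rank features by maximum relevance and minimum redundancy.

    relevance[j]     = I(f_j, c), relevance of feature j to the target
    redundancy[j, i] = I(f_j, f_i), mutual information between two features
    Returns feature indices ordered from most to least important.
    """
    relevance = np.asarray(relevance, dtype=float)
    redundancy = np.asarray(redundancy, dtype=float)
    # Round 1: no feature selected yet, so only relevance counts (assumption).
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(len(relevance)) if j != selected[0]]
    while remaining:
        # Eq. 7: relevance minus mean redundancy with the selected features.
        scores = [relevance[j] - redundancy[j, selected].mean() for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy values: features 0 and 1 are both relevant but nearly duplicate each other.
rel = [0.9, 0.8, 0.1]
red = [[1.0, 0.85, 0.0],
       [0.85, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
print(mrmr_rank(rel, red))  # [0, 2, 1]
```

In the toy example the weakly relevant but non-redundant feature 2 is ranked ahead of feature 1, which illustrates how the redundancy term changes the order compared with ranking by relevance alone.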

Nearest neighbor algorithm

We used the nearest neighbor algorithm (NNA) to build the prediction model. NNA makes its decision by calculating similarities between the test sample and all the training samples. In our study, the distance between vector A = (a1, a2,…, an) and B = (b1, b2,…, bn) is defined as follows (Qian et al. 2006; Huang et al. 2009, 2010a; Cai et al. 2010):
$$ D(A,B) = 1 - \frac{\sum\limits_{i = 1}^{n} a_{i} b_{i}}{\sqrt{\sum\limits_{i = 1}^{n} a_{i}^{2}} \sqrt{\sum\limits_{i = 1}^{n} b_{i}^{2}}}. $$
(9)

In NNA, the query vector is assigned to the class of its nearest neighbor, that is, the training sample of known class that has the smallest distance to the query.
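A minimal sketch of the nearest neighbor rule with the distance of Eq. 9 follows. The function names are hypothetical, and the sketch assumes that no vector is all zeros, since Eq. 9 is undefined in that case.

```python
import numpy as np


def cosine_distance(a, b):
    """Eq. 9: one minus the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training sample closest to `query`."""
    distances = [cosine_distance(x, query) for x in train_x]
    return train_y[int(np.argmin(distances))]


train_x = [[1.0, 0.0], [0.0, 1.0]]
train_y = [1, 0]  # 1 = ubiquitinated, 0 = non-ubiquitinated
print(nearest_neighbor_predict(train_x, train_y, [0.9, 0.1]))  # 1
```

Because the distance depends only on the angle between vectors, rescaling a whole feature vector does not change the prediction.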

Jackknife cross-validation and independent test

We used jackknife cross-validation (Li et al. 2007; Cai et al. 2009; Huang et al. 2008) to evaluate the performance of our classifier on the training set. In jackknife cross-validation, every sample is tested by the predictor trained on all the other samples. In addition to the jackknife cross-validation on the training set, we also tested our model on an independent test set. Since the positive and negative samples are highly imbalanced in both the training set and the independent test set, the Matthews correlation coefficient (MCC) (Baldi et al. 2000) was used to evaluate the prediction performance. It is defined as
$$ \text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FN})}} $$
(10)
where TP, TN, FP and FN stand for the numbers of true positives, true negatives, false positives and false negatives, respectively.

Because it takes both sensitivity and specificity into account, MCC is considered a balanced measure for imbalanced data (Baldi et al. 2000; Han et al. 2008).

Sensitivity (Sn), specificity (Sp) and accuracy (ACC), defined as follows, were also calculated:
$$ S_{n} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
(11)
$$ S_{p} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$
(12)
$$ \text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$
(13)
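Equations 10–13 translate directly into code. In the sketch below the function name is hypothetical, returning an MCC of 0 when the denominator of Eq. 10 vanishes is a common convention rather than something stated in the text, and the counts in the example are made up for illustration and are not results of this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute MCC, Sn, Sp and ACC (Eqs. 10-13) from confusion-matrix counts.

    Assumes both classes are present, so Sn and Sp are well defined.
    """
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention (assumption): MCC is 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "MCC": mcc,
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }


# Illustrative counts only: 10 positives and 90 negatives.
metrics = classification_metrics(tp=5, tn=80, fp=10, fn=5)
print(metrics["ACC"], metrics["Sn"])  # 0.85 0.5
```

In this example the accuracy is 0.85 although only half of the positives are recovered, which is why MCC is preferred over ACC on imbalanced data.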

Incremental feature selection (IFS)

After mRMR ranks the features according to their importance, it is still unknown how many of the top-ranked features should be chosen. To identify the optimal number of features, incremental feature selection (IFS) (Huang et al. 2009, 2010a; Cai et al. 2010) was used.

In IFS, the ranked features are added one by one, from higher to lower rank, and a separate predictor is built for each resulting feature set. Each added feature yields a new feature set, so N feature sets are obtained, where N is the number of features, and the ith feature set is:
$$ S_{i} = \{ f_{1} , f_{2} , \ldots ,f_{i} \} (1 \le i \le N). $$

For each of the N feature sets, an NNA predictor was constructed and tested by jackknife cross-validation on the training set. From the MCC values of the jackknife cross-validation, we obtain an IFS table that lists the number of features and the corresponding performance. Soptimal is the feature set that achieves the highest MCC.
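The IFS loop with jackknife evaluation can be sketched as below, reusing the distance of Eq. 9 and the MCC of Eq. 10. All function names are hypothetical, the handling of all-zero rows and of a zero MCC denominator are assumptions, and ties in the maximum MCC are resolved in favour of the smaller feature set, which the text does not specify.

```python
import math
import numpy as np


def jackknife_nn_predictions(x, y):
    """Predict each sample from its nearest neighbor among all other samples."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: all-zero rows get distance 1 to everything
    unit = x / norms
    dist = 1.0 - unit @ unit.T       # Eq. 9 for every pair of samples
    np.fill_diagonal(dist, np.inf)   # a sample may not be its own neighbor
    return y[np.argmin(dist, axis=1)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient (Eq. 10) for 0/1 labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def incremental_feature_selection(x, y, ranked):
    """Return (optimal number of features, jackknife MCC of each feature set S_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    scores = []
    for i in range(1, len(ranked) + 1):
        subset = x[:, ranked[:i]]  # S_i: the i top-ranked features
        scores.append(mcc(y, jackknife_nn_predictions(subset, y)))
    best_size = int(np.argmax(scores)) + 1  # first maximum wins on ties
    return best_size, scores
```

The pairwise distance matrix is recomputed for each of the N feature sets, so the cost grows roughly with N times the square of the number of training samples; for 541 features and 1,456 training fragments this remains tractable.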

Results

mRMR result

Using the program “mRMR” (Peng et al. 2005b) downloaded from http://penglab.janelia.org/proj/mRMR, we obtained the ranked mRMR list of 541 features. A smaller index in this list indicates that the feature plays a more important role in discriminating positive samples from negative ones. The mRMR list was used in the IFS procedure for feature selection and analysis.

IFS result

Based on the outputs of mRMR, we built 541 individual predictors for the 541 sub-feature sets to predict the lysine-ubiquitination sites. As described in the “Materials and methods”, we tested the predictors with one feature, two features, three features, etc., and obtained the IFS result which can be found in Table S1.

Figure 1 shows the IFS curve plotted from Table S1. The highest MCC was 0.142, obtained when 456 features were used, so these 456 features were taken as the optimal feature set of our classifier. The 456 optimal features are given in Table S2.
Fig. 1

The IFS curve of predictors. In the IFS curve, the x-axis is the number of features and the y-axis is the MCC of jackknife cross-validation. The highest MCC was 0.142 when 456 features were used. So these 456 features were considered as the optimal feature set of our classifier

Independent test and comparison with other methods

We tested our model on an independent dataset containing 14 positive samples and 267 negative samples. The MCC of our method on this independent test set was 0.139. We also ran two existing ubiquitination site predictors, UbiPred (Tung and Ho 2008) and UbPred (Radivojac et al. 2010), on the same independent test set; their MCCs were 0.135 and 0.117, respectively. Our model thus performed better than both UbiPred and UbPred on an independent test set in which the positive and negative samples are highly imbalanced and close to the real situation.

The distribution of the optimized feature set

As described in the “Materials and methods”, there were three kinds of features: PSSM conservation scores, amino acid factors and disorder scores. The number of features of each type in the optimal feature set is shown in Fig. 2a, and the number of features at each site is shown in Fig. 2b. Among the 456 optimized features, there were 100 amino acid factor features, 8 disorder score features and 348 PSSM conservation score features. This suggests that conservation plays an important role in ubiquitination site prediction. Similar evolutionary information, exploited through position-specific scoring matrices (PSSMs), was also used in two previous prediction models of ubiquitination (Radivojac et al. 2010; Tung and Ho 2008).
Fig. 2

The number of each type or each site of features in optimal feature set. a The number of each type of features in optimal feature set. There were 100 amino acid factor features, 8 disorder score features and 348 PSSM conservation score features. b The number of each site of features in optimal feature set. From 10 residues upstream to 10 residues downstream (“AA1”, “AA2”, …, “AA20”, “AA21”), there were 23, 20, 21, 21, 20, 21, 23, 23, 24, 22, 20, 23, 19, 24, 20, 22, 21, 24, 22, 21 and 22 features, respectively

Since the 348 PSSM conservation score features account for a large proportion of the 456 optimized features, we investigated the number of PSSM features for each kind of amino acid (Fig. 3a) and at each site (Fig. 3b). The conservation of the lysine site (AA11) was the most important for ubiquitination, and there were more PSSM conservation score features at the nearby sites AA7, AA8, AA9, AA12 and AA14 and at the remote sites AA1, AA18, AA19 and AA21 than at the others. The importance of the remote sites may explain why Tung and Ho (2008) found that the proper window size for ubiquitination site prediction is 21. In addition, the conservation against mutations to the 20 amino acids played different roles. Mutations to amino acids A, C, F, H, I, L, M, S, T, V, W and Y have more influence on ubiquitination than other kinds of mutations.
Fig. 3

The number of each type or each site of PSSM features in optimal feature set. a The number of each type of PSSM features in optimal feature set. b The number of each site of PSSM features in optimal feature set. The conservation of lysine site (AA11) was most important for the ubiquitination, and there were more PSSM conservation score features at nearby site AA7, AA8, AA9, AA12, AA14 and remote site AA1, AA18, AA19, AA21 than others

The number of amino acid factor features in the optimal feature set was 100, which means that all amino acid factor features (5 factors at each of the 20 flanking residues) were selected, suggesting that the five amino acid factors were comparably important.

There were 8 disorder scores selected in the optimal feature set: the disorder scores at sites AA6, AA7, AA8, AA9, AA10, AA14, AA17 and AA18. The disorder score of AA7 ranked first in the mRMR list. This indicates that the disorder status of the amino acids around the ubiquitination site could affect the ubiquitination process. It has been reported that disordered proteins have a greater proportion of predicted ubiquitination sites (Edwards et al. 2009). To better investigate the relationship between disorder and ubiquitination, we averaged the disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments and compared them in Fig. 4. In Fig. 4, the red and blue dots are the means of the disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments, respectively. The width of the error bar represents the standard error of the mean. The ubiquitinated and non-ubiquitinated fragments clearly have very different disorder score patterns: the disorder score at each site in the ubiquitinated fragments is higher than that in the non-ubiquitinated fragments.
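The per-site averaging behind Fig. 4 can be sketched as follows. The function name is hypothetical, and using the sample standard deviation (ddof = 1) for the standard error is an assumption, since the text does not state which estimator was used. The input values in the example are made up.

```python
import numpy as np


def mean_and_sem(scores):
    """Per-site mean and standard error of the mean.

    scores: array of shape (n_fragments, n_sites) of disorder scores.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    # Assumption: sample standard deviation (ddof=1) divided by sqrt(n).
    sem = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])
    return mean, sem


# Two made-up fragments with two sites each.
mean, sem = mean_and_sem([[0.2, 0.4], [0.4, 0.8]])
print(np.round(mean, 3), np.round(sem, 3))  # [0.3 0.6] [0.1 0.2]
```

Applied separately to the ubiquitinated and non-ubiquitinated fragments, this yields the two series of 21 means and error bars that are compared site by site.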
Fig. 4

The disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments. The upper and lower dots were the mean of disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments, respectively. The width of error bar represents the standard error of the mean

Discussion

Proteins are targeted for degradation by the covalent ligation to ubiquitin, a small 76-amino acid residue protein. Ubiquitination of target substrates is a highly collaborative process involving a three-step cascade mechanism between the ubiquitin-activating enzyme (E1), ubiquitin-conjugating enzymes (E2), and ubiquitin ligases (E3) (Hershko and Ciechanover 1998).

Among the selected physicochemical property parameters, we show that polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5) play similar roles in the selection of ubiquitination sites. The most pronounced feature of Ub sites is the abundance of charged and polar amino acids, especially negatively charged D and E, and the depletion of hydrophobic residues, such as L, I, F, and P, around Ub sites (Nonaka et al. 2005; Radivojac et al. 2010). These parameters are highly related to electrostatic charge and amino acid composition in the adjacent sequence. The known E3 enzymes can be separated into two protein families: HECT domain and RING E3s. The crystal structures of E2–E3 complexes reveal extraordinary specificity of interaction through a small set of loops at the end of the UbcH7 β-sheet (a secondary structure element) (Zheng et al. 2000; Huang et al. 1999). From these results, it is easier to understand how the presence of a few divergent surface residues could modulate the catalytic properties of ubiquitination. The similar positions of the three substrate-binding domains support the view that RING E3s promote ubiquitin transfer by positioning the substrate such that the lysine is optimally placed relative to the E2 active site (Zheng et al. 2002; Schulman et al. 2000). The spacing between the destruction motif and the ubiquitin-acceptor lysine residue has also been identified as a parameter that affects the rate of substrate ubiquitination, further supporting the positioning model (Wu et al. 2003). These structural analyses emphasize the importance of secondary structure and molecular size or volume to the ubiquitination process.

The relationship between ubiquitination and protein disorder is complex and remains unclear, but researchers have observed that the percentage of residues predicted as possible ubiquitination sites increases with increasing amounts of disorder (Edwards et al. 2009). A large proportion of disordered proteins are highly expressed in many tissues (Edwards et al. 2009). These proteins may have a higher chance of degradation, as they are likely to have a higher density of ubiquitination sites.

Although much knowledge about ubiquitination has been accumulated to date, it is difficult to assume that all substrates carry a similar preexisting structure before they bind to the components of the ubiquitination machinery. Here, we examine sequence and structural preferences of all available ubiquitination sites and show that they have selected physicochemical property parameters. Regulated protein targeting and turnover through the ubiquitin–proteasome system underlies a host of critical physiological and pathological states in humans. The ability to modulate the individual steps in the ubiquitination pathway offers potential therapeutic strategies in the future.

Conclusion

A novel sequence-based predictor was developed for identifying ubiquitination at lysine sites. With the IFS feature selection procedure based on mRMR analysis, the predictor achieved an MCC of 0.142 in a jackknife cross-validation test on the benchmark dataset. In the independent test, the MCC of our predictor was 0.139, higher than those of the existing ubiquitination site prediction tools UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine plays an important role in ubiquitination site prediction. It also shows that electrostatic charge, molecular volume, secondary structure, codon diversity, and polarity of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related. Although the results reported here are encouraging, the present study is a preliminary one. Further investigation is needed to clarify the predicted relationship between conservation, disorder and ubiquitination.

Acknowledgments

The authors acknowledge Yvonne Poindexter at the Vanderbilt University Cancer Biostatistics Center for her editing. This work was supported by grants from National High-Tech R&D Program of China (863 Program) (2006AA02Z334, 2007DFA31040), China National Key Projects for Infectious Disease (2008ZX10002-021), National Basic Research Program of China (2006CB910700), National Natural Science Foundation of China (Grant No. 31070752) and Key Research Program (CAS) (KSCX2-YW-R-112).

Supplementary material

726_2011_835_MOESM1_ESM.xls (128 kb)
Dataset S1 - Benchmark dataset (XLS 127 kb)
726_2011_835_MOESM2_ESM.xls (74 kb)
Table S1 - The IFS result (XLS 73 kb)
726_2011_835_MOESM3_ESM.xls (67 kb)
Table S2 - The 456 optimal features (XLS 67 kb)

Copyright information

© Springer-Verlag 2011