Introduction

Post-translational modification (PTM) at the cysteine residues is essential to the dynamic functions of proteins. S-sulfenylation is a reversible covalent modification on the thiol group of cysteine residues by hydrogen peroxide, whereby the Cys-SH is oxidized to sulfenic (Cys-SOH), sulfinic (Cys–SO2H) and sulfonic acids (Cys–SO3H). More importantly, these changes contribute strongly to the regulation of protein function under both normal and oxidative stress conditions [14]. To date, a little more than 200 transcription factors, signaling proteins, metabolic enzymes, proteostasis regulators, and cytoskeletal components have been identified as S-sulfenylated proteins [5]. Thus, investigation of S-sulfenylated proteins is crucial to our understanding of the function and regulation of this oxidative post-translational modification at cysteine residues.

Several chemoproteomic approaches have been developed for identifying specific sites in proteins that undergo S-sulfenylation [610]. Nevertheless, these experimental methods are often expensive and time-consuming. Further, the lack of information about S-sulfenylation sites and the multiplicity of redox changes contribute to many false-positive identifications [11]. To facilitate the efficiency of experimental characterization and search for S-sulfenylated peptides, construction of a resource database is necessary, and possible with the development of powerful bioinformatics technologies. In spite of the availability of several published algorithms and public servers that aim to help analyze the reactive and oxidative sites [12, 13] of cysteine residues, there are no confirmed reports about the site-specific information of S-sulfenylated proteins, except for the chemical characterization of over 1000 S-sulfenylation sites on more than 700 proteins in cell culture experiments [11].

The present study (SOHSite) concentrated on the computational identification and characterization of S-sulfenylation sites. To discriminate between S-sulfenylation and non-S-sulfenylation sites, features including amino acid composition (AAC), amino acid pair composition (AAPC), position specific scoring matrix (PSSM), position weight matrix (PWM), amino acid substitution matrix (BLOSUM62), accessible surface area (ASA), and the physicochemical properties of proteins, were examined. Support vector machine [14] (SVM) was used to learn a predictive model from each feature, as well as hybrid combinations of the features. The SVM model with the best predictive performance would be selected based on the result of five-fold cross-validation. An independent testing set was applied to further evaluate the effectiveness of the chosen model. Finally, we investigated both key amino acid and hydrophobic attributes associated with S-sulfenylation sites.

Materials and method

Data collection and preprocessing

Figure 1 presents the analytical flowchart of SOHSite. A majority of the experimental data used in this study was obtained from the Carroll laboratory database, which stores information on experimentally verified S-sulfenylated cysteines in humans based on the newly discovered S-sulfenyl-mediated redox regulation of the transcription factor H1F1A by SIRT6 [11]. As shown in Table 1, the data were composed of 1443 positive and 10521 negative data on 987 S-sulfenylated proteins. An additional set of data was collected from RedoxDB [12], UniProtKB [15] and other published literature. From the RedoxDB database, we obtained 102 S-sulfenylated sites (positive dataset) in 92 proteins, while the rest of the cysteine residues in these proteins were considered as non-S-sulfenylated cysteines (negative dataset). Moreover, properties characterized by X-Ray crystallography [16] were included, resulting in 33 S-sulfenylated and 143 non-S-sulfenylated sites. Finally, a dataset comprised of 17 S-sulfenylated and 97 non-S-sulfenylated cysteines was downloaded from UniProtKB. In order to generate the training dataset, the window length of 2n + 1 was used to extract sequence fragments centering at the S-sulfenylated or non-S-sulfenylated cysteines and containing n upstream, as well as n downstream, flanking amino acids. After extracting the sequence fragments with 2n + 1 window length (n = 10), we randomly categorized the dataset into independent testing set for the evaluation of real performance and training set for the construction of S-sulfenylated site prediction models.

Fig. 1
figure 1

Analytical flowchart of SOHSite including data collection and preprocessing, features extraction and encoding, model construction and evaluation, and independent testing

Table 1 Data statistics of S-sulfenylated and non-S-sulfenylated sites

To choose the best prediction model, a five-fold cross-validation approach was utilized to evaluate the performance of discriminating S-sulfenylated substrate sites from non-S-sulfenylated cysteine residues. Then, the model yielding the best prediction performance was further examined with an independent testing dataset. To prevent overestimation on performance evaluation, homologous sequences in the training and independent testing datasets were removed using 30 % sequence similarity as the filtering threshold, with reference to MASA [17]. As shown in (Additional file 1: Figure S1), two S-sulfenylated proteins with more than 30 % sequence identity were defined as homologous protein sequences. Then, two homologous sequences were specified to re-align the fragment sequences using a window length of 2n + 1, centered on the S-sulfenylated sites using BL2SEQ [18]. For two fragment sequences with 100 % identity, when the S-sulfenylated sites in the two proteins are in the same positions, only one site was kept while the other was discarded. Also, the training data had the higher priority so that if the sequence appeared in two datasets, the homologous fragment would be preserved in the training set and deducted from the testing set. The final positive and negative data consisted of 1145 and 8368 for the training dataset, and 289 and 2108 for the independent testing dataset, respectively.

Features extraction and encoding

In this study, numerous sequence-based features, including amino acid composition (AAC), amino acid pair composition (AAPC), solvent-accessible surface area (ASA), amino acid substitution matrix (BLOSUM62), position-specific scoring matrix (PSSM), positional weighted matrix (PWM) and physicochemical properties (AAindex), were assessed to conduct the best prediction model. After the extraction of sequence fragments with a window size of 21-mer amino acids, each sequence was encoded based on the investigated features. The orthogonal binary coding mechanism is one of the most popular coding methods for transforming amino acids into numeric vectors, called 20-dimensional binary coding (20D) [19]. Each amino acid was represented by a vector with 20 letters. For example, alanine (A) would be encoded as “10000000000000000000” while cysteine (C) would be “01000000000000000000” and so on. For each sequence fragment, the length of feature vectors with a window size of 2n + 1 was set to (2n + 1) x 20 to represent the flanking amino acids surrounding the S-sulfenylation sites. Therefore, there were a total of k vectors {xi, i = 1, 2 …, k} corresponding to the number of k sequence fragments in the training and testing dataset. For a binary classification, the labels +1 and −1 were assigned to the positive and negative data, respectively.

For the representations of amino acid compositions around S-sulfenylation sites, the vectors xi consisted of 21 elements for the amino acid composition (AAC) and 441 elements for the amino acid pair composition (AAPC). The 20 elements were defined as the occurrence frequencies of 20 amino acids in a sequence fragment, while the 400 elements were defined as the occurrence frequencies of 400 amino acid pairs in a sequence fragment. When the sequence fragments at the N- or C-terminus were less than 21-mer, non-existing residues were filled with “X” in the corresponding position. Therefore, a total of 21 types of amino acids and 441 types of amino acid pairs were presented in our setting. Additionally, the BLOcks Substitution Matrix (BLOSUM62) [20] was based on the alignments of amino acid sequences possessing no more than 62 % identity between two peptide sequences with 21 amino acids. With reference to the SulfoSite strategy [21], the position weighted matrix (PWM) of amino acids was determined using the non-homologous training data. The PWM described the frequency of occurrence of amino acids surrounding the S-sulfenylation sites, and was utilized to encode the sequence fragments. Each residue in the training dataset was represented by a matrix of m × w elements, where the window size of 21 was designated by w and the 21 elements including the 20 types of amino acids as well as the terminal signal was denoted by m. Position Specific Scoring Matrix (PSSM) profiles were generated from PSI-BLAST [22] against non-redundant sequences of S-sulfenylation sites. The scores were computed based on the multiple sequence alignment of proteins which may have similar structures with different amino acid compositions. In a PSSM profile, the rows in the matrix of (2n + 1) × 20 elements were centered on the substrate site, with 2n + 1 representing the window size and the number 20 denoting the position specific scores for each amino acid.

The solvent-accessible surface area (ASA), a feature representing the accessibility of an amino acid side-chain on the surface of a protein that experience post-translational modification [23], was also included as an attribute for the identification of S-sulfenylation sites. The RVP-Net [24, 25] was applied to compute the ASA value from the protein sequence due to the lack of experimentally verified tertiary structures of S-sulfenylated protein in the Protein Data Bank (PDB) [26]. Based on the information regarding the neighborhood amino acid composition, RVP-Net could predict the real ASA of a residue by using a neutral network approach. The possible mean absolute error, which was defined as the absolute difference between the predicted and experimental values of relative ASA per residue [25], was 18.0–19.5 %, for each measurement. The value of ASA represented the percentage of the solvent-accessible area of each amino acid on the protein. To compute the ASA values of all of the residues, full-length protein sequences were input into the RVP-Net. Then, the ASA values of amino acids surrounding the S-sulfenylation sites were extracted and normalized based on a scale from zero to one. In the investigation herein of secondary structure surrounding the SOH sites, additionally, PSIPRED [27] is employed to compute the secondary structure from the protein sequence. PSIPRED is a simple and reliable method for predicting secondary structure, which incorporates two feed-forward neural networks to analyze the output obtained from PSI-BLAST (Position Specific Iterated - BLAST) [22]. The output of PSIPRED is given in terms of “H,” “E” and “C” which stand for helix, sheet and coil, respectively.

The AAindex [28] (Version 9.1) contains a total of 544 amino acid indices which specify the physicochemical properties of twenty amino acids. After the amino acid indices with the value “NA” were eliminated, the physicochemical properties of the remaining 531 were examined to determine their ability to distinguish S-sulfenylated sites from the non-S-sulfenylated sites. A set of 20 numerical values corresponding to the physicochemical property of each of the 20 amino acids was specified. The sequence fragments were transformed from AAs surrounding S-sulfenylation sites into values associated with their physicochemical properties. In order to identify the significant physicochemical properties, the F-score method [2933] has been applied to calculate a statistical value for each position surrounding S-sulfenylation sites. The F-score of the ith physicochemical feature is defined as:

$$ \mathrm{F}\hbox{-} \mathrm{score}\ (i) = \frac{{\left({{\overline{x}}_i}^{\left(+\right)}-{\overline{x}}_i\right)}^2+{\left({{\overline{x}}_i}^{\left(-\right)}-{\overline{x}}_i\right)}^2}{\frac{1}{n^{+}-1}{\displaystyle \sum_{k=1}^{n^{+}}{\left({x}_{k,i}^{\left(+\right)}-{{\overline{x}}_i}^{\left(+\right)}\right)}^2}+\frac{1}{n^{-}-1}{\displaystyle \sum_{k=1}^{n^{-}}{\left({x}_{k,i}^{\left(-\right)}-{{\overline{x}}_i}^{\left(-\right)}\right)}^2}} $$
(1)

where \( {\overline{x}}_i \), \( {{\overline{x}}_i}^{\left(+\right)} \) and \( {{\overline{x}}_i}^{\left(-\right)} \) denote the average value of the ith feature in the whole, positive, and negative data sets, respectively; n + denotes the number of positive data set and n denotes the number of negative data set; x (+) k,i denotes the ith feature of the kth positive instance, and x (−) k,i denotes the ith feature of the kth negative instance [34]. The prediction performances obtained from using the physicochemical properties individually were evaluated, and the properties were subsequently sorted in descending order based on the accuracy of prediction.

For the construction of predictive models, hybrid features were generated by combining two or more single features. In an attempt to obtain the highest predictive accuracy, the single features were selected based on their predictive performance. Prior to classification, the data needed to be scaled in the range of [−1, 1] to enhance the effectiveness of the results [35].

Model training and evaluation

The training data set was used for building prediction models with the support vector machine (SVM). This binary classification utilizes a kernel function to transform the input samples into a higher dimensional space and attempts to find a hyper-plane to discriminate the two classes with maximal margin and minimal error. In our study, a public SVM tool (LIBSVM) [14] was implemented to build models that could discriminate between S-sulfenylation and non-S-sulfenylation sites. The radial basis function (RBF) \( K\left({S}_i,{S}_j\right)={e}^{\left(-\gamma {\left\Vert s\right.}_i-{s}_j\left\Vert {}^2\right.\right)} \) was adopted as the kernel function. Two factors were included to enhance the performance: the RBF kernel was determined by the gamma parameter, while the softness of the hyper-plane was modulated by the cost parameter.

To choose the best final model, five-fold cross-validation was carried out for each feature to evaluate the predictive performance. The training dataset was divided into five subgroups with approximately equal size. The ratio of the testing set to the training set was 1:4 and the cross-validation process was repeated five times. The five validation results were then combined to generate a single estimation. Obviously, one of the benefits of k-fold cross-validation is the improvement on the reliability of evaluation because all of the original data, including the training and testing data sets, were considered and each subset should be tested only once [36] . To estimate the predictive performance of each trained model, measures such as sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews Correlation Coefficient (MCC) were used:

$$ Sn=\frac{TP}{TP+FN} $$
(2)
$$ Sp=\frac{TN}{TN+FP} $$
(3)
$$ Acc=\frac{TP=TN}{TP=FP=TN=FN} $$
(4)
$$ MCC=\frac{\left(TP*TN\right)-\left(FN*FP\right)}{\sqrt{\left(TP+FN\right)}*\left(TN+FP\right)*\left(TP+FP\right)*\left(TN+FN\right)} $$
(5)

where TP, TN, FP and FN represented the number of true positives, true negatives, false positives and false negatives, respectively. The MCC reflects both the sensitivity (true positive rate) and specificity (true negative rate) of a predictive model. Sometimes, accuracy is not useful when the two classes are of very different sizes [37]. Therefore, the MCC is usually regarded as a balanced measure even if the two classes are of very different sizes. Finally, after selecting the best model with the highest MCC value, an independent testing was carried out to evaluate its real predictive power.

Results and discussion

Investigation of amino acid composition of S-sulfenylation site

Through sequence-based investigation, the frequency of twenty amino acids surrounding the S-sulfenylation sites revealed the amino acid composition that forms the substrate environment for protein S-sulfenylation. Figure 2A indicates that, at S-sulfenylation sites, A (Alanine), R (Arginine), E (Glutamic acid), and K (Lysine) residues occur at a higher frequency, while C (Cysteine) and H (Histidine) residues have a lower frequency of occurrence. In this investigation, WebLogo [38] was utilized to compute the position-specific amino acid composition for both S-sulfenylation (Fig. 2B) and non-S-sulfenylation (Fig. 2C). However, it is difficult to compare the amino acid composition between S-sulfenylation and non-S-sulfenylation sites at a specific position. To effectively identify S-sulfenylation sites, this study concentrated on notable differences between S-sulfenylated and non-S-sulfenylated sites. TwoSampleLogo [39] was employed to detect statistically significant differences in position-specific amino acid composition between the positive and negative datasets. Comparing between the 1145 positive data and 8368 negative data (Fig. 2D), it is obvious that three of the four aforementioned amino acids (R, E and K) play an important role in the flanking region of S-sulfenylation sites. In particular, the positively charged Lysine (K) and Arginine (R) residues had the highest ratio at positions −10, −8 ~ −6, −4, −2, and +4 ~ +8 (p < 0.01). In contrast, at positions −1, +1 and +2 that were close to the S-sulfenylation sites, a lack of positively charged residues was observed, while a noticeable abundance of negatively charged resides (Glutamic acid or E) was apparent at positions −3, +1, +3, +4. Interestingly, three polar residues, serine (S), asparagine (N), and glycine (G), appeared to have a higher frequency of occurrence compared to the rest of the amino acids at position −1. For non-S-sulfenylated sites, however, there was an abundance of neutral amino acids, including leucine (L), cysteine (C), histidine (H), methionine (M), phenylalanine (F) and tyrosine (Y), at positions ranging from −9 to +7, while arginine (R) residue seemed to be concentrated at three positions (−1, 1 and 2) around non-S-sulfenylation sites. Another evident difference is the amino acid composition surrounding S-sulfenylation sites at positions −6, −2, −1, +3 and +4. The analytical results indicated that the positions of amino acids relative to one another in the sequence play a vital role in discriminating between S-sulfenylation and non-S-sulfenylation sites.

Fig. 2
figure 2

Amino acid composition of protein S-sulfenylation sites. a Comparison of amino acid composition between S-sulfenylation sites (blue) and non- S-sulfenylation sites (red). b Position-specific amino acid composition of S-sulfenylation sites. c Position-specific amino acid composition of non-S-sulfenylation sites. d TwoSampleLogo between S-sulfenylation sites (positive data) and non-S-sulfenylation sites (negative data)

Computed by the RVP-Net tool, ASA was also adopted as an attribute for the identification of S-sulfenylation sites. To discover how amino acids flanking the S-sulfenylation and non-S-sulfenylation sites may differ in their interaction with solvents, a comparison was performed using the average proportion of ASA in the 21-mer window (−10 ~ +10) as illustrated in Fig. 3. Amino acids surrounding the S-sulfenylated sites exhibit higher ASA compared to those around non-S-sulfenylation sites. A strong evidence for hydrophilicity at the S-sulfenylated substrate sites was found because the average percentage of ASA values of the flanking residues was higher than non-S-sulfenylated cysteines. Thus, hydrophilic amino acids flanking cysteine residues may determine their modification by sulfenylation.

Fig. 3
figure 3

Comparison of the solvent-accessible surface area between S-sulfenylation and non-S-sulfenylation sites

Performance evaluation of the trained models

Choosing suitable features is one of the crucial steps to constructing the best prediction model for the discrimination of S-sulfenylation sites from non-S-sulfenylation sites. In our study, the predictive model was trained from a variety of features, such as 20D (binary code), AAC, AAPC, BLOSUM62, ASA, PSSM, PWM and physicochemical properties. Five-fold cross-validation was performed for each model and four parameters (Sn, Sp, Acc and MCC) were included as the evaluation criteria. As shown in Table 2, the SVM model trained with PSSM feature yielded the best prediction performance: a sensitivity value of 0.71, specificity of 0.72, accuracy of 0.72, and MCC of 0.30. Slightly below the PSSM model in performance, the sensitivity, specificity, accuracy and MCC of the BLOSUM62 model were 0.68, 0.70, 0.69 and 0.26, respectively. The SVM model constructed with the ASA feature yielded the lowest predictive accuracy (0.61) and relatively lower sensitivity (0.60), specificity (0.66), and MCC (0.14).

Table 2 Five-fold cross validation results for SVM models trained with various features individually

To further investigate the physicochemical properties of S-sulfenylation sites and the adjacent amino acids, a total of 531 physicochemical properties, extracted from version 9.1 of the AAindex, were individually explored. Each physicochemical property with a significant F-score value was assessed by five-fold cross validation in order to evaluate their ability to identify S-sulfenylation sites. (Additional file 2: Table S1) shows the top 20 physicochemical properties ranked in a descending order according to their predictive performance. Descriptions of the 531 physicochemical properties from the AAindex indicated that four of the 20 physicochemical properties are associated with hydrophobicity. Overall, the accuracy of each of these features was over 0.6. In order to obtain a better prediction performance, the 20 highest performing physicochemical properties were combined with the best sequence-based feature (PSSM) by forward selection in (Additional file 3: Figure S2). A summary of the resulting performance is illustrated in Fig. 4, while the detailed results are presented in (Additional file 4: Table S2). With the consideration of more physicochemical properties, slight increases in specificity and accuracy were observed, while the sensitivity value fluctuated with different features. The best model yielded a sensitivity of 0.746, specificity of 0.737, accuracy of 0.738 and 0.337 MCC value. This model was built from PSSM and the top 12 most useful physicochemical properties (GUYH850101, JANJ790102, KIDA850101, FASG890101, KARP850101, EISD860102, LEVM760101, GUYH850104, GUYH850102, VINM940103, MIYS990104 and FUKS010111), and appeared to be the most effective at identifying S-sulfenylation sites among all of the models tested so far. Interestingly, three of the 12 selected indices (KIDA850101, FASG890101 and EISD860102) shared a common physicochemical feature, which was hydrophobicity. As analyzed previously, hydrophobicity contributes significantly to the characteristics, structures and functions of these proteins.

Fig. 4
figure 4

The predictive performance of PSSM model combined with forward selection of the top 20 physicochemical properties

Performance evaluation by independent testing dataset

For the classification of PTM sites, the prediction performance of the constructed models may be overestimated as a result of the overfitting to a training set. In order to evaluate the real performance of the predictive model, an independent testing data set, which was truly blind to the training data set, was utilized to further test the effectiveness of the selected model (PSSM + 12 AAindex) with the best performance in five-fold cross validation. This independent testing data set consisted of 289 positive data and 2108 negative data extracted from the original data. Figure 5 presents the comparison of the independent testing results between the model trained with the best single feature (PSSM) model and the model trained with the combination of PSSM and the top 12 physicochemical properties. The PSSM model yielded a sensitivity of 0.647, a specificity of 0.659, an accuracy of 0.657, and a MCC value of 0.205. Compared to the PSSM model, the model combining PSSM with the 12 AAindexes generated better sensitivity (0.720), specificity (0.690), accuracy (0.693), and a significantly better MCC (0.278) with p-value < 0.05. The detailed independent testing results, including true positive (TP), false negative (FN), true negative (TN) and false positive (FP), are described in (Additional file 5: Table S3). Overall, the model combining PSSM with the 12 AAindexes achieved promising predictive performance on the independent testing data set.

Fig. 5
figure 5

Comparison of the independent testing results between PSSM model and the hybrid model combining PSSM with the top 12 physicochemical properties

Functional exploitation of the S-sulfenylated proteins

In this study, the Database for Annotation, Visualization and Integrated Discovery (DAVID) [40] tool was employed to explore the functions and characteristics of S-sulfenylated proteins. (Additional file 6: Table S4) provides the Gene Ontology (GO) [41] annotations associated with the S-sulfenylated proteins examined in this work. According to the GO annotation of biological process, 10.8 % of the proteins were involved in RNA processing (p-value = 7.65E-32). In terms of molecular function annotated by GO, most of the proteins seemed to be involved in RNA binding (p-value = 1.06E-42) or nucleotide binding (p-value = 4.47E-35). With respect to cellular component, these S-sulfenylated proteins are mostly distributed in the nuclear lumen (p-value = 9.07E-44) and non-membrane-bounded organelles (p-value = 1.81E-42). Additionally, pathway analysis showed that S-sulfenylated proteins are involved in a variety of pathways in (Additional file 7: Table S5). Of note, 27 of the 1096 S-sulfenylated proteins appeared to be strongly associated with the mechanisms underlying Escherichia coli infection.

It has been reported that a protein-interacting domain usually recognizes a short peptide motif on the target protein but does not bind stably until the peptide has undergone appropriate PTM [4244]; this can create binding sites for specific protein-interaction domains that work together to carry out a specific cellular function [45]. The redox state and chemical modification of the thiol group of cysteine residues facilitate its interaction with various proteins to regulate a variety of intracellular and intercellular events [46]. Thus, information regarding the functional domains could be utilized to infer the functional roles of S-sulfenylation sites located in a specific protein domain. InterPro [47] is an integrated resource that provides "signatures" such as protein families, domains, and functional sites. Our investigation of the protein domains revealed that the thioredoxin-like fold domain is the most abundant functional domain in S-sulfenylated proteins in (Additional file 8: Table S6).

Distinguishing S-sulfenylation from S-nitrosylation or S-glutathionylation

The primary purpose of this study was to effectively identify S-sulfenylation sites from a large-scale proteome data with modified cysteine residues. However, an increasing number of cysteine-based redox modifications, such as S-sulfenylation, S-nitrosylation and S-glutathionylation, were reported to share the same cysteine targets [48]. Therefore, investigating the sequence characteristics among these cysteine-based PTMs is necessary for achieving reliable detection of S-sulfenylation sites. According to experimentally verified data obtained from dbSNO [49, 50], there exist 2212 S-nitrosylated proteins containing 4244 nitrosylation sites. In addition, a total of 2148 S-glutathionylated proteins harboring 3641 S-glutathionylation sites were obtained from dbGSH [51]. To avoid redundancy in data as a result of homologous sequences, sequence fragments were deleted if they were identified with 100 % similarity across the three datasets. Overall, 162 (15.74 %) of the 1029 S-sulfenylated proteins may undergo all three types of PTMs (Fig. 6A). In addition, percentages of the 1029 S-sulfenylated proteins that were also modified by S-nitrosylation and S-glutathionylation are approximately 23.23 % and 34.89 %, respectively. Investigation of the 1434 S-sulfenylated cysteine residues revealed that only 103 sites (7.18 %) could undergo all the three PTMs. As illustrated in Fig. 6B, the percentage of sites that are susceptible to both S-sulfenylation and S-glutathionylation (21.27 %) appeared to be higher than those that can be modified by S-sulfenylation and S-nitrosylation (14.99 %).

Fig. 6
figure 6

Discrimination of S-sulfenylation sites from S-nitrosylation and S-glutathionylation sites. a Number of duplicate proteins among S-sulfenylation, S-nitrosylation and S-glutathionylation; (b) Number of duplicate sites among S-sulfenylation, S-nitrosylation and S-glutathionylation; (c) Significant differences in position-specific compositions among three PTMs as identified by TwoSampleLogo

Taking advantage of the TwoSampleLogo, we uncovered the potential consensus motifs that may distinguish S-sulfenylation sites from S-nitrosylation or S-glutathionylation sites. As represented in Fig. 6C, the figure on the top panel describes notable differences in the position-specific compositions between S-sulfenylation and S-glutathionylation, while the figure on the bottom panel shows the potential amino acid composition difference between S-sulfenylation and S-nitrosylation sites. It seemed that S-sulfenylation sites could be recognized based on specific positions such as −10, −6, −5, −2, and +2 to +4 (p-value < 0.01). This investigation indicated a consistent composition of amino acids with Fig. 2D, particularly at positions −6, −2, +3 and +4. The frequency of positively charged groups, including R (Arginine) and K (Lysine), appeared to be significant at position −6 and −2, and that the occurrence of E (Glutamic acid) belonging to negatively charged residue was also apparent at positions +3 and +4.

Conclusions

This study describes a systematic investigation on the experimentally verified S-sulfenylation sites based on amino acid composition. The analysis of position-specific amino acids composition revealed that the most pronounced feature of S-sulfenylation sites is the abundance of positively charged amino acids (K and R) at surrounding positions: −10, −8 ~ −6, −4, −2, and +4 to +8. However, the depletion of positively charged residues was observed at positions −1, +1 and +2 that are close to the S-sulfenylation sites. Additionally, a noticeable abundance of negatively charged resides (E) was found at positions −3, +1, +3, +4. This investigation suggested that the amino acid position relative to one another in the sequence may play a vital role in discriminating between S-sulfenylation and non-S-sulfenylation sites. The comparison of position-specific amino acid composition between S-sulfenylation and non-S-sulfenylation sites also implicated that, for S-sulfenylation sites, distant amino acids around position −7 and +7, which may be close to S-sulfenylated cysteines in three-dimensional structure, harbor a notable abundance of positively charged amino acids (K and R).

In this work, the solvent accessibility and physicochemical properties were considered in the characterization of S-sulfenylation sites. The S-sulfenylation sites have a higher solvent accessibility, especially at the positions −7, +1, +2 and +4. According to the F-score measurements on 531 physicochemical properties, 20 physicochemical properties were revealed to have statistically significant differences between S-sulfenylation and non-S-sulfenylation sites. Based on the evaluation by five-fold cross-validation, the model trained with the combined features of evolutionary information (PSSM) and 12 physicochemical properties yielded the best prediction performance. Overall, our study not only determined the best prediction model for the identification of S-sulfenylation sites, but also discovered several characteristics and functions regarding S-sulfenylated proteins. Hydrophobic characteristics of S-sulfenylated proteins seem to be crucial for the understanding of their functions. Moreover, amino acids R (Arginine), E (Glutaminc acid), and K (Lysine) may be the potential consensus motifs for specific S-sulfenylated sites.