Introduction

Protein post-translational modifications play an important role in protein processing and maturation, and they alter the physical and chemical properties of proteins. As a result, the spatial conformation, three-dimensional location and stability of a protein are likely to change, which can in turn alter its function. Moreover, the structural features of the modified groups can have a far-reaching impact on the properties and functions of proteins. In 1998, the Nobel Prize in Physiology or Medicine was awarded1 for breakthrough discoveries showing that nitric oxide (NO) is a freely diffusible signaling molecule and a second messenger. NO plays a vital role in the cardiovascular system2. S-nitrosylation (SNO), the covalent attachment of nitric oxide to the thiol group of cysteine residues1,3, is well characterized as a major source of NO bioactivity4.

Many experimental methods have been applied to identify SNO sites, such as the biotin-switch technique (BST)5,6, SNO-Cys site identification (SNOSID)7,8,9, and resin-assisted capture (RAC)10. These experimental methods have provided effective information for identifying SNO sites. The BST was designed to purify and detect SNO proteins and mainly comprises three principal steps: (i) methylthiolation of free cysteine thiols with methyl methanethiosulfonate (MMTS); (ii) reduction of SNOs to thiols with ascorbate; and (iii) ligation of the nascent thiols with N-[6-(Biotinamido)hexyl]-3′-(2′-Pyridyldithio)-propionamide (biotin-HPDP)11. In combination with traditional mass spectrometry (MS), BST has contributed to the discovery of many potential protein SNO sites12,13,14,15. SNOSID, a proteomic method that identifies endogenous and chemically induced SNOs in proteins from tissues or cells, was developed to determine potential SNO sites on cysteine residues in complex protein mixtures. Furthermore, a RAC-based method was also developed to detect SNO proteins10. In 2009, Foster et al.16 explored a protein microarray-based approach to screen SNO sites. These methods made great contributions to the development of SNO prediction; however, they are, to a certain degree, time-consuming and relatively low-throughput.

Recently, several machine learning approaches have been proposed that provide helpful information for further experimental verification of protein SNO sites. Hao et al.7 developed a prediction tool for SNO sites based on the support vector machine (SVM)17 algorithm, using a training dataset of 65 positive SNO sites and 65 non-SNO sites. A few years later, Xue et al.2 proposed GPS-SNO, a group-based prediction system built on 504 experimentally verified SNO sites in 327 unique proteins. Shortly afterward, Li et al.18 established the predictor CPR-SNO and built a web server based on a coupling-pattern encoding scheme. Xu et al.19,20 developed iSNO-AAPair, which takes into account the effects of sequence correlation. More recently, Jia et al.21 constructed the feature vector from an Adapted Normal Distribution Bi-Profile Bayes (ANBPB) encoding and Chou's PseAAC composition. The method of Zhang et al.22 was also based on Chou's PseAAC, incorporating various sequence-derived features.

Each of the above-mentioned methods has its own advantages and has played an important role in research on the prediction of protein S-nitrosylation sites. However, the prediction performance is still not fully satisfactory. Therefore, more efficient methods for SNO site identification are needed.

In this study, we extracted nine types of features: PC-PseAAC (25), kmer1 (20), kmer2 (400), PC-PseAAC_G (25), ANBPB (40), DBPB (38), BPB (40), IAAPair (39) and PSTAAP (18), where the number in parentheses is the dimension of each feature. To remove redundant information, the information gain (IG) method was applied to select features. Finally, the optimized 425-dimensional feature vector (PC-PseAAC (25), kmer1 (20), kmer2 (180), PC-PseAAC_G (25), ANBPB (40), DBPB (38), BPB (40), IAAPair (39), and PSTAAP (18)) was used to construct our prediction model. Our results suggest that IG improves performance compared with the model built without IG-based selection, indicating that IG feature selection is a promising approach for handling high-dimensional features in SNO site prediction.

Results and Discussion

Combination of different features

To evaluate the performance of the combined feature sets for separating SNO sites from non-SNO sites, we measured prediction performance with the jackknife test23, which is considered the most objective test and always yields a unique result for a given dataset21,24. The combined features were composed of the PC-PseAAC, kmer1, kmer2, PC-PseAAC_G25, ANBPB21, DBPB, BPB26, IAAPair19, and PSTAAP20,27 models, and the detailed results are shown in Supplementary Table S1. The results show that combining features enhanced the prediction performance. As shown in Table 1, PC-PseAAC, with an Acc of 62.82%, was taken as the basic feature set; adding kmer1 improved the Acc to 64.83%. Next, the kmer2 component was added to the PC-PseAAC + kmer1 combination, and the resulting combination of PC-PseAAC, kmer1 and kmer2 reached an Acc of 64.89%. The remaining feature types were added one by one in the same way, and the process terminated at the combination of PC-PseAAC, kmer1, kmer2, PC-PseAAC_G, ANBPB, DBPB, BPB, IAAPair, and PSTAAP, which increased the Acc to 74.24% and the MCC17,28,29,30 to 0.4837. Overall, the combined features improved the Acc by 11.42%. The parameter λ and the weight factor w of PC-PseAAC and PC-PseAAC_G were tuned, and the optimized values were λ = 5 and w = 0.5. A minimal code sketch of this accumulation procedure is given below.
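The greedy accumulation of feature groups described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes each extractor maps a peptide string to a fixed-length NumPy vector, and the extractor list, function names and SVM classifier are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def accumulate_groups(extractors, peptides, labels):
    """Add feature groups in a fixed order and report jackknife Acc after each step."""
    X_parts, accs = [], []
    for name, extract in extractors:           # e.g. PC-PseAAC, kmer1, kmer2, ...
        block = np.array([extract(p) for p in peptides])
        X_parts.append(block)                  # every group is kept, as in Table 1
        X = np.hstack(X_parts)
        # Jackknife test = leave-one-out cross-validation on the training set.
        acc = cross_val_score(SVC(), X, labels, cv=LeaveOneOut()).mean()
        accs.append((name, acc))
    return accs
```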

Table 1 Performance of the combined features with different sequence encoding schemes in the jackknife test.

Features selection via IG

To further improve the prediction performance, the features were optimized with the IG method described below. The four feature types PC-PseAAC, kmer1, kmer2, and PC-PseAAC_G mainly reflect amino acid frequencies and are independent of position in the protein sequence; hence, we optimized these four feature types based on the IG scores of the amino acid residues. First, we ranked the amino acid composition (AAC) and the amino acid pair composition (i.e., kmer2) by IG score, and then applied incremental feature selection to find the feature subset that maximizes prediction performance. According to the final performance evaluation, applying the IG score to kmer2 was especially effective. The detailed prediction performances for different numbers of combined features under 10-fold cross-validation are shown in Fig. 1; a minimal code sketch of this procedure follows below. When 180 dimensions of the feature vector were selected, the predictive performance reached its highest value, with Sn of 72.79%, Sp of 74.64%, Acc of 73.71%, and MCC of 0.4741. However, there was no obvious improvement for the other three feature types PC-PseAAC, kmer1, and PC-PseAAC_G, possibly because of their low dimensionality (fewer than 50 dimensions each). In contrast, kmer2 has 400 dimensions and its feature matrix is extremely sparse, so IG performs well on it. The IG score rankings of amino acid residues and dipeptides are displayed in Figs 2 and 3, respectively, and the detailed results are given in Supplementary Table S2. Notably, the amino acid residues K, M, and C and the dipeptides MG, VK, and ML contributed greatly to the prediction performance. As Figs 2 and 3 show, the highest IG scores reached 0.0156, 0.0112 and 0.0043 for the residues K, M, and C, respectively, and 0.0062, 0.0057 and 0.0047 for the dipeptides MG, VK, and ML, respectively.
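To make the IG-plus-incremental-feature-selection step concrete, here is a minimal sketch under stated assumptions: `X_kmer2` is the samples × 400 kmer2 matrix, `y` holds the binary SNO labels, scikit-learn's `mutual_info_classif` stands in for the IG score, and an SVM is a placeholder classifier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_classif

def ifs_by_ig(X_kmer2, y, step=10):
    """Incremental feature selection over IG-ranked kmer2 columns."""
    ig = mutual_info_classif(X_kmer2, y, discrete_features=False)
    order = np.argsort(ig)[::-1]          # columns sorted by IG score, descending
    best_k, best_acc = 0, 0.0
    for k in range(step, X_kmer2.shape[1] + 1, step):
        acc = cross_val_score(SVC(), X_kmer2[:, order[:k]], y, cv=10).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc     # the paper's optimum for kmer2 was k = 180
    return order[:best_k], best_acc
```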

Figure 1

The predictive performance of different models based on incremental feature selection of features sorted by IG.

Figure 2

The IG score of each amino acid residue.

Figure 3

The IG score of each dipeptide.

Before feature selection, the prediction performance was Sn of 73.46%, Sp of 74.94%, and Acc of 74.28%. After removing the irrelevant features and determining the optimal combination, we obtained the best prediction performance, with Sn of 73.60%, Sp of 75.93% and Acc of 74.82%, respectively. All three measurements improved slightly, although the performance is still not fully satisfactory and leaves room for improvement in future work. The results for the best predictive performance are shown in Table 2. An improved predictive Acc was observed for the models trained with the optimized features compared with the model trained with non-optimized features. As given in Supplementary Table S2, these 425 features [PC-PseAAC(25) + kmer1(20) + kmer2(180) + PC-PseAAC_G(25) + ANBPB(40) + DBPB(38) + BPB(40) + IAAPair(39) + PSTAAP(18)] were regarded as the optimal feature set for the selected model. Based on the 425 features, the predictive Sn, Sp, and Acc were 73.60%, 75.93% and 74.82%, respectively. These results indicate that the key amino acid residues and key dipeptides used in optimizing the models can enhance the prediction performance for SNO sites. Consequently, the features combined with the key amino acid residues were applied to implement a novel, high-performance tool for identifying cysteine S-nitrosylation sites.

Table 2 Feature optimization based on IG in the jackknife test.

Comparison with other feature selection methods

In this paper, different feature selection methods were also compared. We evaluated the performance of IG against Max-Relevance-Max-Distance31 (MRMD), another feature selection method. MRMD contains two components, max distance and max relevance: the max-distance component selects the feature with the least redundancy among the remaining features, while the max-relevance component selects the feature with the strongest relevance to the target class.

We used the four distance options of MRMD (ED, COS, TC and Mean) to find the best feature vector combinations via 10-fold cross-validation. The detailed predictive performances are shown in Fig. 4. When the distance function ED was adopted, the best Acc of 73.32% was achieved with 356 features. With the COS, TC and Mean distance functions, the predictive performance peaked at 397, 398 and 43 features, with corresponding Acc values of 73.14%, 73.35% and 73.34%, respectively. Given that the total dimension of the feature vector is 400, the effect of dimension reduction is not obvious for the ED, COS and TC distance functions (best performance with 356, 397 and 398 features, respectively). In contrast, the reduction is drastic for the Mean distance function (best performance with 43 features), which discards a large amount of information from the feature vector. The best performances for the different feature selection methods are listed in Supplementary Tables S8–S11.

Figure 4

Comparison of the predictive performance of different models. (A) Comparison of the predictive performance of different feature selection methods. (B) The predictive performance of different models based on incremental feature selection of features sorted by the ED distance of MRMD. (C) The predictive performance of different models based on incremental feature selection of features sorted by the COS distance of MRMD. (D) The predictive performance of different models based on incremental feature selection of features sorted by the TC distance of MRMD. (E) The predictive performance of different models based on incremental feature selection of features sorted by the Mean distance of MRMD.

From Fig. 4A, we can see that although the performances of IG and the four MRMD variants are broadly similar on the same datasets, the Acc of IG is generally higher than that of the MRMD variants (ED, COS, TC and Mean). Moreover, IG is better able to reduce the dimensionality of high-dimensional feature vectors while maintaining a high Acc. Fig. 4B–E show the predictive performance of feature vectors of different dimensions when the MD function is ED, COS, TC and Mean, respectively.

Comparison with other methods

To make a fair and fast comparison, we compared the prediction performance of our predictor with GPS-SNO2, iSNO-PseAAC20, iSNO-ANBPB21, PSNO22, and iSNO-AAPair19 on the Xu training dataset by running the 10-fold cross-validation test 10 times. The results are shown in Table 3. Our model exhibits the best prediction performance, with an Acc of 83.11%, which is 1.41% higher than the previous best-performing predictor iSNO-AAPair and 7.44% higher than the Acc achieved by PSNO. Our predictor also gave an MCC of 0.6617, which is 0.0317 higher than that of iSNO-AAPair and 0.1498 higher than that of PSNO. Furthermore, the Sn of our predictor was 83.33%, which is 3.73% higher than that of iSNO-AAPair and 9.18% higher than that of PSNO. This comparison indicates that the proposed model is promising and can at least complement the existing state-of-the-art methods in this field. In addition, we compared the predictive power of our model with that of SNOSite32, iSNO-AAPair19, iSNO-PseAAC20, and iSNO-ANBPB21 on the Li test dataset, and with GPS-SNO2, iSNO-PseAAC20, iSNO-AAPair19, and PSNO22 on the Xu test dataset. The performances of the above-mentioned models on the two test datasets are summarized in Supplementary Tables S3 and S4. On the Li independent test dataset, our model correctly identified proteins O00429 (site 367), P13221 (site 83), and P43235 (site 139) as S-nitrosylation sites, whereas iSNO-AAPair and iSNO-PseAAC incorrectly predicted them as non-S-nitrosylation sites. On the Xu independent test dataset, our model correctly identified proteins O70572 (site 176), P51174 (site 342), Q8VDG5 (site 308), Q9WVQ5 (site 146), and P55060 (site 344) as S-nitrosylation sites, whereas iSNO-PseAAC and GPS-SNO incorrectly predicted them as non-S-nitrosylation sites. To present the prediction results clearly, the Sn, Sp, Acc and MCC achieved by each model are summarized in Table 4. Our predictor achieved Sn of 60.47%, Sp of 77.69% and Acc of 73.17% on the Li test dataset; among the other five methods, the best performance was achieved by the method of Li et al., with Sn of 51.16%, Sp of 69.42% and Acc of 64.63%, so our method is clearly superior there. On the Xu test dataset, however, our predictor achieved Sn of 64.20%, Sp of 75.00%, and Acc of 70.17%, which is only better than iSNO-PseAAC (Sn of 50.2%, Sp of 75.1% and Acc of 62.8%). These results show that our predictor outperformed previous methods in terms of precision, but the results on the Xu test set are not ideal, possibly because physicochemical properties were not considered. In future work, we will consider more comprehensive features and further optimize the feature combination approach.

Table 3 Comparison with other methods on the training dataset.
Table 4 Comparison with other methods on the test datasets.

Conclusion

The prediction of SNO sites is essential for a better understanding of basic biological processes, for clinical diagnosis, and for pharmaceutical development. In this study, we introduced IG as a tool for analyzing the importance of amino acids and their positions in feature extraction. We focused on characteristics of the amino acids and their sequence context: four of the nine feature types in the combination are related to amino acid residues. These four feature types were screened using the IG method, and the best dimension of the feature vector was selected. In particular, 180 important features were screened from kmer2, whose dimension was thereby reduced from 400 to 180, yielding the best prediction performance; the Sn and Acc reached 83.33% and 83.11%, respectively. In principle, protein structure contains far more information than simple sequences, and this will be considered in future work. With the coming of the internet and big-data era, constructing databases33,34,35,36,37,38,39,40 and establishing powerful web servers are important directions for bioinformatics, as they make such methods convenient for experimental scientists.

Material and Methods

Datasets

The datasets were constructed from those of Li et al.22 and Xu et al.19,20 (henceforth the Li dataset and Xu dataset, respectively). As described previously19,20,24, these datasets were derived from experimentally verified protein S-nitrosylation sites. The Xu training dataset consisted of 731 SNO sites as positive samples and 810 non-SNO cysteine sites as negative samples, drawn from 438 proteins with ≤40% sequence similarity; these samples were used to train our prediction model. The Xu test dataset consisted of 81 SNO sites and 100 non-SNO sites, and the Li test dataset included 43 SNO sites and 121 non-SNO sites. The Xu and Li test datasets were used to evaluate the prediction performance of our model.

Consider a protein peptide sample P from our datasets, which can be generally formulated as:

$${\rm{P}}={R}_{-t}{R}_{-(t-1)}\ldots {R}_{-2}{R}_{-1}(C){R}_{+1}{R}_{+2}\ldots {R}_{+(t-1)}{R}_{+t}$$
(1)

where the subscript t is an integer, R+t is the t-th downstream amino acid residue from the cysteine (C), R−t is the t-th upstream residue, and so forth. The peptide is termed an SNO or non-SNO peptide depending on whether its central cysteine is an SNO or non-SNO site, respectively; thus P belongs to one of two categories, SNO sites (positive data) or non-SNO sites (negative data). In the current study, we set t = 10. If fewer than 10 residues were available upstream or downstream in a protein, the missing residues were filled with the dummy code X (a code sketch of this windowing is given below, after Eq. 4). Thus, the training dataset S is formulated as (\(\cup \) denotes the union of sets):

$${\rm{S}}={S}^{+}\cup {S}^{-}$$
(2)

where the positive dataset S+ consists of 731 SNO cysteine sites and the negative dataset S− contains 810 non-SNO cysteine sites. The test datasets TLi and TXu are formulated as:

$${{\rm{T}}}_{Li}={{\rm{T}}}_{Li}^{+}\cup {{\rm{T}}}_{Li}^{-}$$
(3)
$${{\rm{T}}}_{Xu}={{\rm{T}}}_{Xu}^{+}\cup {{\rm{T}}}_{Xu}^{-}$$
(4)

where the positive datasets \({{\rm{T}}}_{Li}^{+}\) and \({{\rm{T}}}_{Xu}^{+}\) contain 43 and 81 SNO peptide fragments, respectively, while the negative datasets \({{\rm{T}}}_{Li}^{-}\) and \({{\rm{T}}}_{Xu}^{-}\) contain 121 and 100 non-SNO peptide fragments, respectively. For the reader's convenience, the three datasets used in this study are given in Supplementary Tables S5–S7. The schematic flowchart of our work is shown in Fig. 5.
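A minimal sketch of the windowing of Eq. 1 with t = 10, assuming 0-based indexing; `extract_window` is an illustrative name, not code from the paper.

```python
def extract_window(sequence, cys_index, t=10):
    """Return the peptide R-t ... C ... R+t around the cysteine at cys_index,
    padding with the dummy residue 'X' near either terminus."""
    assert sequence[cys_index] == 'C'
    left = sequence[max(0, cys_index - t):cys_index]
    right = sequence[cys_index + 1:cys_index + 1 + t]
    return 'X' * (t - len(left)) + left + 'C' + right + 'X' * (t - len(right))

# Example: a cysteine three residues from the N-terminus (hypothetical sequence).
print(extract_window("MKACLVPEQRSTGHWYFNDIK", 3))  # -> 'XXXXXXXMKACLVPEQRSTGH'
```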

Figure 5

Flowchart of our predictor methodology.

Features extraction

Parallel correlation pseudo amino acid composition (PC-PseAAC)

PC-PseAAC41 is a feature extraction approach that incorporates both contiguous local and global sequence-order information into the feature vector of a protein sequence. Given a protein peptide P (Eq. 1), the PC-PseAAC feature vector for P is given by:

$${\rm{V}}={[{x}_{1},{x}_{2},\ldots ,{x}_{20},{x}_{21},\ldots ,{x}_{20+\lambda }]}^{T}$$
(5)

where

$${{\rm{x}}}_{u}=\{\begin{array}{l}\frac{{f}_{u}}{{\sum }_{i=1}^{20}\,{f}_{i}+w\,{\sum }_{j=1}^{\lambda }\,{{\Theta }}_{j}}(1\le u\le 20)\\ \frac{w{\Theta }_{u-20}}{{\sum }_{i=1}^{20}\,{f}_{i}+w\,{\sum }_{j=1}^{\lambda }\,{{\Theta }}_{j}}(20+1\le u\le 20+\lambda )\end{array}$$
(6)

where fi (i = 1,2, …, 20) is the normalized occurrence frequency of the 20 amino acids in the protein P; the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along a protein sequence; w is the weight factor, ranging from 0 to 1; and Θj (j = 1,2, …, λ) is the j-tier correlation factor reflecting the sequence-order correlation between all the j-th most contiguous residues along a protein chain, which is defined as:

$${{\Theta }}_{j}=\frac{1}{L-j}\,\sum _{i=1}^{L-j}\,{\Theta }({R}_{i},{R}_{i+j})\,(0 < j < L)$$
(7)

where the correlation function is given by:

$$\begin{array}{rcl}{\Theta }({R}_{i},{R}_{j}) & = & \frac{1}{3}\{{[{H}_{1}({R}_{j})-{H}_{1}({R}_{i})]}^{2}+{[{H}_{2}({R}_{j})-{H}_{2}({R}_{i})]}^{2}\\ & & +\,{[M({R}_{j})-M({R}_{i})]}^{2}\}\end{array}$$
(8)

where \({H}_{1}({R}_{i})\), \({H}_{2}({R}_{i})\) and \(M({R}_{i})\,\) are the hydrophobicity value, hydrophilicity value, and side-chain mass, respectively, of the amino acid Ri. It should be noted that before substituting the values of hydrophobicity, hydrophilicity, and side-chain mass into Eq. 8, they are all subjected to a standard conversion described by the following equations:

$${H}_{1}(i)=\frac{{H}_{1}^{0}(i)\,-\,{\sum }_{i=1}^{20}\,\frac{{H}_{1}^{0}(i)}{20}}{\sqrt{\frac{{\sum }_{i=1}^{20}\,{[{H}_{1}^{0}(i)-{\sum }_{i=1}^{20}\frac{{H}_{1}^{0}(i)}{20}]}^{2}}{20}}}$$
(9)
$${H}_{2}(i)=\frac{{H}_{2}^{0}(i)\,-\,{\sum }_{i=1}^{20}\,\frac{{H}_{2}^{0}(i)}{20}}{\sqrt{\frac{{\sum }_{i=1}^{20}\,{[{H}_{2}^{0}(i)-{\sum }_{i=1}^{20}\frac{{H}_{2}^{0}(i)}{20}]}^{2}}{20}}}$$
(10)
$$M(i)=\frac{{M}^{0}(i)\,-\,{\sum }_{i=1}^{20}\,\frac{{M}^{0}(i)}{20}}{\sqrt{\frac{{\sum }_{i=1}^{20}\,{[{M}^{0}(i)-{\sum }_{i=1}^{20}\frac{{M}^{0}(i)}{20}]}^{2}}{20}}}$$
(11)

where \({H}_{1}^{0}(i)\) and \({H}_{2}^{0}(i)\) represent the original hydrophobicity value and the original hydrophilicity value of the i-th amino acid respectively; and \({M}^{0}(i)\) is the mass of the i-th amino acid side chain.
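The computation of Eqs 5–11 can be sketched as follows, assuming `H1`, `H2` and `M` are dictionaries holding the original hydrophobicity, hydrophilicity and side-chain-mass values of the 20 amino acids (values omitted here). How dummy `X` residues are handled is an assumption of this sketch (they are simply skipped).

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def standardize(index):
    """Zero-mean, unit-variance conversion of a raw index (Eqs 9-11)."""
    vals = np.array([index[a] for a in AAS])
    return dict(zip(AAS, (vals - vals.mean())
                         / np.sqrt(((vals - vals.mean()) ** 2).mean())))

def pc_pseaac(seq, H1, H2, M, lam=5, w=0.5):
    h1, h2, m = standardize(H1), standardize(H2), standardize(M)
    seq = [a for a in seq if a in AAS]        # assumption: skip dummy 'X' residues
    L = len(seq)
    f = np.array([seq.count(a) for a in AAS], float) / L

    def corr(a, b):                           # correlation function, Eq. 8
        return ((h1[b] - h1[a]) ** 2 + (h2[b] - h2[a]) ** 2
                + (m[b] - m[a]) ** 2) / 3.0

    theta = [np.mean([corr(seq[i], seq[i + j]) for i in range(L - j)])
             for j in range(1, lam + 1)]      # tier correlations, Eq. 7
    denom = f.sum() + w * sum(theta)          # shared denominator, Eq. 6
    return np.concatenate([f / denom, w * np.array(theta) / denom])
```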

General parallel correlation pseudo amino acid composition (PC-PseAAC_G)

The PC-PseAAC_G approach42 not only incorporates the comprehensive built-in indices extracted from the AAindex43, but also allows users to upload their own indices to generate the PC-PseAAC_G feature vector. For a given protein peptide P (Eq. 1), the PC-PseAAC_G feature vector of P is defined as:

$${\rm{V}}={[{x}_{1},{x}_{2},\ldots ,{x}_{20},{x}_{21},\ldots ,{x}_{20+\lambda }]}^{T}$$
(12)

where

$${{\rm{x}}}_{u}=\{\begin{array}{l}\frac{{f}_{u}}{{\sum }_{i=1}^{20}\,{f}_{i}+w\,{\sum }_{j=1}^{\lambda }\,{{\Theta }}_{j}}(1\le u\le 20)\\ \frac{w{{\Theta }}_{u-20}}{{\sum }_{i=1}^{20}\,{f}_{i}+w\,{\sum }_{j=1}^{\lambda }\,{{\Theta }}_{j}}(20+1\le u\le 20+\lambda )\end{array}$$
(13)

where fi (i = 1,2, …, 20) is the normalized occurrence frequency of the 20 amino acids in the protein P; the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along a protein sequence; w is the weight factor, ranging from 0 to 1; and Θj (j = 1,2, …, λ) is the j-tier correlation factor reflecting the sequence-order correlation between all the j-th most contiguous residues along a protein chain, which is defined as:

$${{\Theta }}_{j}=\frac{1}{L-j}\,\sum _{i=1}^{L-j}\,{\Theta }({R}_{i},{R}_{i+j})\,(0 < j < L)$$
(14)

In this case, the correlation function is given by:

$${\Theta }({R}_{i},{R}_{j})=\frac{1}{\mu }\,\sum _{u=1}^{\mu }\,{[{H}_{u}({R}_{i})-{H}_{u}({R}_{j})]}^{2}$$
(15)

where µ is the number of physicochemical indices considered, and \({H}_{u}({R}_{i})\) and \({H}_{u}({R}_{j})\) are the u-th physicochemical index values of the amino acids Ri and Rj, respectively. It should be noted that before substituting the physicochemical index values into Eq. 15, they are likewise subjected to the standard conversion described by the following equation:

$${H}_{u}(i)=\frac{{H}_{u}^{0}(i)\,-\,{\sum }_{i=1}^{20}\,\frac{{H}_{u}^{0}(i)}{20}}{\sqrt{\frac{{\sum }_{i=1}^{20}\,{[{H}_{u}^{0}(i)-{\sum }_{i=1}^{20}\frac{{H}_{u}^{0}(i)}{20}]}^{2}}{20}}}$$
(16)

where \({H}_{u}^{0}(i)\) is the u-th original physicochemical value of the i-th amino acid.
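Relative to PC-PseAAC, only the correlation function changes (Eq. 15). A one-function sketch, assuming `indices` is a list of standardized per-residue index dictionaries as produced by the `standardize` helper above:

```python
def corr_general(a, b, indices):
    """Generalized correlation of Eq. 15: mean squared difference over
    an arbitrary set of standardized physicochemical indices."""
    return sum((idx[a] - idx[b]) ** 2 for idx in indices) / len(indices)
```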

Basic kmer (kmer)

Basic kmer44 is the simplest approach for representing proteins as numerical vectors: the protein sequence is represented by the occurrence frequencies of k neighboring amino acids45. Given a protein sequence P (Eq. 1), the kmer feature vectors of P are formulated as follows:

$${\rm{V}}({\rm{kmer}}\,-\,1)={[{x}_{1},{x}_{2},\ldots ,{x}_{i},\ldots ,{x}_{20}]}^{T}(0 < i\le 20)$$
(17)
$${\rm{V}}({\rm{kmer}}\,-\,2)={[{y}_{1},{y}_{2},\ldots ,{y}_{i},\ldots ,{y}_{400}]}^{T}(0 < i\le 400)$$
(18)

where xi and yi are the normalized occurrence frequencies of the 20 amino acid residues and the 400 dipeptides in the protein P, respectively.
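A minimal sketch of Eqs 17–18. Windows containing the dummy residue X are an edge case the text does not spell out; in this sketch such k-mers simply contribute no counts.

```python
from itertools import product

AAS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AAS, repeat=2)]  # all 400 pairs

def kmer1(seq):
    """Normalized single-residue frequencies (Eq. 17)."""
    return [seq.count(a) / len(seq) for a in AAS]

def kmer2(seq):
    """Normalized adjacent-pair frequencies (Eq. 18)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]
```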

Bi-Profile Bayes (BPB)

BPB26 comprehensively considers the information contained in both positive and negative samples; it has been successfully applied in many fields of bioinformatics and has yielded effective predictions26,46,47,48. Given a protein peptide P (Eq. 1), the BPB feature vector of P is defined as:

$${\rm{V}}={[{x}_{1},{x}_{2},\ldots ,{x}_{n},{x}_{n+1},\ldots ,{x}_{2n}]}^{T}$$
(19)

where V is the posterior probability vector; \({x}_{1},{x}_{2},\ldots ,{x}_{n}\) represents the posterior probability of each amino acid at each position in positive peptide sequence datasets; \({x}_{n+1},\ldots ,{x}_{2n}\) represents the posterior probability of each amino acid at each position in negative peptide sequence datasets. Two position-specific profiles for final model training, positive position-specific profiles and negative position-specific profiles, were generated by calculating the frequency of each amino acid at each position in the positive datasets and negative datasets, respectively.
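A sketch of the two position-specific profiles and the BPB encoding of Eq. 19, assuming all peptides share the same window length; the 21-letter alphabet includes the dummy residue X, which is an assumption of this sketch.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"            # 20 residues + dummy X (assumed)

def position_profile(peptides):
    """Frequency of each residue at each window position in a training set."""
    n = len(peptides[0])
    prof = np.zeros((len(ALPHABET), n))
    for p in peptides:
        for pos, aa in enumerate(p):
            prof[ALPHABET.index(aa), pos] += 1
    return prof / len(peptides)

def bpb_encode(peptide, pos_profile, neg_profile):
    """Look up the query residues in the positive and negative profiles (Eq. 19)."""
    idx = [ALPHABET.index(a) for a in peptide]
    pos_part = [pos_profile[i, j] for j, i in enumerate(idx)]   # x1 .. xn
    neg_part = [neg_profile[i, j] for j, i in enumerate(idx)]   # x(n+1) .. x2n
    return pos_part + neg_part
```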

Double Bi-Profile Bayes (DBPB)

DBPB is an improvement of BPB proposed by Shao et al.24. Whereas BPB encodes the posterior probability of each single amino acid at each position in the positive and negative datasets, DBPB encodes the posterior probability of each pair of adjacent amino acids at each position. Given a protein sequence P (Eq. 1), the DBPB feature vector of P is defined as:

$${\rm{V}}={[{x}_{1},{x}_{2},\ldots ,{x}_{n-1},{x}_{(n-1)+1},\ldots ,{x}_{2(n-1)}]}^{T}$$
(20)

where V is the posterior probability vector; \({x}_{1},{x}_{2},\ldots ,{x}_{n-1}\) represents the posterior probability of each dipeptide at each position in the positive peptide sequence datasets, and \({x}_{(n-1)+1},\ldots ,{x}_{2(n-1)}\) represents the same in the negative datasets. The two position-specific profiles used for final model training, positive and negative, were generated by calculating the frequency of each amino acid pair at each position in the positive and negative datasets, respectively.

Adapted Normal distribution Bi-Profile Bayes (ANBPB)

ANBPB21,49 improves BPB in another respect. Given a protein sequence P (Eq. 1), the ANBPB feature vector of P is defined as:

$${\rm{V}}={[{p}_{1},{p}_{2},\ldots ,{p}_{n},{p}_{n+1},\ldots ,{p}_{2n}]}^{T}$$
(21)

where \({p}_{1},{p}_{2},\ldots ,{p}_{n}\) is the posterior probability of each amino acid at each position in the positive peptide sequence datasets, and \({p}_{n+1},\ldots ,{p}_{2n}\) is defined from the posterior probability of each amino acid at each position in the negative datasets. The posterior probabilities \({p}_{1},{p}_{2},\ldots ,{p}_{2n}\) are coded by the adapted normal distribution as follows:

$${p}_{i}={\rm{\phi }}(x)=\frac{1}{\sqrt{2\pi }}\,{\int }_{-\infty }^{x}\,{e}^{-\frac{{t}^{2}}{2}}{\rm{dt}}$$
(22)

where φ(x) is the standard normal distribution function; a detailed description of the formula is given in refs 21,49.
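The normal-distribution coding of Eq. 22 can be sketched with the standard normal CDF. Note that the exact standardization applied to the raw BPB frequencies before the CDF is detailed in refs 21,49; the z-score used below is an assumption of this sketch.

```python
import math
import numpy as np

def phi(x):
    """Standard normal CDF of Eq. 22, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def anbpb_code(frequencies):
    """Map BPB frequencies through the normal CDF (standardization assumed)."""
    f = np.asarray(frequencies, float)
    z = (f - f.mean()) / (f.std() + 1e-12)    # assumed standardization step
    return [phi(v) for v in z]
```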

Incorporating Amino Acid Pairwise (IAAPair)

In IAAPair, the posterior probability of every two adjacent amino acids and every two next-nearest amino acids at each position in the negative peptide sequence datasets is subtracted from that in the positive datasets19. Given a protein sequence P (Eq. 1), the IAAPair feature vector of P is defined as:

$${\rm{V}}={[{p}_{1},{p}_{2},\ldots ,{p}_{j},\ldots ,{p}_{[(n-1)+(n-2)]}]}^{T}$$
(23)
$$+\,{\rm{V}}={[\begin{array}{ccc}+{{p}^{0}}_{1,1},+{{p}^{0}}_{1,2}, & \cdots & ,+{{p}^{0}}_{1,n-1},\\ \vdots & \ddots & \vdots \\ +{{p}^{0}}_{{21}^{2},1},+{{p}^{0}}_{{21}^{2},2}, & \cdots & ,+{{p}^{0}}_{{21}^{2},n-1},\end{array}\begin{array}{ccc}+{{p}^{1}}_{1,1},+{{p}^{1}}_{1,2}, & \cdots & ,+{{p}^{1}}_{1,n-2},\\ \vdots & \ddots & \vdots \\ +{{p}^{1}}_{{21}^{2},1},+{{p}^{1}}_{{21}^{2},2}, & \cdots & ,+{{p}^{1}}_{{21}^{2},n-2},\end{array}]}^{T}$$
(24)
$$-\,{\rm{V}}={[\begin{array}{ccc}-{{p}^{0}}_{1,1},-{{p}^{0}}_{1,2}, & \cdots & ,-{{p}^{0}}_{1,n-1},\\ \vdots & \ddots & \vdots \\ -{{p}^{0}}_{{21}^{2},1},-{{p}^{0}}_{{21}^{2},2}, & \cdots & ,-{{p}^{0}}_{{21}^{2},n-1},\end{array}\begin{array}{ccc}-{{p}^{1}}_{1,1},-{{p}^{1}}_{1,2}, & \cdots & ,-{{p}^{1}}_{1,n-2},\\ \vdots & \ddots & \vdots \\ -{{p}^{1}}_{{21}^{2},1},-{{p}^{1}}_{{21}^{2},2}, & \cdots & ,-{{p}^{1}}_{{21}^{2},n-2},\end{array}]}^{T}$$
(25)
$${\rm{V}}=(\,+\,{\rm{V}})-(\,-\,{\rm{V}})$$
(26)
$${p}_{j}=\{\begin{array}{ll}\pm {{p}^{0}}_{1,j} & when\,{R}_{t}{R}_{t+1}=AA\,and\,1\le j\le n-1\\ \pm {{p}^{0}}_{2,j} & when\,{R}_{t}{R}_{t+1}=AC\,and\,1\le j\le n-1\\ & \vdots \,\,\,\,\,\,\,\vdots \\ \pm {{p}^{0}}_{({21}^{2}-1),j} & when\,{R}_{t}{R}_{t+1}=XY\,and\,1\le j\le n-1\\ \pm {{p}^{0}}_{{21}^{2},j} & when\,{R}_{t}{R}_{t+1}=XX\,and\,1\le j\le n-1\\ \pm {{p}^{1}}_{1,j} & when\,{R}_{t}{R}_{t+1}=AA\,and\,n\le j\le [(n-1)+(n-2)]\\ \pm {{p}^{1}}_{2,j} & when\,{R}_{t}{R}_{t+1}=AC\,and\,n\le j\le [(n-1)+(n-2)]\\ & \,\,\vdots \,\,\,\,\,\,\,\vdots \\ \pm {{p}^{1}}_{({21}^{2}-1),j} & when\,{R}_{t}{R}_{t+1}=XY\,and\,n\le j\le [(n-1)+(n-2)]\\ \pm {{p}^{1}}_{{21}^{2},j} & when\,{R}_{t}{R}_{t+1}=XX\,and\,n\le j\le [(n-1)+(n-2)]\end{array}$$
(27)

where V is the posterior probability vector (in this feature, the C in the middle of the peptide sequence must not be omitted). When \(1\le j\le n\,-\,1\), pj represents the posterior probability of every two nearest amino acids, and when \(n\le j\le [(n-1)+(n-2)]\), pj represents the posterior probability of every two next-nearest amino acids. +pi,j and −pi,j represent the posterior probabilities of every two nearest and every two next-nearest amino acids at each position in the positive and negative peptide sequence datasets, respectively; \(\pm {p}_{i,j}=(\,+\,{p}_{i,j})-(\,-\,{p}_{i,j})\) gives the feature values.

Position-specific Tri-Amino Acid Propensity (PSTAAP)

In PSTAAP, the posterior probability of every three adjacent amino acids at each position in the negative peptide sequence datasets is subtracted from that in the positive datasets20,27. Given a cysteine peptide fragment P (Eq. 1), the PSTAAP feature vector of P is defined as follows:

$${\rm{V}}={[{p}_{1},{p}_{2},\ldots ,{p}_{j},\ldots ,{p}_{n-2}]}^{T}$$
(28)
$$+{\rm{V}}={[\begin{array}{ccc}+{p}_{1,1},+{p}_{1,2}, & \cdots & ,+{p}_{1,n-2}\\ \vdots & \ddots & \vdots \\ +{p}_{{21}^{3},1},+{p}_{{21}^{3},2}, & \cdots & +{p}_{{21}^{3},n-2}\end{array}]}^{T}$$
(29)
$$-{\rm{V}}={[\begin{array}{ccc}-{p}_{1,1},-{p}_{1,2}, & \cdots & ,-{p}_{1,n-2}\\ \vdots & \ddots & \vdots \\ -{p}_{{21}^{3},1},-{p}_{{21}^{3},2}, & \cdots & ,-{p}_{{21}^{3},n-2}\end{array}]}^{T}$$
(30)
$${\rm{V}}=(\,+\,{\rm{V}})-(\,-\,{\rm{V}})$$
(31)
$${p}_{j}=\{\begin{array}{ll}\pm {p}_{1,j} & when\,{R}_{t}{R}_{t+1}{R}_{t+2}=AAA\\ \pm {p}_{2,j} & when\,{R}_{t}{R}_{t+1}{R}_{t+2}=AAC\\ \vdots & \,\,\,\,\,\,\vdots \\ \pm {p}_{({21}^{3}-1),j} & when\,{R}_{t}{R}_{t+1}{R}_{t+2}=XXY\\ \pm {p}_{{21}^{3},j} & when\,{R}_{t}{R}_{t+1}{R}_{t+2}=XXX\end{array}$$
(32)

where V is the posterior probability vector; +pi,j represents the posterior probability of each tri-amino-acid at each position in the positive dataset; −pi,j represents the same in the negative dataset; and \(\pm {p}_{i,j}=(\,+\,{p}_{i,j})-(\,-\,{p}_{i,j})\) gives the feature values.
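A sketch of the PSTAAP encoding (Eqs 28–32), assuming equal-length training peptides; the names are illustrative, not the paper's implementation.

```python
from collections import Counter

def tri_profile(peptides):
    """For each window position j, count the tripeptides starting at j."""
    n = len(peptides[0])
    return [Counter(p[j:j + 3] for p in peptides) for j in range(n - 2)]

def pstaap_encode(peptide, pos_peptides, neg_peptides):
    """Positive-minus-negative tripeptide frequency at each position (Eq. 31)."""
    pos_prof, neg_prof = tri_profile(pos_peptides), tri_profile(neg_peptides)
    npos, nneg = len(pos_peptides), len(neg_peptides)
    return [pos_prof[j][peptide[j:j + 3]] / npos
            - neg_prof[j][peptide[j:j + 3]] / nneg
            for j in range(len(peptide) - 2)]
```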

It should be noted that a very powerful web server called 'Pse-in-One'25, together with its updated version 'Pse-in-One 2.0'45, has recently been established; it can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the user's needs. In the current study, the PC-PseAAC, PC-PseAAC_G, and basic kmer feature vectors were obtained from this web server.

Information Gain (IG)

The IG44,46,47,48 method is usually used to rank the importance of positions and amino acid residues. IG measures the decrease in entropy when a given feature is used to group the values of another (class) feature. The entropy of a feature X is defined by:

$${\rm{H}}({\rm{X}})=-\,\sum _{i}\,P({x}_{i})\,{\log }_{2}\,P({x}_{i})$$

where {xi} is the set of values of X and P(xi) is the prior probability of xi. If Y is another feature, the conditional entropy of X given Y is defined as:

$${\rm{H}}({\rm{X}}|{\rm{Y}})=-\,\sum _{j}\,P({y}_{j})\,\sum _{i}\,P({x}_{i}|{y}_{j})\,{\log }_{2}\,P({x}_{i}|{y}_{j})$$
(33)

where \(P({x}_{i}|{y}_{j})\) is the posterior probability of xi given the value yj of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain:

$${\rm{IG}}({\rm{X}}|{\rm{Y}})={\rm{H}}({\rm{X}})\,-\,{\rm{H}}({\rm{X}}|{\rm{Y}})$$
(34)

According to this measure, Y is more strongly correlated with X than with Z if IG(X|Y) > IG(Z|Y). When extracting IG scores for positions, Y represents the amino acid type; when extracting IG scores for amino acids, Y represents the amino acid frequency.

The IG scores of positions and amino acid residues were calculated as follows:

(1) The importance of positions: the 20 amino acid residues (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y) were coded as the digits 1 to 20, so that each query sequence segment becomes a digital sequence with one digit per position.

(2) The importance of amino acid residues: the amino acid frequencies in the sequence surrounding the query site (the site itself is not counted) were calculated, coding each query sequence as a 20-dimensional feature vector.

(3) The IG score for positions was computed from the coding in (1), and the IG score for amino acid residues from the coding in (2). The positions and amino acid residues were then ranked by IG score, and the key positions and key amino acid residues were selected.

In this work, we used the IG score to calculate the importance of amino acid residues:

$$\begin{array}{rcl}{\rm{IG}}({\rm{X}}|{{\rm{Y}}}_{i}) & = & {\rm{H}}({\rm{X}})-{\rm{H}}({\rm{X}}|{{\rm{Y}}}_{i})\\ & = & -\,\sum _{x\in \{0,1\}}\,P(x)\,{\log }_{2}\,P(x)+\sum _{{y}_{j}\in \{0,1\}}\,P({y}_{j})\,\sum _{x\in \{0,1\}}\,P(x|{y}_{j})\,{\log }_{2}\,P(x|{y}_{j})\end{array}$$
(35)

Equation (35) consists of two parts: the former is the entropy H(X) of the class X, and the latter is the conditional entropy \({\rm{H}}({{\rm{X}}|{\rm{Y}}}_{i})\) of X given the amino acid feature Yi.

Suppose the number of training samples is N. For each binarized feature yj, we count its occurrences across the training samples jointly with the class label X, and estimate the required probabilities from these counts:

$$P({y}_{j}=1)=\frac{count({y}_{j}=1)}{N}$$
(36)
$$P({y}_{j}=0)=1-P({y}_{j}=1)$$
(37)
$$P(X=0|{y}_{j}=1)=\frac{count(X=0,{y}_{j}=1)}{count({y}_{j}=1)}$$
(38)
$$P(X=1|{y}_{j}=1)=\frac{count(X=1,{y}_{j}=1)}{count({y}_{j}=1)}$$
(39)
$$P(X=0|{y}_{j}=0)=\frac{count(X=0,{y}_{j}=0)}{count({y}_{j}=0)}$$
(40)
$$P(X=1|{y}_{j}=0)=\frac{count(X=1,{y}_{j}=0)}{count({y}_{j}=0)}.$$
(41)
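The binarized IG computation of Eqs 35–41 reduces to a few lines; in the sketch below, `X` holds the 0/1 class labels and `y` one binarized feature column, matching the notation above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution, ignoring zero probabilities."""
    p = np.asarray([q for q in p if q > 0], float)
    return -np.sum(p * np.log2(p))

def info_gain(X, y):
    """IG(X|y) = H(X) - H(X|y) for binary class labels X and binary feature y."""
    X, y = np.asarray(X), np.asarray(y)
    N = len(X)
    h_x = entropy([np.mean(X == 0), np.mean(X == 1)])          # H(X)
    h_cond = 0.0
    for v in (0, 1):                                           # Eqs 36-41
        mask = (y == v)
        if mask.any():
            p_y = mask.sum() / N
            h_cond += p_y * entropy([np.mean(X[mask] == 0),
                                     np.mean(X[mask] == 1)])
    return h_x - h_cond                                        # Eq. 35
```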

Max-Relevance-Max-Distance (MRMD)

MRMD31 is a feature selection method for dimensionality reduction, which can be divided into two aspects.

(1) The relevance between the sub-feature set and the target class. Pearson's correlation coefficient is used to measure this relevance: the larger the coefficient, the stronger the relevance between feature and target class.

(2) The redundancy of the sub-feature set. Three kinds of distance functions are used to calculate the redundancy: the larger the feature distance, the lower the redundancy of the sub-feature set.

The features with the largest sum of relevance and distance are chosen as the ultimate sub-feature set. The sub-feature set generated by MRMD therefore has low redundancy and strong relevance to the target class.

To describe the algorithm clearly, we list some definitions in the following subsections. Given an input dataset of N instances, M features \({\rm{F}}=\{{{\rm{f}}}_{i},i=1,\ldots ,M\}\) and the target class C, the aim is to find a subspace of m features, selected from the original M-dimensional feature space, that makes the greatest contribution to classifying the target class C.

Max-relevance (MR)

Making the greatest contribution to classification typically requires maximal relevance between the selected subspace and the target class C, i.e., selecting a feature set with the highest relevance to C. Pearson's correlation coefficient measures both positive and negative correlation; because it is suitable for continuous variables and easy to implement, it is adopted as the measure of relevance between a feature and the target class C.

The value of MR for feature i can be defined as follows.

$$max\,{{\rm{MR}}}_{i}=|{\rm{PCC}}({\overrightarrow{F}}_{i},\overrightarrow{C})|\,(1\le i\le M)$$
(42)

where \({\overrightarrow{F}}_{i}\) is the vector collecting the i-th feature value of each instance, and \(\overrightarrow{C}\) is the vector of target-class values over the instances; \({\rm{PCC}}({\overrightarrow{F}}_{i},\overrightarrow{C})\) denotes their Pearson's correlation coefficient.

Max-Distance (MD)

MRMD realizes minimal redundancy through a distance function, namely maximal distance, which measures the level of similarity between two feature vectors. Three types of distance function can be chosen: Euclidean distance, cosine similarity and the Tanimoto coefficient. Euclidean distance is the easiest to calculate; cosine similarity, by contrast, focuses on the angle between two vectors; and the Tanimoto coefficient is a generalization of the Jaccard coefficient, to which it reduces under binary conditions. For each feature i (\(1\le i\le M\)), a distance value is computed with each of the three distance functions according to the following formulas, giving EDi, COSi and TCi, respectively.

$${{\rm{ED}}}_{i}=\frac{1}{M-1}\,\sum _{k\ne i}\,ED({\overrightarrow{F}}_{i},{\overrightarrow{F}}_{k})\,(1\le k\le M)$$
(43)
$${{\rm{COS}}}_{i}=\frac{1}{M-1}\,\sum _{k\ne i}\,COS({\overrightarrow{F}}_{i},{\overrightarrow{F}}_{k})\,(1\le k\le M)$$
(44)
$${{\rm{TC}}}_{i}=\frac{1}{M-1}\,\sum _{k\ne i}\,TC({\overrightarrow{F}}_{i},{\overrightarrow{F}}_{k})\,(1\le k\le M)$$
(45)

From the three formulas above, there are four ways to obtain the final value of MD.

$$max\,{{\rm{MD}}}_{i}={{\rm{ED}}}_{i}\,(1\le i\le M)$$
(46)
$$max\,{{\rm{MD}}}_{i}={\mathrm{COS}}_{i}\,(1\le i\le M)$$
(47)
$$max\,{{\rm{MD}}}_{i}={{\rm{TC}}}_{i}(1\le i\le M)$$
(48)
$$mean\,{{\rm{MD}}}_{i}=\frac{1}{3}({{\rm{ED}}}_{i}+{\mathrm{COS}}_{i}+{{\rm{TC}}}_{i})\,(1\le i\le M)$$
(49)

By MD we thus obtain the top m features, which constitute the sub-feature set with minimal redundancy.

MRMD

The criterion combining the two constraints above is called "Max-Relevance-Max-Distance" (MRMD). With these preparations complete, the feature subspace can be selected. The algorithm optimizes the following condition:

$$max\,({{\rm{w}}}_{r}{{\rm{MR}}}_{i}+{{\rm{w}}}_{d}{{\rm{MD}}}_{i})\,(1\le i\le M)$$

For a specific problem, MR and MD need not be equally important; the variables wr \((1\le {{\rm{w}}}_{r}\le M)\) and wd \((1\le {{\rm{w}}}_{d}\le M)\) are therefore the weights of MR and MD, respectively.
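A compact sketch of the full MRMD score under stated assumptions: relevance is the absolute Pearson correlation with the class (Eq. 42), distance is the mean Euclidean distance to the other feature vectors (Eq. 43), and the rescaling of MD so that it is commensurate with MR is an assumption of this sketch, as are the default weights.

```python
import numpy as np

def mrmd_rank(X, y, m, wr=1.0, wd=1.0):
    """Return the indices of the top-m features by the MRMD criterion."""
    M = X.shape[1]
    # Relevance: |Pearson correlation| between each feature column and the class.
    mr = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(M)])
    # Distance: mean Euclidean distance from each feature to all other features.
    md = np.array([np.mean([np.linalg.norm(X[:, i] - X[:, k])
                            for k in range(M) if k != i]) for i in range(M)])
    md = md / md.max()                    # assumed normalization for comparability
    score = wr * mr + wd * md             # weighted combination of MR and MD
    return np.argsort(score)[::-1][:m]
```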