Introduction

Cells are the fundamental units of the structure and function in the organism. Typically, a small cell may contain many thousands of proteins. Protein, as an essential substance in cells, affects the cells by interacting with other components, which plays a vital role in maintaining normal physiological functions in living organisms. In these interactions, protein–protein interactions (PPIs) have always been a hot spot for studying biological processes and thus have received widespread attention from more scholars. To fully understand both cell functions and biological phenomena, it is imperative to consider such an interesting and key question, namely, whether and how proteins interact with their partners, which is a special type of PPIs that are called self-interacting proteins (SIPs). SIPs are those proteins that have more than two copies of the protein that can actually interact with each other, among which the two SIP partners are the same copies and that can be expressed as the same gene. Hence, it can cause problems in the formation of homo-oligomer [1, 2]. Researchers found that homo-oligomerization has been an essential function of many biological processes, particularly in enzyme activation, signal transduction, immune response and gene expression regulation [3,4,5]. Previous works have demonstrated that SIPs play a critical role in the evolution of cellular physiological functions and protein interaction networks (PINs), which will also facilitate us to enhance our understanding of cellular functions through a systematic approach as well as provide a theoretical basis for developing novel drug targets and drug design methods [6,7,8]. Additionally, SIPs can effectively improve the stability of proteins and avoid the denaturation of proteins through decreasing its surface area. Consequently, it has become increasingly important to design an efficient and reliable computational method as a complement to the traditional experimental method for identifying SIPs.

Previously, numerous studies have been devoted to develop computational-based approaches for PPIs prediction [9, 10]. For instance, Wang et al. [11] presented a computational method for predicting PPIs from protein sequences by combining the Zernike moments descriptor with the probabilistic classification vector machines model. Zahiri et al. [12] introduced a sequence-based evolutionary information model named PPIevo for predicting PPIs, which extracts features from the position-specific scoring matrix of protein sequences and the results show that it has a better prediction performance in detecting PPIs. Huang et al. [13] proposed a novel computational method to predict PPIs. The proposed method was applied to global encoding on substitution matrix representation of protein sequences with the combination of weighted sparse representation classifier. In order to construct a sequence-based multiple classifier system for identifying PPIs, Xia et al. [14] adopted auto-correlation descriptors as a feature extraction algorithm to code both interacting and non-interacting protein pairs. An et al. [15] reported a method that used gray wolf optimization algorithm for generating feature vectors from protein sequences and adopted K-fold cross-validation as well as the relevance vector machine classifier to identify PPIs by considering the features of local and global of protein interaction positions. Shi et al. [16] proposed an efficient computational model based on protein sequences to predict PPIs by integrating correlation coefficient feature representation of protein sequences and support vector machine (SVM) classifier. Wang et al. [17] developed a pure biological language processing approach for the prediction of PPIs, which employed biological sequence features called bio-to-vector (Bio2Vec) as a novel representation and used the convolution neural network (CNN) to execute classification. Liu et al. [18] presented an approach called SPAR (self-interacting Protein Analysis serveR) to predict PPIs. The SPAR combined critical residues substitution (CRS) and tenfold cross-validation with the random forest algorithm for detecting PPIs. The SPAR obtained a good prediction performance in cross-species application. Nevertheless, although these methods can be used for predicting PPIs and have yielded some good prediction results, they also have some limitations. First, these computational models of PPIs are not fully applicable to the prediction of SIPs, and even if they can be predicted, the prediction results are usually not very effective. Second, compared with the computational models of PPIs, the existing computational models of SIPs are relatively few, and these models tend to ignore the problem of unbalanced data sets in SIPs, so the prediction performance of the models can be further improved by introducing reasonable computer techniques to deal with the problem of unbalanced data in the task of SIPs prediction. Therefore, it is particularly important to explore efficient and stable computational methods for large-scale SIPs detection by automated means nowadays.

In this paper, we put forward a novel computational scheme that integrates the gray level co-occurrence matrix (GLCM) feature extraction algorithm, adaptive synthetic (ADASYN) technique with weighted sparse representation based classification (WSRC) model for predicting SIPs from protein primary sequence information. We first transformed the SIPs sequences into position specific scoring matrices (PSSM) that can contain protein evolutionary information. Second, a novel feature descriptor called gray level co-occurrence matrix is employed to abstract salient and invariant feature vectors from the PSSM, and then the ADASYN technique is applied to balance the training dataset to create new feature vectors for classification. Finally, the optimized features are fed into the WSRC model to detect whether a protein is interacting or non-interacting with itself. The proposed model was performed on two benchmark SIPs datasets including yeast and human, which obtained high average accuracies of 98.10% and 91.51% using the five-fold cross-validation, respectively. Meanwhile, the comparison results, validated experimentally by the SVM-based method and other existing methods, reveal that the proposed model is effective and robust, and it is suitable for detecting potential SIPs.

The rest of the paper is organized as follows. In “Methods” section, we first introduce a highly reliable dataset for SIPs prediction, then give the evolutionary matrix representation, the position specific scoring matrix, and finally describe the proposed computational methods, which include adaptive synthetic sampling approach, gray level co-occurrence matrix (GLCM) feature descriptors, and weighted sparse representation based classification. In “Results and discussion” section, we give evaluation metrics for the prediction of SIPs, and discuss the prediction performance of the proposed model as well as compare it with other existing computational models through several comparison experiments. In “Conclusion” section, we give the conclusion of the paper.

Methods

Dataset

In this experiment, to construct a high reliability data source, we accessed the Uniprot database and downloaded human protein sequence data with a number of 20,199 from the database [19]. As we know, protein–protein interaction data can be collected from various databases, including BioGRID [20], DIP [21], InnateDB [22], IntAct [23] and MatrixDB [24]. In order to obtain the SIPs dataset required for the experiment, it is necessary to collect the PPIs data that can interact with itself in advance, that is, the data only contain the identical two interacting protein sequences, whose interaction type is referred to as ‘direct interaction’ in relational databases. In this way, we finally established 2994 human self-interacting protein sequences that were used to construct the experimental datasets.

To evaluate the prediction performance of the proposed model scientifically and efficiently, we screened the 2994 human SIPs datasets by the following three steps: [18] Firstly, we only retained those protein sequences with a length of more than 50 residues and less than 5000 residues from the whole human proteome. Secondly, to ensure the high quality of the SIPs data, we constructed the positive dataset used for this experiment, which has to meet at least one of the following three conditions: (a) There have been reported at least two publications for protein self-interaction; (b) The protein is referred to as a homo-oligomer (containing homodimer and homotrimer) in UniProt; (c) It is revealed by at least one small-scale experiment or two types of large-scale experiments. Finally, for constructing the human negative dataset, we remove all known SIPs from the entire human proteome (including proteins annotated as more extensive ‘physical association’ and ‘direct interaction’. Hence, the human dataset in this experiment consisted of 1441 SIPs and 15,938 non-SIPs. In addition, we also created the yeast dataset employing the same strategy, which contained 710 positive sample SIPs and 5511 negative sample non-SIPs.

Position specific scoring matrix

As a useful tool, position specific scoring matrix (PSSM) is formed through a set of sequences with structural or sequence similarity and was proposed by Gribskov et al. [25]. Meanwhile, it contains both the position information and evolutionary information of protein sequences, which is commonly employed to detect distantly related proteins. In addition, the PSSM is also widely applied in other areas such as prediction of membrane protein types [26], DNA-binding proteins prediction [27], prediction of protein structural classes [28], and drug-target interactions prediction [29] as well as obtaining excellent prediction results. Thus, considering that PSSM can preserve the evolutionary information of protein sequences as much as possible, we used it for predicting SIPs in this study. In this experiment, we employed Position Specific Iterated BLAST (PSI-BLAST) tool [30] to transform each protein sequence into a PSSM, and the vectors represented by these matrices can then be used to substitute protein sequences. Given a protein sequence, its PSSM may be represented as an \(H \times 20\) matrix, which can be denoted as below:

$$M = \{ M_{\alpha \beta } :\alpha = 1 \ldots H,\beta = 1 \ldots 20\}$$
(1)

where the row H on the M matrix indicates the length of a given protein sequence, and 20 indicates a total of 20 amino acids due to the fact that each protein sequence consists of 20 types of amino acids. Next, for the query protein sequence, PSSM assigns a score \(M_{\alpha \beta }\) to the \(\beta \begin{array}{*{20}c} {th} \\ \end{array}\) amino acid in the \(\alpha \begin{array}{*{20}c} {th} \\ \end{array}\) position by computing a position frequency matrix of each nucleotide in each position, so that the score \(M_{\alpha \beta }\) can be represented as:

$$M_{\alpha \beta } = \sum\limits_{k = 1}^{20} {p(\alpha ,k)} \times q(\beta ,k)$$
(2)

where \(p(\alpha ,k)\) stands for a matrix whose elements are the mutation value between two different amino acids, and \(q(\beta ,k)\) stands for the value of the Dayhoff’s mutation matrix between \(\beta \begin{array}{*{20}c} {th} \\ \end{array}\) and \(k\begin{array}{*{20}c} {th} \\ \end{array}\) amino acids.

In summary, to create experimental datasets for predicting SIPs and obtain highly and widely homologous information, PSI-BLAST was employed to generate the PSSM for each protein sequence in this paper. Here, the e-value parameter and iteration number of PSI-BLAST were set to 0.001 and 3, respectively. Eventually, we can express the PSSM of each protein sequence as a 20-dimensional matrix consisting of \(M \times 20\) elements.

Adaptive synthetic sampling method

Adaptive synthetic (ADASYN), an oversampling approach of processing the minority classes, was first introduced by He et al. [31] for learning from imbalanced data sets. The essential idea of ADASYN is to employ a systematic approach, weighted distribution for specific minority categories of observations, to adaptively generate different numbers of synthetic observations in accordance with their distribution. By balancing original data sets with large differences between positive and negative samples, this algorithm can synthesize more observations from the minority classes that are relatively difficult to classify and fewer observations from the minority classes that are fairly easy to classify, thus addressing the class imbalance problem [32]. Not only does the ADASYN algorithm reduce the bias caused by class imbalance, but also adaptively shifts the classification decision boundary for the classifier toward those minority observations that are relatively difficult to classify.

Suppose \(D_{o}\) is an original dataset consisting of N samples, in which \(n_{s}\) denotes the number of observations in the minor class (interacting pairs) and \(n_{l}\) denotes the number of observations in the major class (non-interacting pairs). The steps of ADASYN algorithm are briefly introduced below.

(1) Evaluate the level of class imbalance between samples below:

$$\begin{array}{*{20}c} {{{I = n_{s} } \mathord{\left/ {\vphantom {{I = n_{s} } {n_{l} }}} \right. \kern-\nulldelimiterspace} {n_{l} }},} & {} & {I \in } \\ \end{array} (0,1]$$
(3)

(2) To obtain new data \(D_{n}\) for the whole minority class, the ADASYN approach will generate some minority class observations into the original data set \(D_{o}\). The total number of synthetic observations to be generated can be computed as:

$$C = \alpha \left( {n_{l} - n_{s} } \right)$$
(4)

where \(\alpha \in [0,1]\) is a parameter, which is employed to identify the desired balance level after generating the synthetic observations. If \(\alpha { = }1,\) it means that a new data set whose samples of classes are fully balanced will be created after generating the synthetic observations.

(3) For every small instance \(x_{i} ,\) \((x_{i} \in n_{s} )\) in each minority class, find its k-nearest neighbors in the n-dimensional space using the Euclidean distance method, and then we can calculate the ratio \(r_{i}\) defined as:

$$r_{i} = {\Delta }_{i} /k,\quad i = 1, \ldots ,n_{s}$$
(5)

where \(r_{i} \in [0,1]\) and \({\Delta }_{i}\) is the number of observations that are contained in the majority class, namely the k-nearest neighbors of \(x_{i} .\)

(4) Normalize the value of \(r_{i}\) to a density distribution \(\widehat{{r_{i} }}\) according to the following equation, while making the sum of all \(\widehat{{r_{i} }}\) values equals 1.

$$\widehat{{r_{i} }} = r_{i} /\mathop \sum \limits_{i = 1}^{{n_{s} }} r_{i} \begin{array}{*{20}c} {} & {} & {\mathop \sum \limits_{i} \hat{r}_{i} = 1} \\ \end{array}$$
(6)

(5) For every small instance \(x_{i}\) in the minor class, the number of synthetic observations which need to be generated is calculated as:

$$c_{i} = C\hat{r}_{i}$$
(7)

(6) By selecting an instance \(x_{j}\) from the k-nearest neighbors of \(x_{i}\) in the minor class, we finally can generate a new synthetic observation according to the following formula:

$$s_{i} = x_{i} + \left( {x_{j} - x_{i} } \right) \times \lambda$$
(8)

where \(\lambda\) denotes a random number between 0 and 1, \(x_{j}\) is one of the nearest neighbor observations of \(x_{i} ,\) \(\left( {x_{j} - x_{i} } \right)\) represents the difference vector in n-dimensional spaces, and \(s_{i}\) is the new synthetic observation. Therefore, the ADASYN approach can automatically vote on the number of synthetic observations which need to be generated for each minority observation by utilizing a density distribution as a criterion.

Gray level co-occurrence matrix (GLCM) features descriptor

In the process of classifying interacting and non-interacting protein pairs using a computational approach, a good feature extraction algorithm is essential for predicting SIPs efficiently and accurately. Although PSSM can effectively represent the evolutionary information of proteins, different proteins may contain different sequence lengths, which makes the size of the constructed PSSM inconsistent and cannot be directly used to compose the feature vectors of protein sequences. Thus, in this paper, we use GLCM to extract protein evolutionary information from PSSM to obtain the same length of feature vector descriptors. The GLCM algorithm, a classical texture-based feature extraction method, was introduced by Haralick et al. [33], which is widely employed in a variety of different tasks, especially for extracting spatial variation of the matrix in image processing applications. A GLCM is generated by computing the pixel brightness values (gray levels) that have specific values and a specified spatial relationship in an image. This spatial relationship is defined by a parameter pair \((\vartheta ,d)\) in which \(\vartheta\) and \(d\) represent the direction of two pixels and the separation distance between two pixels, respectively, which denotes the pixel of interest and the pixel that is horizontally adjacent to it. Typically, we need to define a set of parameter pairs \((\vartheta ,d)\) and combine them with GLCM matrices to describe the rotational invariance of the GLCM by employing a set of rotational parameters. Generally, this parameter is set to eight orientations and spaced to \(\pi /4\) radians apart. The number of gray values \(N_{g} ,\) as an integer, denotes the number of unique brightness values presented in the image. Normally, the image is scaled from \([0,255]\) to \([0,N_{g} ]\) before calculating the GLCM, where \(N_{g}\) indicates the gray level and also determines the size of the gray level co-occurrence matrix [34].

In this experiment, the prominently used texture features of PSSM are extracted by using the GLCM algorithm, including contrast, correlation, energy, and homogeneity, which were defined by Haralick et al. [33]. The feature expressions of GLCM are shown below. Here, the variable \(P(i,j)\) in each expression denotes the value at the \((i,j)th\) position in a grey level co-occurrence matrix.

$$contrast = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} (i - j)^{2} P(i,j)$$
(9)
$$correlation = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} \frac{{\left( {i - \mu_{x} } \right)\left( {j - \mu_{y} } \right)P(i,j)}}{{\sigma_{x} \sigma_{y} }}$$
(10)

where \(\mu_{x} ,\) \(\mu_{y} ,\) \(\sigma_{x} ,\) \(\sigma_{y}\) are the averages and the variances of the row and column, respectively, which are defined as follows:

$$\begin{array}{*{20}c} {\mu_{x} = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} i \cdot P(i,j)} & {} & {\mu_{y} = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} j \cdot P(i,j)} \\ \end{array}$$
(11)
$$\begin{array}{*{20}c} {\sigma_{x} = \sqrt {\mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} \left( {i - \mu_{x} } \right)^{2} \cdot P(i,j)} } & {} & {\sigma_{y} = \sqrt {\mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} \left( {j - \mu_{y} } \right)^{2} \cdot P(i,j)} } \\ \end{array}$$
(12)
$$energy = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} P(i,j)^{2}$$
(13)
$$homogeneity = \mathop \sum \nolimits_{i,j = 0}^{{N_{g} - 1}} \frac{P(i,j)}{{1 + (i - j)^{2} }}$$
(14)

As a result, we obtained a set of 60-dimensional statistical feature vectors from the PSSM of each protein sequence, using the GLCM feature extraction algorithm. In order to reduce the redundancy of features and computational burden, we first normalize all input feature vectors using the zero-mean normalization method. Second, considering the existence of unbalanced data samples, these feature vectors are fed into the ADASYN algorithm to generate new training samples to construct a relatively balanced data set so as to obtain an efficient and robust prediction model. Finally, the obtained new training features from the ADASYN algorithm are fed into the classification model for further feature classification.

Weighted sparse representation based classification

Currently, machine learning algorithms used to construct classification models including Naive Bayes, decision trees and sparse representation classifiers, have been widely used in many fields. Sparse representation classifier (SRC), a popular nonparametric algorithm that is extensively applied in machine learning, was originally proposed by Wright et al. [35], which is analogous to the nearest neighbor and the nearest subspace approaches [36, 37]. The basic idea of SRC is to represent the individual test set by linearly combining the whole training set from original sample data. Then the sparsest representation of the individual test set is found in the dictionary. Finally, the new tests are assigned to the category with the minimum residual according to those representations. Although SRC has achieved good experimental results in many applications such as face recognition and text classification [38, 39], its prediction accuracy and classification effectiveness can be further enhanced. The SRC algorithm is briefly described as follows.

Considering a training instance set \(T \in R^{d \times n} ,\) where \(d\) means the dimension of feature vectors and \(n\) means the number of training instances. Let \(k\) denote the class number in the instance dataset. The \(n_{k}\) instances belonging to the kth class can be expressed as a submatrix \(T_{k} = \left[ {l_{k1} ,l_{k2} \ldots l_{{kn_{k} }} } \right],\) then the whole training set can be further rewritten as \(T = \left[ {\begin{array}{*{20}c} {T_{1} } & {T_{2} } & \ldots & {T_{K} } \\ \end{array} } \right],\) where \(K\) denotes the class number of the whole instance. Assuming that there is a new testing instance \(x \in R^{d}\) belonging to the kth class, the sparse representation is to find such a column vector \(\alpha = \left[ {\alpha_{k,1} ,\alpha_{k,2} , \ldots ,\alpha_{{k,n_{k} }} } \right]\) which satisfies the following condition:

$$x = \alpha_{k,1} l_{k,1} + \alpha_{k,2} l_{k,2} + \cdots + \alpha_{{k,n_{k} }} l_{{k,n_{k} }}$$
(15)

When representing the entire training instance set, this equation can be further rewritten as follows:

$$x = T\alpha_{0}$$
(16)

According to the sparse representation method, we note that the nonzero entries in \(\alpha_{0}\) are only related to the kth class, which can be denoted as

$$\alpha_{0} = \left[ {0, \cdots ,0,\alpha_{k,1} ,\alpha_{k,2} \cdots \alpha_{{k,n_{k} }} ,0, \cdots ,0} \right]^{T}$$
(17)

Next, in the SRC algorithm, we need to solve the following \(l_{0}\)-norm minimization problem:

$$\begin{array}{*{20}c} {\hat{\alpha }_{0} = \arg \min \parallel \alpha \parallel_{0} } & {} & {{\text{subject}}} \\ \end{array} {\text{ to }}x = T\alpha$$
(18)

Since the solution of \(\hat{\alpha }_{0}\) is an NP-hard problem, we need to optimize this problem. When \(\alpha\) is sufficiently sparse, the problem can be solved in this way by solving the \(l_{1}\)-minimization problem instead of solving the \(l_{0}\)-minimization directly.

$$\begin{array}{*{20}c} {\hat{\alpha }_{1} = \arg \min \parallel \alpha \parallel_{1} } & {} & {{\text{subject}}} \\ \end{array} {\text{ to }}x = T\alpha$$
(19)

To avoid the occlusion problem and enhance the generalization capability of the SRC algorithm, the \(l_{1}\)-norm minimization is further extended to the following stable \(l_{1}\)-norm minimization problem by introducing \(\varepsilon ,\) a threshold of the reconstruction error.

$$\begin{array}{*{20}c} {\hat{\alpha }_{1} = \arg \min \parallel \alpha \parallel_{1} } & {} & {{\text{subject}}} \\ \end{array} {\text{ to }}\parallel x - T\alpha \parallel \le \varepsilon$$
(20)

Subsequently, the given test instance \(x\) is assigned to class \(k\) by computing the smallest reconstruction residual, which can be expressed as follows:

$$r_{k} (x) = \parallel x - T\hat{\alpha }_{1}^{k} \parallel ,\quad k = 1,2, \ldots ,K$$
(21)

where \(T\hat{\alpha }_{1}^{k}\) is the reconstructed value that is obtained from the training instance of class \(k,\) \(r_{k}\) denotes the residual, and \(K\) is the class number of the whole instance. Finally, the class with the minimum residual will be obtained, which will be used as the prediction label of the test instance \(x.\)

In this study, a new classification model, weighted sparse representation-based classifier (WSRC), is used to predict SIPs based on a novel feature extraction description of protein sequences. WSRC is a variant of the traditional sparse representation classifier, which can enhance the classification performance of prediction models [40]. WSRC utilizes the distance information to represent the test samples and assigns weights to the samples in the training set, whereas the typical SRC does not explore the distance or similarity relationship from individual training samples to the test samples. However, previous research has confirmed that the locality of data is also more essential than sparsity in some cases [41, 42]. According to this assumption, the WSRC model needs to integrate the locality structure of the data based on the traditional sparse representation in order to evaluate the importance of each training instance in representing the testing instance. WSRC employs the Gaussian kernel distance that can capture the nonlinear information within the original dataset to calculate the weights. For the given two instances, \(s_{1}\) and \(s_{2} ,\) the distance based on the Gaussian kernel between them is as follows:

$$d_{G} \left( {s_{1} ,s_{2} } \right) = e^{{ - \parallel s_{1} - s_{2} \parallel^{2} /2\sigma^{2} }}$$
(22)

where \(\sigma\) is the Gaussian kernel width that needs to be specified in advance in the experiment. Using Gaussian kernel distance as a nonlinear mapping to compute weights, WSRC can effectively capture the locally nonlinear information within the dataset. By this way, the WSRC algorithm will turn to solve the following \(l_{1}\)-norm minimization problem:

$$\begin{array}{*{20}c} {\hat{\alpha }_{1} = \arg \min \parallel W\alpha \parallel_{1} } & {} & {{\text{subject}}} \\ \end{array} {\text{ to }}x = T\alpha$$
(23)

and more specifically,

$$diag(W) = \left[ {d_{G} \left( {x,t_{1}^{1} } \right), \ldots ,d_{G} \left( {x,t_{{n_{k} }}^{k} } \right)} \right]^{T}$$
(24)

where \(W\) is a block-diagonal matrix and \(n_{k}\) is the number of the training samples in the kth class. Similarly, to simplify this problem, the WSRC algorithm will further turn to solve the stable \(l_{1}\)-norm minimization problem, which can be expressed as follows:

$$\begin{array}{*{20}c} {\hat{\alpha }_{1} = \arg \min \parallel W\alpha \parallel_{1} } & {} & {{\text{subject}}} \\ \end{array} {\text{ to }}\parallel x - T\alpha \parallel \le \varepsilon$$
(25)

where \(\varepsilon > 0\) is a threshold, namely the tolerance value of the reconstruction error using a linear combination of the training samples to denote the test samples.

Results and discussion

Performance evaluation

In order to assess the effectiveness and feasibility of the proposed method in this paper, we used the following measures, namely accuracy (Acc.), sensitivity (Sen.), specificity (Spe.), precision (Pre.) and Matthews correlation coefficient (MCC), as the prediction performance indicators of the model in this experiment, which are expressed as:

$$Acc. = \frac{TP + TN}{{TP + TN + FP + FN}}$$
(26)
$$Sen. = \frac{TP}{{FN + TP}}$$
(27)
$$Spe. = \frac{TN}{{FP + TN}}$$
(28)
$$Pre. = \frac{TP}{{TP + FP}}$$
(29)
$$MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {{(}TP + FP{)} \times {(}TP + FN{)} \times {(}TN + FP{)} \times {(}TN + FN{)}} }}$$
(30)

In the above formula, where TP represents true positives, meaning the count of those samples that have interacted pairs are predicted correctly by the model, FP represents false positives, meaning the number of those samples that are true non-interacting pairs are judged to be interacting pairs by the model, TN represents true negatives, meaning the number of those samples that have true non-interacting pairs are predicted correctly by the model, and FN represents false negatives, meaning the count of those samples that are true interacting pairs are judged to be non-interacting pairs by the model. Additionally, to clearly visualize the performance of our model for predicting SIPs classification results, we also plotted the receiver operating characteristic curve (ROC) and computed the AUC (area under the ROC) and the AUPR (area under the precision-recall curve) as an important evaluation metric [43]. The main workflow of the proposed model is shown in Fig. 1.

Fig. 1
figure 1

Flow chart of the proposed model for predicting potential SIPs

Performance of the proposed method

In this study, we used two standard datasets, namely yeast and human for validating the performance of the proposed model in predicting SIPs. For the sake of preventing the overfitting phenomenon affecting the prediction results as much as possible, we employed a five-fold cross-validation method in the experiment and separated the original SIPs experimental dataset into the training set and independent test set. Here, taking the yeast dataset as an example, we described it in detail by splitting the entire dataset into five non-overlapping parts, where 4 parts are used as training samples and the remaining 1 part is taken as testing samples. After this, we can obtain five separate models and can then perform five separate SIPs experiments utilizing the proposed method. In the end, the experimental results achieved by our method on the yeast dataset were represented as the mean and standard deviation of five predicted outcomes. Similarly, the same strategy was also applied to the human dataset.

As shown in Tables 1 and 2, the proposed method for predicting SIPs combined with five-fold cross-validation yielded satisfying results on yeast and human datasets. From Table 1, it is evident that the overall accuracies of these five experiments are above 97% for the yeast dataset. More specifically, the accuracies of five experiments are 98.23%, 98.07%, 97.35%, 98.23% and 98.63%, respectively. The average accuracy is 98.10%, the average sensitivity is 87.17%, the average specificity is 99.51%, the average precision is 95.87%, and the average MCC is 90.51%, with standard deviations of 0.47%, 2.08%, 0.26%, 2.09% and 2.09%, respectively. Similarly, our method also obtained better experimental results on larger human datasets. The accuracies of each experiment are 92.40%, 91.80%, 91.19%, 91.02% and 91.15%, respectively. The values of average accuracy, sensitivity, specificity, precision and MCC are 91.51%, 12.75%, 98.63%, 46.11% and 25.31%, respectively, with the corresponding standard deviations of 0.58%, 1.40%, 0.28%, 4.14% and 1.58%, respectively. Meanwhile, the ROC and AUPR curves plotted by our model for detecting SIPs on yeast and human datasets are shown in Figs. 2 and 3. Among them, the average AUC values of the model on yeast and human datasets are 99.10% and 70.45%, respectively, and the average AUPR values of the model on yeast and human datasets are 85.21% and 15.75%, respectively. In these figures, y-axis and x-axis refer to true positive rate (TPR) and false positive rate (FPR), respectively. In addition, these high assessment standard values and relatively small standard deviations in the experimental results indicate that the proposed method, as a computational method, is accurate and reliable in predicting SIPs.

Table 1 Five-fold cross-validation prediction results using the WSRC-based method on the yeast dataset
Table 2 Five-fold cross-validation prediction results using the WSRC-based method on the human dataset
Fig. 2
figure 2

The ROC and AUPR performance of WSRC-based method on yeast SIPs dataset

Fig. 3
figure 3

The ROC and AUPR performance of WSRC-based method on human SIPs dataset

Prediction performance of the support vector machine-based method

It is remarkable that the proposed model achieved better prediction performance on two benchmark datasets and can be used to detect SIPs fairly well. However, Support Vector Machine (SVM), as a widespread data mining algorithm, has strong practicality both in machine learning and in pattern recognition, which has an excellent prediction performance especially in dealing with classification and regression problems [44]. Hence, to better understand the predictive performance of our classifier, we try to adopt the most popular SVM instead of WSRC to perform SIPs, which is a comparison experiment for the proposed method. Specifically, the same feature extraction method was employed in conjunction with the SVM classifier to execute the prediction of SIPs on yeast and human datasets, respectively. Here, we carry out the SVM classification task employing the LIBSVM toolbox [45], which can be downloaded from website https://www.csie.ntu.edu.tw/~cjlin/libsvm/. For ensuring fairness, we optimized the relevant parameters of the SVM by selecting the radial basis function as the kernel function. The parameters c and g were set to 0.5 and 1, respectively, on the yeast dataset, and the parameters c and g were set to 4 and 8, respectively, on the human dataset, which was determined by using the grid search method and other parameters were set as default values.

Tables 3 and 4 summarized the predicted results of SIPs using two classifiers in combination with the fivefold cross-validation method on the yeast and human datasets. As seen from Table 3, the GLCM-SVM method obtained an average accuracy of 95.60% on the yeast dataset, wherein the accuracies of the five models are 95.42%, 95.34%, 94.77%, 96.54%, and 95.90% respectively. Nevertheless, the GLCM-WSRC approach achieved an average accuracy of 98.10% in predicting SIPs, which is indeed 2.5% larger than the average accuracy gained by the SVM approach. Likewise, as shown in Table 4, the GLCM-SVM method yielded an average accuracy of 87.86% on the human dataset, of which the accuracies of the five experiments are 88.40%, 87.74%, 87.71%, 87.42%, and 88.01% respectively. The highest accuracy rate obtained based on the SVM model is 2.62% lower than the minimum accuracy rate achieved based on the WSRC model as compared to the predicted results of the WSRC method on the human dataset. The predictive performance of the ROC and AUPR curves based on the SVM method on the yeast and human datasets is shown in Figs. 4 and 5. Among them, the average AUC values of the model on yeast and human datasets are 98.96% and 69.35%, respectively, and the average AUPR values of the model on yeast and human datasets are 69.87% and 14.88%, respectively. Consequently, it can be seen from these evaluation metrics that the experimental results obtained by the WSRC classifier are superior to those obtained by the SVM classifier in detecting SIPs, which further indicates that the proposed computational model can provide a useful supplementary tool for predicting SIPs, as well as other bioinformatics tasks.

Table 3 Five-fold cross-validation prediction results using the SVM-based method on the yeast dataset
Table 4 Five-fold cross-validation prediction results using the SVM-based method on the human dataset
Fig. 4
figure 4

The ROC and AUPR performance of SVM-based method on yeast SIPs dataset

Fig. 5
figure 5

The ROC and AUPR performance of SVM-based method on human SIPs dataset

Comparison with other methods

Currently, several computational models based on protein sequences have been proposed to detect SIPs. In this section, to further objectively evaluate the superior performance of our model, we compared it with the following six methods that have been shown to achieve a good prediction performance on the same two standard datasets. These existing methods include SLIPPER [46], CRS [18], SPAR [18], DXECPPI [47], PPIevo [12] and LocFuse [48], which are all classical methods designed to detect SIPs. The comparison results between the proposed method and these six methods are clearly given in Tables 5 and 6, which indicated the good performance of our method, using ADASYN algorithm, GLCM feature extraction, and WSRC classifier to predict SIPs in this paper, as compared to the previous computational methods. From Table 5, we can see that the proposed method yielded a high accuracy of 98.10% when detecting the SIPs of the yeast dataset, which is significantly higher than the six existing methods obtaining relatively low accuracies between 66.28% and 87.46%. Furthermore, compared with the other methods, the proposed method achieves relatively low standard deviations with respect to these evaluation metrics, which also implies that our model has a more robust predictive performance in predicting SIPs. At the same instant, we obtained relatively good prediction results from Table 6 when detecting the SIPs of the human dataset. The accuracy of the proposed predictor is 91.51%, which is 60.61% higher than the DXECPPI method, 13.47% higher than the PPIevo method, and 10.85% higher than the LocFuse method. These comparison results further demonstrate that the proposed method is capable of efficiently predicting SIPs from protein sequences.

Table 5 Predictive performance of seven different methods on yeast dataset
Table 6 Predictive performance of seven different methods on human dataset

The main reasons for the proposed method yielding better prediction results come from the following points: (1) PSSM can contain protein evolutionary information. (2) The GLCM feature extraction algorithm can accurately capture useful knowledge from the PSSM. (3) The ADASYN method can efficiently integrate training features to balance the training sample set and reduce the impact of noise. (4) WSRC can effectively discover differences between different types and improve the prediction performance for the classification tasks dealing with interacting and non-interacting proteins, by integrating both sparsity and data locality structure into traditional SRC. (5) Compared with other methods, WSRC can automatically obtain a good prediction result in detecting SIPs, which is mainly due to the fact that it could keep the same parameters in all SIPs experiments and does not require much manual intervention to adjust its parameters.

Conclusion

Recently, the rise of machine learning techniques constantly promotes highly interdisciplinary research in different fields such as protein–protein interaction prediction, drug-target interaction prediction, and drug-disease association prediction. In this study, we present a novel computational method named GLCM-WSRC, which combines GLCM feature extraction algorithm, ADASYN technique with WSRC classification model for predicting SIPs based on protein evolutionary information from protein primary sequences. Specifically, each protein sequence was transformed into a PSSM, a two-dimensional matrix that can contain protein evolutionary information, by using the PSI-BLAST method. The GLCM algorithm is then employed to capture the valuable information from these PSSMs and form feature vectors of the proteins, after which the ADASYN technique is applied to balance the training data set to form new feature vectors used as the input of classifier from the obtained GLCM feature vectors. Finally, the weighted sparse representation based classification machine learning strategy is adopted to predict SIPs. Promising experimental results reveal that the constructed model is feasible and efficient when dealing with the classification task of interacting and non-interacting pairs of protein sequences, which achieves better prediction performances than other existing works on the same benchmark dataset. Thus, our work provides a useful tool for large-scale self-interacting protein prediction, which is beneficial for the detection of other bioinformatics tasks in the future.