Background

Proteins are diverse in shape and molecular weight, and these properties are relevant to their function and chemical bonds [1]. Accordingly, proteins are of various types according to their roles and applications [2]. Several factors lead to alterations in protein shape and loss of protein function, including temperature variations, pH, and chemical reactions [3]. According to the polypeptide structure, protein structure is categorized into four levels: primary, secondary, tertiary, and quaternary. Analysis of protein behavior from next-generation sequencing (NGS) data can be difficult, time-consuming, and of low accuracy, especially for non-homologous protein sequences. Therefore, deep learning algorithms are applied to handle huge datasets for computational protein design by predicting the probabilities of the 20 amino acids in a protein [4]. Because experimental biologists suffer from the limited availability of 3D protein structures, protein structure prediction is effectively used to define the 3D protein structure and provides further genetic information [5]. The prediction of the protein 3D structure from the amino acid sequence has several applications in biological processes such as drug design, discovery of protein function, and interpretation of mutations in structural genomics [6].

Protein folding is a thermodynamic process that creates a 3D structure through a minimum-energy conformation based on entropy [7]. The traditional methods for studying protein folding are discussed in detail in [8]. In contrast, computational approaches to protein folding focus on the prediction of protein stability, kinetics, and structure using Levinthal’s paradox, energy landscapes, or molecular dynamics [9]. The most common assignment algorithm is the Dictionary of Protein Secondary Structure (DSSP) [10], which is based on hydrogen-bond estimation. The DSSP algorithm assigns the protein secondary structure to eight groups: H (α-helix), E (β-strand), G (3₁₀-helix), I (π-helix), B (isolated β-bridge), T (turn), S (bend), and ‘–’ (rest). This assignment holds more information for a range of applications, but it is more complex for computational analysis.

Previously, Pauling et al. [11] presented a PSSP model for recognizing the polypeptide backbone by separating two regular states, the α-helix (H) and the β-strand (E). Poor PSSP performance has been attributed to training on large datasets, which leads to overfitting and to the classifier’s inability to generalize to unknown datasets [12].

Yuming et al. [13] applied a PSSP model using data partition and the semi-random subspace method (PSRSM), achieving an accuracy of about 85%. Generally, machine learning algorithms have been implemented for PSSP, but the achieved accuracy is still limited [14]. To improve the PSSP model, several algorithms have used neural networks (NNs) [15], K-nearest neighbors (KNNs) [16], and SVMs [17]. Additionally, deep learning algorithms such as deep conditional neural fields (CNF) [18], MUFOLD-SS [19], and SPINE-X [20] have achieved accuracies of 82–84%.

In addition, the output of an SVM has been employed as input features for a decision tree to extract the rules governing PSSP [21] with high accuracy. It was found that the accuracy of protein prediction depends on the gap between the rules derived from algorithms and the rules derived from biological meaning.

In this work, we extended a former technique [21] by using an SVM model to predict the protein secondary structure and a decision tree applied to the SVM output to derive the rules governing PSSP.

Methods

Data description

The proposed model used the 126 protein sequences of the RS126 set [22] for PSSP. The dataset contains 23,349 amino acids, comprising 32% α-helices, 23% β-strands, and 45% coils. The proposed model was implemented in MATLAB R2010a (version 7.10.0) on a Windows platform with an Intel Core i7-6700T @ 2.8 GHz.

Proposed model for PSSP

Figure 1 displays the proposed model for PSSP. The model consists of the following four steps (a minimal code sketch follows the list).

  • The first step converts the amino acid residues into binary vectors by orthogonal encoding.

  • In the second step, the dataset is divided into seven subsets and classified with the SVM classifier using seven-fold cross-validation.

  • In the third step, the prediction accuracy is computed, and the results with high accuracy are selected and passed as a training set to the decision tree.

  • In the fourth step, the rules produced by the decision tree are extracted and recorded.
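The following is a minimal sketch of steps 2–4 in Python using scikit-learn rather than the paper’s MATLAB implementation; the toy data, the accuracy threshold used to keep a fold, and the tree depth are illustrative assumptions, not values from the paper.

```python
# Sketch of steps 2-4, assuming X already holds the orthogonally encoded windows
# (step 1, see the encoding example below) and y the binary class labels.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(350, 240))        # toy stand-in for 20 x 12 encoded windows
y = (X[:, :5].sum(axis=1) >= 3).astype(int)    # toy H/~H labels with a learnable signal

kept_X, kept_y = [], []
for train, test in KFold(n_splits=7, shuffle=True, random_state=0).split(X):
    svm = SVC(kernel="rbf", C=4).fit(X[train], y[train])       # step 2: SVM per fold
    if svm.score(X[test], y[test]) >= 0.5:                     # step 3: keep accurate folds
        kept_X.append(X[test])
        kept_y.append(svm.predict(X[test]))

tree = DecisionTreeClassifier(max_depth=4)                     # step 4: derive rules
tree.fit(np.vstack(kept_X), np.hstack(kept_y))
print(export_text(tree))                                       # printable rule set
```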

Fig. 1 Architecture of the PSSP model

Orthogonal encoding

Orthogonal encoding was used to convert the amino acid residues into numerical values and to read the inputs of the sliding window. In this paper, a window of size 12 is adopted. In the sliding-window method, only the central amino acid is predicted, and binary encoding is used to assign numeric values to the amino acid characters; hence, there are 20 positions for the 20 amino acid types. For every window of size 12, the window comprises 12 input amino acids, and each amino acid is denoted by the value 1 at the position corresponding to its identity, with all other positions set to 0. In this case, the input pattern consists of 20 × 12 = 240 inputs, 12 of which are assigned the value 1 and the rest 0. An example of the sliding-window problem is shown in Fig. 2; suppose the input consists of the following protein sequence and secondary structure pattern. If the window size is 7 and the pattern NTDEPGA in Fig. 2 is taken as the training pattern, it is used to predict the central residue ‘E’; the next residue ‘P’ is predicted in the following window ‘TDEPGAC’. The window slides to the next residue until the end of the pattern. The orthogonal encoding of the pattern KLNTDEPGACPQACYA is shown in Table 1.
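The sliding-window orthogonal encoding can be sketched as follows; the window size of 7 matches the example of Fig. 2 (the model itself uses size 12), and the secondary structure labels attached to the example sequence are purely illustrative.

```python
# Minimal sketch of sliding-window orthogonal (one-hot) encoding.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def encode_window(window):
    """Return a 20*len(window) binary vector with one 1 per residue position."""
    vec = np.zeros(20 * len(window), dtype=int)
    for pos, aa in enumerate(window):
        vec[pos * 20 + AMINO_ACIDS.index(aa)] = 1
    return vec

def sliding_windows(sequence, structure, size=7):
    """Yield (encoded window, secondary-structure label of the central residue)."""
    half = size // 2
    for center in range(half, len(sequence) - half):
        window = sequence[center - half : center + half + 1]
        yield encode_window(window), structure[center]

seq = "KLNTDEPGACPQACYA"   # example sequence from the text
sec = "CCCHHHHHCCEEEECC"   # illustrative labels only, not from the paper
for x, label in sliding_windows(seq, sec):
    print(x.sum(), label)  # each vector contains exactly 7 ones
```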

Fig. 2 The sliding window problem

Table 1 Orthogonal encoding of 12 amino acids

In this work, the DSSP [10] model for secondary structure assignment is used because it is a frequently utilized and consistent technique for the PSSP approach. To lessen the complexity of assignment and training, the eight classes of DSSP were reduced to three classes [23]. The reduction of the eight classes to three classes is shown in Table 2.

Table 2 Conversion of eight secondary structures to three classes
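A commonly used reduction from the eight DSSP states to three classes is sketched below; the exact mapping adopted in [23] and Table 2 may differ in detail.

```python
# One common 8-to-3 state reduction (an assumption; check Table 2 for the paper's mapping):
# H, G, I -> helix (H); E, B -> strand (E); everything else -> coil (C).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3_10-, and pi-helix
    "E": "E", "B": "E",             # beta-strand and isolated beta-bridge
    "T": "C", "S": "C", "-": "C",   # turn, bend, and rest
}

def reduce_dssp(states):
    return "".join(DSSP_TO_3.get(s, "C") for s in states)

print(reduce_dssp("HHHGGEEBTTS-"))  # -> "HHHHHEEECCCC"
```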

SVM classifier

The SVM classifier [17] constructs a hyperplane that separates the orthogonally encoded protein dataset into various classes. Six categories, namely, (H/~H), (H/~E), (E/~E), (E/~C), (C/~C), and (H/~C), are used. For the SVM, the selection of the kernel function, the kernel parameter, and the cost parameter (C) is investigated to evaluate the classification accuracy. In this paper, the RBF kernel is used, and the kernel parameter γ is kept constant throughout the experiment, while C varies over the following values: 0.2, 0.4, 0.7, 0.9, 1, and 4, as in a previous study [24].
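A hedged sketch of this setup using scikit-learn (rather than the paper’s MATLAB code) is shown below; the toy data and the fixed kernel parameter value are assumptions for illustration only.

```python
# Training the six binary SVM categories with an RBF kernel while sweeping C.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def binary_labels(y, positive, negative=None):
    """Labels for a one-vs-rest (e.g. H/~H) or one-vs-one (e.g. H/~E) category."""
    if negative is None:
        return y == positive, np.ones(len(y), dtype=bool)
    mask = np.isin(y, [positive, negative])
    return y == positive, mask

categories = [("H", None), ("E", None), ("C", None), ("H", "E"), ("E", "C"), ("H", "C")]
C_values = [0.2, 0.4, 0.7, 0.9, 1, 4]

rng = np.random.default_rng(0)
X = rng.random((300, 140))                     # toy data: 300 windows of size 7
y = rng.choice(list("HEC"), size=300)          # toy three-state labels

for pos, neg in categories:
    labels, mask = binary_labels(y, pos, neg)
    for C in C_values:
        svm = SVC(kernel="rbf", gamma="scale", C=C)   # gamma held fixed, C varied
        acc = cross_val_score(svm, X[mask], labels[mask], cv=7).mean()
        print(f"{pos}/~{neg or pos}: C={C}, accuracy={acc:.2f}")
```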

Decision tree

A decision tree is composed of several nodes and leaves [25]. Each leaf represents one class corresponding to the target value, and the leaf node may carry the probability of the target label. A decision tree inducer takes a training set and creates a model that links an input instance to the target variable.

Let DT denote a decision tree inducer and DT(T) denote the classification tree created by applying DT to the training set T. The prediction of the target variable for an input instance x is denoted DT(T)(x).

A classifier created by a decision tree inducer can classify an unknown instance in one of two ways: by allocating it to a specific class or by supplying the probability of the given input belonging to each class. The conditional probability in the decision tree is estimated by \( {\hat{P}}_{DT(T)}\left(y|a\right) \) (the probability of the class variable given an input instance). In a decision tree, this probability is evaluated for each leaf node separately by counting the occurrences of each class among the training samples that reach that leaf.

When a particular class never appears in a specific leaf node, we may end with a zero probability. However, we can avoid such a case by using Laplace rectification.

Laplace’s law estimates the probability of the event j = xi, where j is a random variable and xi is a possible outcome of j that has been observed ni times out of n observations. It is given by \( \frac{n_i+ wp}{n+w} \), where p is the prior probability of the event and w is the sample size that reflects the weight of the prior estimate relative to the observed data. In addition, w is described as the equivalent sample size because it augments the n actual observations with w virtual samples distributed according to p. Under these assumptions, the estimate can be rewritten as the following chain of equalities, where pp = ni/n is the empirical (posterior) estimate, n1 = n/(n + w), and n2 = w/(n + w):

$$ \frac{n_i+w\cdot p}{n+w} $$
$$ \frac{n_i}{n}\cdot \frac{n}{n+w}+p\cdot \frac{w}{n+w} $$
$$ {p}_p\cdot \frac{n}{n+w}+p\cdot \frac{w}{n+w} $$
$$ {p}_p\cdot {n}_1+p\cdot {n}_2 $$

In this case, we used the following expression:

$$ {P}_{laplace}\left({a}_i|y\right)=\frac{\mid T\mid +w\cdot p}{\mid T\mid +w} $$

By utilizing this expression, the values of p and w are chosen.
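A small sketch of the Laplace-corrected leaf probability, assuming a uniform prior p = 1/k over the k classes and an equivalent sample size w = 1 (the paper states that p and w are chosen but does not give their values):

```python
# Laplace-corrected class probability at a decision-tree leaf.
from collections import Counter

def laplace_leaf_probability(leaf_labels, target, classes, w=1.0):
    """P(target | leaf) = (n_i + w*p) / (n + w), avoiding zero probabilities."""
    counts = Counter(leaf_labels)
    n = len(leaf_labels)
    n_i = counts[target]
    p = 1.0 / len(classes)            # assumed uniform prior over the class variable
    return (n_i + w * p) / (n + w)

leaf = ["H", "H", "H", "E"]           # training samples reaching one leaf
for c in "HEC":
    print(c, laplace_leaf_probability(leaf, c, classes="HEC"))
# 'C' never occurs in this leaf, yet its probability stays above zero.
```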

Rules’ confidence

To define the confidence of the rules, we must establish the probability distribution that governs the accuracy calculation. The classification task is modeled as a binomial experiment.

Suppose the test set consists of N records, X is the number of records correctly predicted by the system, and p is the true accuracy of the system. Modeling the empirical accuracy acc = X/N as following a normal distribution with mean p and variance p(1 − p)/N, the confidence interval for the rules’ confidence can be derived from:

$$ P\left(-{Z}_{\alpha /2}\le \frac{acc-p}{\sqrt{p\left(1-p\right)/N}}\le {Z}_{1-\alpha /2}\right)=1-\alpha $$

where −Zα/2 and Z1 − α/2 are the lower and upper bounds obtained from a standard normal distribution at a confidence level of (1 − α).
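Solving the inequality above for p yields the familiar normal-approximation (Wilson-style) bounds on a rule’s accuracy; the sketch below assumes a 95% confidence level (z = 1.96).

```python
# Normal-approximation confidence interval for a rule's accuracy (acc = X/N).
import math

def rule_confidence_interval(X, N, z=1.96):   # z = Z_{1-alpha/2} for a 95% interval
    acc = X / N
    center = (acc + z * z / (2 * N)) / (1 + z * z / N)
    half = (z / (1 + z * z / N)) * math.sqrt(acc * (1 - acc) / N + z * z / (4 * N * N))
    return center - half, center + half

print(rule_confidence_interval(90, 100))      # rule correct on 90 of 100 test records
```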

Results

Evaluation criteria for secondary structure prediction

To find the optimal rules governing PSSP by the decision tree, the Q3 accuracy measure is used to estimate the fraction of correctly predicted secondary structure elements of the protein sequence.

$$ {Q}_3=\frac{\sum_{i\in \left\{\mathrm{H},\mathrm{E},\mathrm{C}\right\}}\text{number of correctly predicted residues in state } i}{\sum_{i\in \left\{\mathrm{H},\mathrm{E},\mathrm{C}\right\}}\text{number of observed residues in state } i} $$
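A minimal sketch of the Q3 computation over three-state labels:

```python
# Q3: fraction of residues whose three-state label (H, E, or C) is predicted correctly.
def q3(observed, predicted):
    assert len(observed) == len(predicted)
    correct = sum(o == p for o, p in zip(observed, predicted) if o in "HEC")
    total = sum(o in "HEC" for o in observed)
    return 100.0 * correct / total

print(q3("HHHEEECCC", "HHHEECCCC"))   # 8 of 9 residues correct -> 88.9
```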

Performance of SVM

The results of the experiment are summarized in Table 3, which compares the accuracy obtained by the proposed model with PSSP based on NMR chemical shifts with SVM (PSSP_SVM) [26], PSSP based on the codon encoding (CE) scheme with SVM (PSSP_SVMCE) [23], and PSSP based on the compound pyramid (CP) model with SVM (PSSP_SVMCP) [27].

Table 3 Accuracy comparison of various algorithms for protein secondary structure prediction

From Table 3, the accuracy varies significantly among the classifiers. In the proposed model, the prediction accuracy is in the range of 63–85%. The best prediction accuracy is recorded for the E/~E classifier, and the lowest prediction accuracy is recorded for the C/~C classifier. The PSSP_SVMCP method [27] records better prediction accuracy than the proposed model for the H/~H, C/~C, H/~E, and H/~C classifiers.

The proposed model achieved better prediction accuracy than previous models such as PSSP_SVM [26] and PSSP_SVMCE [23]. In contrast, the PSSP_SVMCP [27] model achieved better prediction accuracy than the proposed model.

During the experiment, the general observation is that the accuracy of the classifier increases as C increases. The best accuracy is obtained at C = 4, and the lowest accuracy is obtained at the smallest C value tested.

Performance of decision tree

Figure 3 displays the decision tree trained on the dataset extracted from the SVM algorithm. Tables 4, 5, and 6 show some of the rules produced by the decision tree for three different categories (H/~H, E/~E, and C/~C). The x variables specify the column numbers, the compared values denote the column’s data, and the nodes specify the nodes of the tree. Figures 4, 5, and 6 show the percentage of prediction accuracy of the proposed rules, with bold symbols referring to the amino acid patterns that create a particular protein secondary structure type.
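Because each window position occupies 20 columns of the orthogonal encoding, a rule condition such as a hypothetical “x137 ≤ 0.5” can be mapped back to a (window position, amino acid) pair; the sketch below assumes 0-based column numbering, which may differ from the numbering used in the tables.

```python
# Map an orthogonal-encoding column number back to (window position, amino acid).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def decode_column(column, window_size=12):
    position = column // 20           # position of the residue within the window
    residue = AMINO_ACIDS[column % 20]
    assert position < window_size
    return position, residue

print(decode_column(137))             # column 137 -> (position 6, residue 'V')
```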

Fig. 3 Screenshot of the decision tree

Table 4 Rules produced by the decision tree for the H/~H classifier
Table 5 Rules produced by the decision tree for the E/~E classifier
Table 6 Rules produced by the decision tree for the C/~C classifier
Fig. 4 Rules extracted for the PSSP model using the location of the α-helix

Fig. 5 Rules extracted for the PSSP model using the location of the β-strand

Fig. 6 Rules extracted for the PSSP model using the location of the coil structure

Discussions

Initially, it was noted that interactions between hydrophobic side chains can lead to α-helix formation [28]. In Fig. 4, the prediction of the α-helix is based on four rules according to four patterns, namely, IKLW, IKLC, YACD, and YVM. In rule 1, the IKLW pattern achieved 100% accuracy for α-helix prediction because isoleucine I, lysine K, leucine L, and tryptophan W appear at the first, second, third, and fourth locations, respectively. Both amino acids I and W are hydrophobic, and their presence at positions i and i + 3 indicates helix formation [29]. In rule 2, the IKLC pattern confirmed that I and C are hydrophobic and indicate helix stabilization [29]. In rule 3, the YACD pattern achieved 100% accuracy for α-helix prediction. In rule 4, both amino acids Y and M are hydrophobic, and their occurrence at two locations along the sequence leads to α-helix construction. Valine V has a low rate of helix occurrence [28].

In Fig. 5, the prediction of the β-strand is based on seven rules according to seven patterns, namely, HIKLW, RTWYC, CGNPPR, DHQWHE, CGCSA, HCTW, and VWCD. In rule 1, the HIKLW pattern achieved 100% accuracy for β-strand prediction because histidine H, isoleucine I, lysine K, leucine L, and tryptophan W appear at the first, second, third, fourth, and fifth locations, respectively. In rule 2, the RTWYC pattern achieved 79% accuracy because arginine R, threonine T, tryptophan W, tyrosine Y, and cysteine C appear at the first, second, third, fourth, and fifth locations, respectively. The amino acids T, R, and D act as N-terminal β-breakers, while S and G act as C-terminal β-breakers [30]. Additionally, the patterns CGNPPR, CGCSA, and HCTW achieved 100% accuracy for β-strand prediction.

The stabilization of protein structure and protein regulation is related to the appearance of specific amino acids in the loop structure. Proline P and glycine G are considered the most important amino acids in the loop structure. High loop propensities are observed for residues nearest to proline P [30]. On the other hand, low loop propensities are observed when cysteine C, isoleucine I, leucine L, tryptophan W, and valine V are present [31].

In Fig. 6, the prediction of the coil structure is based on seven rules according to seven patterns, namely, EFG, PEH, RYGSVY, TMPA, DTMPV, PTE, and LRKL. In rules 2 and 3, the occurrence of the coil structure reflects high loop propensities due to the presence of the amino acids P and G. In rule 1, the EFG pattern achieved 90% accuracy for coil prediction; this confirmed that E is hydrophilic, while F and G are hydrophobic amino acids. In rule 2, the PEH pattern achieved 100% accuracy. In rule 4, it was confirmed that T is hydrophilic, while M, P, and A are hydrophobic amino acids. In rule 6, the PTE pattern achieved 67% accuracy owing to the occurrence of proline P with threonine T and glutamic acid E along the sequence. In rule 7, the LRKL pattern achieved 100% accuracy owing to the appearance of arginine R together with lysine K and leucine L along the sequence.

For comparative analysis, a recent algorithm [32] based on convolutional, residual, and recurrent neural networks (CRRNN) showed 71.4% accuracy for DSSP, which indicates that our algorithm is more accurate than that in [32]. On the other hand, the quality of protein structure prediction can be affected by poor alignments, protein misfolding, low similarity rates between known sequences, evolutionary assumptions, and machine learning performance [33].

For the results analysis, instead of considering only the three binary classifiers (H/~H), (E/~E), and (C/~C) for PSSP as in [21], we compared the proposed algorithm with previous studies based on six classifiers, (H/~H), (H/~E), (E/~E), (E/~C), (C/~C), and (H/~C), as in Table 3, and predicted the residue identity of each position one by one. It was also found that the PSSP_SVMCP model showed superior accuracy to the proposed model for the H/~H, C/~C, H/~E, and H/~C classifiers.

Conclusions

The goal of this paper is to predict the secondary structure of the 126 protein sequences in the RS126 dataset using an SVM classifier and a decision tree. The proposed model presents a framework of PSSP for the appearance of α-helix, β-strand, and coil structures. The experimental results coincide with the work of Kallenbach in that the presence of isoleucine I and tryptophan W at positions i and i + 3 along the sequence proved to be helix stabilizing. In a β-strand, the presence of arginine R and lysine K proved to be indicative of β-strand formation. In the coil structure, it is known that proline P and glycine G are the most significant amino acids, which concurs with our findings. The proposed model benefits the protein analysis domain with a correct prediction for unknown sequences. In future work, we will extend the proposed algorithm to other protein datasets to produce an effective comparative analysis in the PSSP scheme.