Background

It has been shown that protein sequences contain rich information for tertiary structure prediction as well as functional study [1, 2]. However, predicting tertiary structure directly from the primary sequence is challenging, so a hierarchical approach has been widely adopted as one of the most efficient strategies: the ultimate goal is decomposed into several sub-problems, such as secondary structure prediction, solvent accessibility prediction and residue-residue contact prediction. Progress in predicting such intermediate states, or one-dimensional properties, has been reviewed in [3]. Predicted secondary structure has proved useful for the prediction of disordered and flexible regions, for fold recognition and for function prediction. However, secondary structure states are discrete classes, and there is no clear boundary between coil and helical/strand states. Predicting the local conformation of the polypeptide chain is a significant step towards establishing the structure and function of a protein: local structural bias restricts the possible conformations of a sequence segment and therefore narrows down the conformational space of the whole chain considerably. Prediction of dihedral angles is thus especially useful for protein tertiary structure prediction.

On the whole, dihedral angle prediction may benefit protein tertiary structure prediction in several respects. Firstly, it may act as a substitute for, or supplement to, secondary structure prediction [4–6]. Secondly, it can be used to generate sequence/structure alignments: it can be applied directly in structure alignment methods based on dihedral angles [7, 8] and may aid refinement of target-template structure alignments, while using predicted angles to refine multiple sequence alignments may narrow the gap between sequence and structure alignment and thereby aid de novo prediction of structural properties. In addition, dihedral angle prediction finds applications in protein structure prediction that include, but are not limited to, fold recognition [9, 10], fragment-free tertiary structure prediction [11], tertiary structure refinement and structure quality assessment [12], as well as in functional studies such as ligand-binding site prediction [13].

There are mainly two kinds of problems in dihedral angle prediction, angle region prediction and real-value prediction, which correspond to two different representations of protein backbone local structural bias.

The Ramachandran basin is an intuitive description of local structural bias [14]. A basin is a specific region of the Ramachandran plot and illustrates the preference for particular torsion angle values; each angle pair can be assigned a basin label. With more basins the assignment becomes harder but the representation more accurate, and vice versa. Colubri et al. tested the ability to recover the native structure from a given per-residue basin assignment in order to investigate the level of representation required to simulate folding and predict structure, settling on five basins [15]. Gong et al. partitioned the ϕ,ψ-space into a uniform grid of 36 squares, each 60° × 60°, thus obtaining 36 basins, and showed that six proteins could be successfully reconstructed solely from their mesostate (basin label) sequences [16]. Other methods define basins differently and perform angle region prediction accordingly [17–20]. Although it is vital to determine a proper number of regions and to define their boundaries clearly, a universal algorithm for generating Ramachandran basins and assigning basin labels remains to be developed. In our study, k-means clustering serves as the basin generator and label assigner.

While the Ramachandran basin provides an overall description of conformation, it is a coarse-grained representation and lacks a statistical description of the torsion angle distribution within each basin. Because of the circular nature of angles, traditional parametric or non-parametric density estimation methods cannot properly approximate Ramachandran distributions. Directional distributions such as the von Mises distribution solve this problem [21]. Mixtures of bivariate von Mises distributions have been used to model protein dihedral angle distributions [22, 23], which removes the arbitrariness in defining boundaries between discrete states. In this study, we assume that the angle pairs in each basin follow a bivariate von Mises distribution and use this to derive the log-likelihood of each clustering.

Thanks to the rapid growth of the Protein Data Bank and to computational and algorithmic developments in machine learning (especially deep learning), several supervised methods have been proposed to predict real values of dihedral angles. As ϕ values in α-helices and β-sheets are quite similar, ψ is the more informative angle. Wood et al. first developed DESTRUCT to predict real-valued ψ and used this information to predict protein secondary structure with high accuracy [4]. Wu et al. proposed a composite machine-learning algorithm called ANGLOR to predict real-valued protein backbone torsion angles from protein sequences [24]; its input features include sequence profiles, predicted secondary structure and solvent accessibility, and the reported mean absolute error (MAE) of the ϕ/ψ prediction was 28°/46°. Later, Song et al. developed TANGLE, based on a two-level support vector regression approach using a variety of features derived from amino acid sequences, including evolutionary profiles and natively disordered regions as well as other global sequence features [25]; its ϕ/ψ MAE was 27.8°/44.6°. Xue et al. established a neural network method called Real-SPINE, with sequence profiles generated from multiple sequence alignments and predicted secondary structures as inputs [26]. In 2015, the same group presented SPIDER2 [27], improving SPIDER [28] through iterative learning with a deep artificial neural network (ANN) of three hidden layers of 150 nodes each; the predicted torsion angles from one iteration were fed as input to the next, and MAEs of 19° and 30° were reported for the backbone ϕ and ψ angles, respectively. As it is impossible to introduce all methods here, interested readers may refer to excellent reviews [29, 30].

Despite this tremendous development, performance is still limited by the shallow architectures employed. Inspired by the excellent performance of convolutional neural networks in predicting secondary structure [31] and order/disorder regions [32], and by the success of the residual framework in contact prediction [33], we adopt an ultra-deep residual convolutional neural network to predict the probability of each k-means basin label.

However, even though a protein backbone conformation can be rebuilt with high accuracy from its native dihedral angles, the accumulation of errors in predicted angles can lead to large deviations in the three-dimensional structure, which prevents the direct use of angle prediction in building protein structures [27]. It is therefore important to produce confidence scores for the real-value predictions; without knowing the confidence level of the predictions, the usefulness of predicted dihedral angles as restraints for three-dimensional structure prediction would be limited [34]. Zhou et al. developed SPIDER2 [27] to predict real-valued angles and then, separately, SPIDER2-Delta [35] to predict the errors of those predicted structural properties. Here we describe a simple hybrid technique to predict angles and confidence scores simultaneously.

Another problem that needs to be considered is the periodicity of angles. For example, if an angle θ = 179° is predicted to be −179°, the error would be treated as 358° instead of 2°. Several approaches have been proposed to reduce the impact of this cyclic nature. One is angle shifting to reduce confusion at 0° and 360° (or −180° and 180°), e.g., shifting ψ by 100° and ϕ by −10° [26], or adding 100° to angles between −100° and 180° and 460° to angles between −180° and −100° [34]. The improvement was limited, however, and strongly depended on the angle range: for amino acids such as alanine with few residues in the affected range, angle shifting made little difference [29]. A better choice is to exploit the inherent periodicity of trigonometric functions by mapping the angles to their sine and cosine values [27], which has achieved the best performance so far. Inspired by this, we work with equivalent trigonometric representations of dihedral angle pairs rather than with real-valued angles.

Since dihedral angles share similar patterns within α-helices and within β-strands, the acceptable (ϕ,ψ) patterns are limited; moreover, classification is much easier than regression. Drawing also on mixture models and the Expectation-Maximization algorithm, we develop a hybrid method combining k-means clustering and deep learning for angle prediction, uniting the advantages of discrete and continuous representations of dihedral angles. Specifically, we first generate a set of (ϕ,ψ) clusters from training data, from which we obtain the distribution of each cluster; we then use deep learning to predict the discrete labels; finally, we predict real-valued angles by mixing the empirical clusters with their predicted probabilities. In RaptorX-Angle, a residual convolutional neural network predicts the cluster label probabilities. We test our method on a filtered PDB25 dataset as well as on CASP (Critical Assessment of protein Structure Prediction) targets and compare it with three other state-of-the-art methods. Tested on a subset of PDB25, our method achieves an MAE about 0.5° and 1.4° better for ϕ and ψ, respectively, than SPIDER2, currently among the best backbone angle predictors. Our method also outperforms SPIDER2 on the CASP11 and CASP12 test targets, and the advantage is even more pronounced within specific secondary structural regions.

Methods

K-means clustering of angle vectors

Generating k-means “centers” from angle vectors

For a dihedral angle pair (ϕ,ψ), we can equivalently denote it by an angle vector

$$\mathbf{v}=\left(\cos(\phi), \sin(\phi), \cos(\psi), \sin(\psi)\right). $$

Conversely, given the vector representation v, we can easily recover the corresponding angles ϕ and ψ (Additional file 1: S1.1). We run k-means on the angle vectors to cluster the dihedral angle pairs in the training set into K=10,20,…,100 clusters. We then normalize the K centers \(\left \{\mathbf {C}_{k}\right \}_{k=1}^{K}\) to obtain the final “centers” \(\left \{\widetilde {\mathbf {C}}_{k} =(\widetilde {c}_{k0}, \widetilde {c}_{k1}, \widetilde {c}_{k2}, \widetilde {c}_{k3})\right \}_{k=1}^{K}\), so that each “center” \(\widetilde {\mathbf {C}}_{k}\) is a valid representation of some angle pair (Additional file 1: S1.2).
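As a concrete illustration, the clustering step might look like the following Python sketch. The per-pair renormalization of the centers is our reading of Additional file 1: S1.2, and the data here are toy values standing in for the training set:

```python
import numpy as np
from sklearn.cluster import KMeans

def angles_to_vectors(phi, psi):
    """Map dihedral angle pairs (degrees) to 4-D angle vectors."""
    phi, psi = np.radians(phi), np.radians(psi)
    return np.stack([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)], axis=1)

def normalize_centers(centers):
    """Project raw k-means centers back onto valid angle vectors by
    rescaling each (cos, sin) pair to unit length (our reading of S1.2)."""
    c_phi = centers[:, :2] / np.linalg.norm(centers[:, :2], axis=1, keepdims=True)
    c_psi = centers[:, 2:] / np.linalg.norm(centers[:, 2:], axis=1, keepdims=True)
    return np.hstack([c_phi, c_psi])

# toy data standing in for the training-set dihedral angles (degrees)
rng = np.random.default_rng(0)
phi = rng.uniform(-180, 180, size=1000)
psi = rng.uniform(-180, 180, size=1000)

# K = 20 is the value finally selected in the paper
V = angles_to_vectors(phi, psi)
km = KMeans(n_clusters=20, n_init=10).fit(V)
centers = normalize_centers(km.cluster_centers_)
```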

Predicting “true” labels from k-means

Given the K normalized vector “centers” \(\left \{\widetilde {\mathbf {C}}_{k}\right \}_{k=1}^{K}\), we assign each dihedral angle pair the “true” label whose corresponding normalized center is closest to its vector representation. These “true” labels then serve as training labels for a deep learning classifier that predicts labels for the test data.
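A minimal sketch of this assignment, reusing V and centers from the previous snippet:

```python
import numpy as np

def assign_labels(V, centers):
    """Assign each angle vector the label of its nearest normalized center."""
    d = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
    return d.argmin(axis=1)

labels = assign_labels(V, centers)  # training targets for the classifier
```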

Deep learning model details

Deep Convolutional Neural Network (DCNN)

A DCNN consists of multiple convolutional blocks, each implementing a composite of linear convolution and a nonlinear activation transformation. Convolution is used in place of general matrix multiplication, which better captures local dependency. Since it is widely accepted that protein torsion angles depend strongly on neighbouring residues [36–38], a DCNN is well suited to extracting angle information from sequence.

Residual Network (ResNet)

A DCNN can integrate features at hierarchical levels, and previous work has shown the significance of depth [39]. However, as depth increases, accuracy saturates and may even degrade, because adding more layers can raise the training error: an identity mapping is difficult to fit with a stack of nonlinear layers [40]. ResNet was proposed as a residual learning framework to ease the training of substantially deeper networks [41]. Figure 1 shows the basic architecture of ResNet in RaptorX-Angle. Figure 1a depicts a residual block, which consists of two convolution layers and two activation layers; the ResNet consists of stacked residual blocks (Fig. 1b). An activation layer applies a simple nonlinear transformation to its input, determined by the activation function, with no additional parameters. In this work we use the ReLU activation function [42].

Fig. 1

Illustration of the ResNet model in RaptorX-Angle. a A building block of ResNet with \(x_{i}\) and \(x_{i+1}\) as input and output, respectively. b The ResNet model architecture as a classifier, with stacked residual blocks and a logistic regression layer. Here L is the sequence length of the protein (the total number of residues under prediction) and K is the number of clusters

Logistic regression layer

The DCNN/ResNet layers capture information from the data and output abstract features. To classify residues, a logistic regression layer is added as the final layer of RaptorX-Angle; it outputs the marginal probability of each of the K labels (Fig. 1b).
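The original model was implemented in Theano; the following PyTorch-style sketch shows the overall architecture of Fig. 1 under our assumptions (a kernel size of 7 corresponding to halfWinSize = 3, and the placement of activations inside the block):

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Residual block in the spirit of Fig. 1a: two 1-D convolutions with
    ReLU activations plus an identity shortcut."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # identity shortcut

class AngleClassifier(nn.Module):
    """Stacked residual blocks followed by a per-residue logistic
    regression (softmax) layer over K cluster labels (Fig. 1b)."""
    def __init__(self, in_features=66, channels=100, n_blocks=5, K=20):
        super().__init__()
        self.embed = nn.Conv1d(in_features, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[ResBlock1D(channels) for _ in range(n_blocks)])
        self.head = nn.Conv1d(channels, K, kernel_size=1)

    def forward(self, x):                 # x: (batch, 66, L)
        h = self.blocks(self.embed(x))
        return torch.softmax(self.head(h), dim=1)  # (batch, K, L) marginals
```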

Loss function

We train the model parameters by maximizing the probability that the angle pairs belong to their “true” labels. The loss function is accordingly defined as the negative log-likelihood averaged over all residues of the training proteins.

Regularization and optimization

As is common in machine learning, the log-likelihood objective is penalized with the L2-norm of the model parameters to prevent overfitting. The final objective function therefore has two terms, the log-likelihood and the regularization term, balanced by a regularization factor λ:

$$ \max_{\theta}\quad\log{P_{\theta}(Y|X)}-\lambda\|\theta\|^{2} $$

where X denotes the input features, Y the output labels, θ the model parameters, and λ the regularization factor balancing the log-likelihood and the regularization term. We use Adam [43] to optimize this objective (equivalently, to minimize its negative), which usually converges within 20 epochs. The whole algorithm is implemented in Theano [44] and runs mainly on a GPU card.
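A hedged sketch of one training step under this objective, using the illustrative AngleClassifier above; Adam's weight_decay approximates the L2 penalty λ‖θ‖²:

```python
import torch
import torch.nn.functional as F

model = AngleClassifier()  # illustrative model from the previous section
# L2 regularization enters through Adam's weight_decay (lambda = 1e-4 as in the paper)
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)

def training_step(x, labels):
    """x: (batch, 66, L) features; labels: (batch, L) 'true' cluster indices.
    Minimizes the negative log-likelihood averaged over residues."""
    optimizer.zero_grad()
    # bypass the softmax in forward(): cross_entropy expects raw scores
    logits = model.head(model.blocks(model.embed(x)))
    loss = F.cross_entropy(logits, labels)  # mean NLL over all residues
    loss.backward()
    optimizer.step()
    return loss.item()
```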

Input features

For each residue in each protein sequence, we generate a total of 66 input features: 20 from the position-specific scoring matrix (PSSM) of PSI-BLAST [45], 20 from the position-specific frequency matrix (PSFM) of HHpred [46, 47], 20 from the primary sequence, 3 from predicted solvent accessibility (ACC) and 3 from predicted secondary structure (SS) probabilities (Additional file 1: S1.3).

Predicting real-value angles from predicted marginal probability

From the last logistic regression layer of the deep learning model, we obtain the predicted marginal probability \(\mathbf{P}=(p_{1},p_{2},\ldots,p_{K})\) of an angle pair for each label. We use the marginal probability rather than the single best predicted label to reduce bias. Concretely, we calculate the weighted mean by:

$$\widehat{\mathbf{v}}=(v_{0},v_{1},v_{2},v_{3})=\sum_{k=1}^{K}p_{k}\widetilde{\mathbf{C_{k}}}, $$

Finally, we normalize \(\widehat {\mathbf {v}}\) to get

$$\widehat{\cos(\phi)}=\frac{v_{0}}{\sqrt{{v_{0}}^{2} + {v_{1}}^{2} }}, \widehat{\sin(\phi)}=\frac{v_{1}}{\sqrt{{v_{0}}^{2} + {v_{1}}^{2} }}, $$
$$\widehat{\cos(\psi)}=\frac{v_{2}}{\sqrt{{v_{2}}^{2} + {v_{3}}^{2} }}, \widehat{\sin(\psi)}=\frac{v_{3}}{\sqrt{{v_{2}}^{2} + {v_{3}}^{2} }}. $$

and we derive the predicted real values \(\widehat {\phi },\widehat {\psi }\) from this angle vector (Additional file 1: S1.1). We also tried predicting real-valued angles from only the labels with the top R (R < K) probabilities for a well-chosen K (Additional file 1: S2.3).
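In code, the recovery of real-valued angles from the predicted marginals might look like the sketch below; note that arctan2 is scale-invariant, so the explicit normalization above cancels:

```python
import numpy as np

def predict_angles(P, centers):
    """P: (N, K) predicted marginal probabilities; centers: (K, 4) normalized
    cluster centers. Returns predicted (phi, psi) in degrees."""
    v = P @ centers                                  # probability-weighted mean vector
    phi = np.degrees(np.arctan2(v[:, 1], v[:, 0]))   # atan2(sin(phi), cos(phi))
    psi = np.degrees(np.arctan2(v[:, 3], v[:, 2]))   # atan2(sin(psi), cos(psi))
    return phi, psi
```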

Programs to compare and evaluation metrics

We compare our method with three available standalone programs: SPIDER2 [27], SPINE X [11] and ANGLOR [24]. All programs are run with the parameters suggested in their respective papers.

We evaluate performance by the Pearson correlation coefficient (PCC) and the mean absolute error (MAE), as described in [48], for assessing the prediction of ϕ/ψ angles. To account for the periodicity of angles, the PCC is calculated between the cosine (sine) values of the predicted and experimentally determined angles. The MAE is the average absolute difference between predicted and experimentally determined angles; periodicity is handled by averaging the smaller of the absolute difference \(d=|\theta_{pred}-\theta_{exp}|\) and \(360^{\circ}-d\), where \(\theta_{pred}\) is the predicted angle and \(\theta_{exp}\) the true value.
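A minimal sketch of these two metrics (angles in degrees):

```python
import numpy as np

def periodic_mae(pred, true):
    """MAE with wrap-around: use min(d, 360 - d) for each absolute difference."""
    d = np.abs(pred - true) % 360
    return np.mean(np.minimum(d, 360 - d))

def cosine_pcc(pred, true):
    """Pearson correlation between cosines of predicted and true angles."""
    return np.corrcoef(np.cos(np.radians(pred)), np.cos(np.radians(true)))[0, 1]
```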

Results

Datasets

We use targets from PDB25 as updated in February 2016. The set consists of 10820 non-redundant protein chains, any two of which share no more than 25% sequence identity. To remove the impact of disordered regions, we filter out proteins with internal disordered regions using DSSP [49], leaving 7604 proteins. We then randomly select 5070 proteins as the candidate training set, 1267 as the validation set (VL1267, see Additional file 2) and the remaining 1267 as the test set (TS1267, see Additional file 3). We also test on 85 CASP11 targets (see Additional file 4) and the latest 40 CASP12 targets (see Additional file 5) with publicly released native structures. To remove redundancy between training proteins and CASP targets, we filter the 5070 candidate training proteins with MMseqs2 [50], a protein sequence homology search tool similar to but more sensitive and faster than (PSI-)BLAST, using a sequence identity cutoff of 0.25 and an E-value cutoff of 0.001, resulting in 5046 training proteins (TR5046, see Additional file 6).

Choosing a proper number of clusters

A vital question is how to select the number of clusters, which reduces to defining measures for clustering evaluation. We adopt two measures, (i) entropy loss, based on the discrete distribution, and (ii) log-likelihood, based on the continuous distribution, to evaluate 10 different clusterings (K=10,20,…,100). First, we run k-means clustering on TR5046 to obtain K empirical clusters. Second, we train the deep learning models and classify VL1267, obtaining the predicted marginal probabilities over the K clusters, \(\mathbf{P}_{i}=(p_{i1},p_{i2},\ldots,p_{iK})\), i=1,2,…,N, where i indexes residues and N is the total number of residues in VL1267.

Entropy loss

Entropy H(·) is a standard measure of the information content of a distribution. From the k-means clustering on TR5046, the background distribution over clusters \(\mathbf{P}_{0}=(p_{01},p_{02},\ldots,p_{0K})\) can be derived. The entropy loss of a clustering on VL1267 is then the mean difference between the entropy of the background distribution and that of the predicted marginal distributions:

$$\begin{array}{@{}rcl@{}} EL & = & \frac{1}{N}\sum_{i=1}^{N}\left(H\left(\mathbf{P}_{0}\right)-H\left(\mathbf{P}_{i}\right)\right)\\ & = & \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k=1}^{K}p_{ik}\log\left(p_{ik}\right)-\sum_{k=1}^{K}p_{0k}\log\left(p_{0k}\right)\right) \end{array} $$

which roughly evaluates the information gain from the clustering. Here N is the number of residues in VL1267.
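A sketch of this computation, with P and \(\mathbf{P}_{0}\) as defined above:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy along the last axis; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def entropy_loss(P, p0):
    """P: (N, K) predicted marginals on VL1267; p0: (K,) background cluster
    frequencies from TR5046. Mean entropy reduction relative to background."""
    return np.mean(entropy(p0) - entropy(P))
```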

Loglikelihood

To describe the detailed information of each cluster, we need a continuous angular (circular) distribution defined on the torus. Mixtures of bivariate von Mises distributions have been used successfully to describe the local bias of the torsion angle pair (ϕ,ψ) [21–23]. We therefore assume that the angle pairs belonging to cluster k follow a common bivariate von Mises distribution f_k with parameters \(\Theta _{k}=\left (\kappa _{1}^{k}, \kappa _{2}^{k}, \kappa _{3}^{k}, \mu ^{k}, \nu ^{k}\right)\). Here,

$$\begin{array}{@{}rcl@{}} f_{k}\left(\phi, \psi \right) & = & c\left(\kappa_{1}^{k}, \kappa_{2}^{k}, \kappa_{3}^{k}\right)^{-1}\exp \left\{ \kappa_{1}^{k}\cos\left(\phi-\mu^{k}\right)\right.\\ & & \left.+\, \kappa_{2}^{k}\cos\left(\psi-\nu^{k}\right) + \kappa_{3}^{k}\cos\left(\phi-\mu^{k}-\psi+\nu^{k}\right) \right\} \end{array} $$

where \(\mu^{k}\) and \(\nu^{k}\) are the means of ϕ and ψ, respectively; \(\kappa _{1}^{k}\) and \(\kappa _{2}^{k}\) are the concentrations; \(\kappa _{3}^{k}\) captures the dependency between the two angles; and \(c\left (\kappa _{1}^{k}, \kappa _{2}^{k}, \kappa _{3}^{k}\right)\) is the normalization constant:

$$c\left(\kappa_{1}^{k}, \kappa_{2}^{k}, \kappa_{3}^{k}\right) = \left(2\pi\right)^{2}\left\{ I_{0}\left(\kappa_{1}^{k}\right)I_{0}\left(\kappa_{2}^{k}\right)I_{0}\left(\kappa_{3}^{k}\right) + 2\sum_{p=1}^{\infty}I_{p}\left(\kappa_{1}^{k}\right)I_{p}\left(\kappa_{2}^{k}\right)I_{p}\left(\kappa_{3}^{k}\right) \right\} $$

in which \(I_{p}(\kappa)\) is the modified Bessel function of the first kind and order p. The parameters \(\left \{\Theta _{k}=\left (\kappa _{1}^{k}, \kappa _{2}^{k}, \kappa _{3}^{k}, \mu ^{k}, \nu ^{k}\right)\right \}_{k=1}^{K}\) can be estimated directly from the empirical clusters \(\left \{(\phi, \psi)_{k}\right \}_{k=1}^{K}\) [51]. The density function for the torsion angle pair (ϕ,ψ) can then be approximated as:

$$f(\phi, \psi)=\sum_{k=1}^{K} p_{k}f_{k}(\phi,\psi) $$

where \(p_{k}\) is the predicted marginal probability that (ϕ,ψ) belongs to cluster k. The log-likelihood for VL1267 can then be calculated as:

$$LL=\frac{1}{N}\sum_{i=1}^{N}\log{f(\phi_{i}, \psi_{i})}=\frac{1}{N}\sum_{i=1}^{N}\log{\sum_{k=1}^{K} p_{ik}f_{k}(\phi_{i},\psi_{i})} $$
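A sketch of the density and the mixture log-likelihood, truncating the infinite Bessel series at p_max (scipy.special.iv; angles in radians; parameter estimation itself is not shown):

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def bvm_log_density(phi, psi, kappa1, kappa2, kappa3, mu, nu, p_max=50):
    """Log-density of the cosine-model bivariate von Mises distribution."""
    series = iv(0, kappa1) * iv(0, kappa2) * iv(0, kappa3)
    for p in range(1, p_max + 1):
        series += 2 * iv(p, kappa1) * iv(p, kappa2) * iv(p, kappa3)
    log_c = np.log((2 * np.pi) ** 2 * series)       # normalization constant c
    energy = (kappa1 * np.cos(phi - mu) + kappa2 * np.cos(psi - nu)
              + kappa3 * np.cos(phi - mu - psi + nu))
    return energy - log_c

def mixture_log_likelihood(phi, psi, P, params):
    """Mean log-likelihood over residues. P: (N, K) predicted marginals;
    params: list of K tuples (kappa1, kappa2, kappa3, mu, nu)."""
    dens = np.stack([np.exp(bvm_log_density(phi, psi, *th)) for th in params],
                    axis=1)                          # (N, K) cluster densities
    return np.mean(np.log(np.sum(P * dens, axis=1)))
```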

Selecting proper K

Figure 2 shows entropy loss and log-likelihood as functions of the number of clusters. As expected, the log-likelihood increases with K: more clusters describe the data better. However, once K exceeds 30 there is an obvious decrease in entropy loss, presumably because prediction becomes harder as the number of clusters grows. As there is a large information gain when K goes from 10 to 20 and little difference when K increases from 20 to 30, we tested every clustering between 20 and 30 and found no significant benefit from more clusters. We therefore use K=20 in the following studies.

Fig. 2

Selecting a proper number of clusters K. Left: entropy loss of the discrete label probabilities versus the number of clusters K; Right: log-likelihood of the mixture of bivariate von Mises distributions versus the number of clusters K

Feature contribution study

The features fall into three categories: sequence information, comprising amino acid identity (aa) and profile; predicted secondary structure (SS); and predicted solvent accessibility (ACC). Sequence profile information is generated from PSI-BLAST (PSSM) and HHpred (PSFM) (see Additional file 1: S1.3 for details). To test the impact of different feature combinations, we design six experiments: (1) basic1 = 20 PSSM + 20 aa; (2) basic2 = 20 PSFM + 20 aa; (3) basic = 20 PSSM + 20 PSFM + 20 aa; (4) basic + 3 ACC; (5) basic + 3 SS; (6) basic + 3 ACC + 3 SS. The network architecture is fixed at \(N_{layers}=5\), \(N_{nodes}=100\), halfWinSize = 3 (ResNet 5-100-3), and the regularization factor is fixed at 0.0001.

Table 1 shows the MAE of the different feature combinations on TS1267. From the first three experiments, which involve only sequence information, there is little performance difference between PSSM and PSFM, while their combination gains the best accuracy; PSSM and PSFM are thus complementary and neither can be ignored. ACC and SS both contribute significantly, and again their combination gains the best accuracy. We therefore use the full feature set.

Table 1 The mean absolute error of different feature combinations with ResNet 5-100-3 on TS1267

Overall PCC performance of cosine values compared with other methods

To tune the regularization factor and the network architecture, we perform 5-fold cross-validation on TR5046 (Additional file 1: S2.1 and S2.2) and finally choose an ensemble of 6 networks (Additional file 1: S2.2). We test our method on TS1267 as well as on the popular CASP targets, comprising 85 CASP11 targets and 40 CASP12 targets. Table 2 shows the PCC of the cosine values on the three benchmarks: RaptorX-Angle attains the highest PCC on all datasets. We also evaluate the PCC of the sine values (see Additional file 1: S2.4) and obtain similar results.

Table 2 Pearson correlation coefficient of cosine values between predicted and true angles

Overall MAE performance compared with other methods

Table 3 shows the MAE of RaptorX-Angle and the three other methods on the three benchmarks, broken down by secondary structural region. All methods have larger MAE on the CASP targets than on TS1267, which is reasonable since CASP targets are usually hard to predict. RaptorX-Angle performs best on all benchmarks, with MAE about 0.5° and 1.4° better for ϕ and ψ, respectively, than the second-best method SPIDER2 on both TS1267 and CASP12, and slightly better performance on CASP11. A Student's t test of the absolute errors of RaptorX-Angle versus SPIDER2 gives p-values for ϕ/ψ of 8.65e−12/2.79e−33, 5.13e−2/8.36e−2 and 1.28e−5/2.59e−8 on TS1267, CASP11 and CASP12, respectively; the advantage of RaptorX-Angle over SPIDER2 is thus statistically more significant on TS1267 and CASP12 than on CASP11. These results support representing the Ramachandran plot with a limited number of clusters (here, 20) and reflect the power of deep learning methods.

Table 3 Mean absolute error of four methods for different secondary structural regions on three benchmarks: TS1267, 85 CASP11 targets and 40 CASP12 targets

Mean absolute error performance study in VL1267

Methodologically, the conversion from angle pair to trigonometric vector is nonlinear, so the prediction error may depend on the angles themselves. Biologically, the prediction error may differ between amino acids, which have different microscopic biochemical properties, and between protein classes, which have different macroscopic structures. We therefore study the prediction error on VL1267 in detail.

Mean absolute error performance for different clusters

As each cluster corresponds to a certain angle region, we calculate the MAE for each cluster on VL1267. The 20 clusters are consistent with the Ramachandran plot and with the two peaks for ϕ and ψ [11] (Fig. 3, left), and the prediction errors differ considerably between clusters: clusters with more residues in coil regions tend to have larger prediction errors. Moreover, the prediction error for ϕ is generally smaller than for ψ, although three unusual clusters (5, 6 and 10) show larger MAE for ϕ (Fig. 3, right). Clusters 5 and 6 lie entirely within one of the peak areas of the Ramachandran plot, which may point to interesting biology.

Fig. 3

Mean absolute error performance for different clusters in VL1267. Left: visualization of the 20 cluster centers on the Ramachandran plot, with smaller numbers indicating smaller cluster sizes. Right: mean absolute error for each cluster

Mean absolute error performance for each amino acid type

As different amino acids have different stereochemical and physicochemical properties, they are expected to differ in how difficult their torsion angles are to predict. Table 4 reports the MAE for each of the 20 amino acid types. Glycine, with no side-chain atom other than a proton, places the least steric restriction on backbone dihedral motions and accordingly has the largest prediction error (43.32°/39.59° for ϕ/ψ). In contrast, proline has the smallest MAE for ϕ (8.84°) but an unusually large MAE for ψ (33.00°) owing to its special side-chain structure, consistent with [24]. In addition, the three amino acids with the smallest MAE (Ile, Leu and Val) are all hydrophobic.

Table 4 Mean absolute error performance for each amino acid type in VL1267

Mean absolute error performance for different protein classes

Having examined the MAE at the microscopic level, we next study performance for different macroscopic structures. We extract 99, 117, 171 and 117 proteins from VL1267 (comprising 17696, 24874, 47304 and 19645 residues) in the all-α, all-β, α/β and α+β classes, respectively, and calculate the absolute error for every residue in each class. Figure 4 shows violin plots of the prediction error for ϕ (left) and ψ (right); a violin plot is similar to a box plot except that it also shows the probability density of the data. Although the MAE for ϕ is smaller across all protein classes, the prediction errors of each class have their own distribution pattern, and the pattern is similar between ϕ and ψ. Overall, prediction errors are smallest for all-α proteins and largest for all-β proteins, for both ϕ and ψ.

Fig. 4

Mean absolute error performance for different protein classes in VL1267. Left: for ϕ prediction. Right: for ψ prediction

Estimating confidence score of predicted angles

In general, the variance σ² comprises the within-cluster variance \(\sigma _{w}^{2}\) and the between-cluster variance \(\sigma _{b}^{2}\). To produce confidence scores for our predicted angles, we compute the standard deviation from the within-cluster variances. Specifically, for each cluster k we obtain the in-cluster variance \(\sigma _{k}^{2}(\theta)\) from the training data, where θ=ϕ or ψ, and derive the variance of a prediction as:

$$var(\theta)=\sigma^{2}(\theta)=\sum_{k=1}^{K}p_{k}\sigma_{k}^{2}(\theta) $$
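A one-line sketch of this confidence score:

```python
import numpy as np

def predicted_std(P, cluster_var):
    """P: (N, K) predicted marginals; cluster_var: (K,) in-cluster variances
    of phi (or psi) estimated from training data. Returns per-residue
    standard deviations used as confidence scores."""
    return np.sqrt(P @ cluster_var)
```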

Figure 5 shows the mean standard deviation for ϕ and ψ in different regions. As expected, the variance is smallest in helix regions, followed by strand and then coil regions. The standard deviation in disordered regions is rather large and similar to that in coil regions, consistent with the prior knowledge that disordered regions resemble loop regions and are rather flexible.

Fig. 5

Mean standard deviation for different secondary structural regions in TS1267

Figure 6 shows the relationship between MAE and mean standard deviation for ϕ and ψ in different regions on VL1267. The relationship is roughly linear (R² = 0.8911), so the MAE is well bounded by the standard deviation. We predict the error for each residue of each target in TS1267 and calculate the Pearson and Spearman correlation coefficients (PCC and SCC) between predicted and true errors, as well as the mean absolute error of the predicted errors (MAEPE), obtaining PCC = 0.3109, SCC = 0.5427, MAEPE = 13.94 for ϕ and PCC = 0.2597, SCC = 0.4751, MAEPE = 26.21 for ψ. We also fitted two separate linear models for ϕ and ψ on all data points in VL1267 and obtained similar test results, indicating that the means over the secondary structural regions already capture most of the relationship between the estimated standard deviation and the prediction error (Additional file 1: S2.6).

Fig. 6

Relationship between prediction error and standard deviation. The eight points correspond to the two angles (ϕ,ψ) in four secondary structural regions (total, helix, strand, coil)

Computational cost analysis

All of the above methods predict angles target by target, so the computational cost is bounded by the longest protein (the protein with the largest number of residues). To generate angle predictions for 1xdoA, the largest protein in TS1267 with 685 residues, ANGLOR, SPINE X, SPIDER2 and RaptorX-Angle take 726 s, 123 s, 370 s and 524 s, respectively.

In our view, the computational cost is mainly determined by the overall method design, network complexity, feature engineering and available hardware. ANGLOR, a composite method developed when hardware was less powerful, takes the most time. SPINE X adopts a simple network, whereas SPIDER2 uses more features iteratively in a more complex network and therefore takes longer than SPINE X.

Compared with the second-best method SPIDER2, RaptorX-Angle uses much deeper networks and, besides the PSSM from PSI-BLAST used by SPINE X and SPIDER2, also adopts profile information from HHblits (PSFM). As a result, SPIDER2 takes 360 s to generate features with 4 CPUs and 20 s to predict angles on one CPU, while RaptorX-Angle takes 385 s to generate features with 4 CPUs and 200 s to predict angles from those features on a GPU card.

However, we can batch the features of all proteins and run them at once: angle prediction for all proteins in TS1267 takes just 750 s, whereas the other methods would need many CPUs running in parallel. Overall, our method is faster when predicting many proteins and attains better prediction accuracy.

Discussion

We have transformed the hard real-value prediction problem into a discrete label assignment problem, which simplifies the task and also yields better results. Overall, RaptorX-Angle attains the best PCC in terms of the cosine and sine of the angles on all datasets, and its MAE is about 0.5° and 1.4° better for ϕ and ψ, respectively, than the second-best method SPIDER2 on a subset of PDB25. We have also calculated the two-state accuracy to gauge the improvement on large angle errors: RaptorX-Angle performs best, with about 0.15 and 1 percentage point improvement over SPIDER2 for ϕ and ψ on TS1267 (see Additional file 1: S2.5). Our method also works very well on the CASP targets. Moreover, we estimate the prediction error at each residue by mixing the clusters with their predicted probabilities, and show an approximately linear relationship between the real prediction error and the in-cluster standard deviation; this is a unique feature of our method. In addition, we examined the predictions for disordered regions: as no angle information is available there, we analyzed only the standard deviation, finding rather large values with patterns similar to coil regions, consistent with the prior knowledge that disordered regions are rather flexible and resemble loop regions. We also performed comprehensive studies of prediction performance on VL1267, from both microscopic and macroscopic viewpoints.

This simple technique outperforms other state-of-the-art methods, demonstrating that for protein structures 20 clusters contain enough information about (ϕ,ψ): an efficient compression of the information. The idea of predicting dihedral angles via clustering has proved successful for three reasons. The first is the continuous growth of solved structures [52], which provides sufficient training data. The second is the novel idea of predicting real-valued angles by mixing a set of clusters with their predicted probabilities; conversely, the good performance demonstrates that the distribution of protein backbone dihedral angles can be described by a set of clusters. Last but not least, the continuing development of deep learning models and optimization methods proves a powerful engine for promoting new ideas and exploiting new methods.

There is still room for improvement, however. RaptorX-Angle uses only one-dimensional features and a 1D CNN, so it cannot capture long-range interactions. Heffernan et al. developed the more accurate SPIDER3, employing bidirectional recurrent neural networks (BRNNs) with Long Short-Term Memory (LSTM) cells, which can capture long-range interactions [53]; that is, accounting for pairwise interactions can further increase prediction accuracy. We will include two-dimensional features and exploit 2D CNNs to see how much improvement can be achieved.

Moreover, as mentioned above, the accumulation of prediction errors has so far limited the usefulness of torsion angles for constructing 3D models. A proper technique for handling these errors is in great demand, and a general pipeline that incorporates angle restraints with confidence scores to improve protein tertiary structure prediction remains to be developed.

Conclusions

In conclusion, this study provides a more reliable prediction of dihedral angles and may facilitate protein structure prediction and functional study. In the near future, the predicted angles can serve as restraints for tertiary structure prediction, with careful handling of errors and flexibility, and can also aid structure alignment and fold recognition.