SPIDER2: A Package to Predict Secondary Structure, Accessible Surface Area, and Main-Chain Torsional Angles by Deep Neural Networks
Predicting one-dimensional structural properties has played an important role in improving the prediction of protein three-dimensional structures and functions. The most commonly predicted properties are secondary structure and accessible surface area (ASA), representing local and nonlocal structural characteristics, respectively. Secondary structure prediction is further complemented by prediction of continuous main-chain torsional angles. Here we describe a newly developed method, SPIDER2, that utilizes three iterations of deep learning neural networks to improve the prediction accuracy of several structural properties simultaneously. For an independent test set of 1199 proteins, SPIDER2 achieves 82 % accuracy for secondary structure prediction, 0.76 for the correlation coefficient between predicted and actual solvent accessible surface area, 19° and 30° for mean absolute errors of backbone φ and ψ angles, respectively, and 8° and 32° for mean absolute errors of Cα-based θ and τ angles, respectively. The method provides state-of-the-art, all-in-one accurate prediction of local structure and solvent accessible surface area. The method is implemented as a web server, along with a standalone package, both of which are available on our website: http://sparks-lab.org.
Key words: Secondary structure prediction, Solvent accessible surface area, Backbone torsion angles, Deep neural networks, C alpha-based angles
With the rapid development of DNA sequencing techniques, there is a continuously increasing gap between the number of sequences available from genomic analysis and the number of structures and functions determined or annotated by expensive experimental techniques. It is highly desirable to develop theoretical methods to predict protein structures and functions from their one-dimensional sequences. However, methods for highly accurate prediction of protein three-dimensional structures (except homology modeling) are not yet available. This has significantly limited the ability to annotate protein functions based on their three-dimensional structures. As a result, predicted one-dimensional structural properties of proteins have often been utilized for predicting protein functions [1, 2, 3, 4], their binding sites to other molecules [5, 6, 7], and other studies [8, 9, 10, 11]. They have also been widely employed to improve protein structure prediction methods: both ab initio [12, 13, 14] and template-based techniques [15, 16, 17, 18]. Thus any improvement in predicted one-dimensional structural properties will benefit protein structure and function modeling.
The most commonly predicted one-dimensional structural property of a protein is three-state secondary structure (helix, sheet, and coil). Secondary structure prediction accuracy without using homologous sequences in training has gradually improved to above 81 % in recent years [19, 20], owing to improved machine-learning algorithms, better features, and larger available training datasets.
An alternative to secondary structure is an angle-based representation of backbone structure. An angle-based description such as the torsion angles φ and ψ offers a continuous representation of local conformation, rather than the discontinuous and somewhat arbitrary definition of three secondary-structure states. This advantage has led to methods for predicting the torsional angles φ and ψ [12, 21] and Cα-based angles [the angle between Cα i−1 − Cα i − Cα i+1 (θ) and the dihedral angle rotated about the Cα i−1 − Cα i bond (τ)].
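The Cα-based angles can be made concrete with a short sketch. The function below computes θ and τ from an array of Cα coordinates using NumPy. The function name and the use of NaN for undefined positions are illustrative choices, not part of SPIDER2, and the four atoms used for τ (Cα i−2, Cα i−1, Cα i, Cα i+1) follow the output-file description given in Subheading 4.

```python
import numpy as np

def ca_theta_tau(ca):
    """Return (theta, tau) in degrees for an (N, 3) array of C-alpha coordinates.

    theta[i]: angle at CA(i) formed by CA(i-1), CA(i), CA(i+1).
    tau[i]:   dihedral of CA(i-2), CA(i-1), CA(i), CA(i+1), i.e. the rotation
              about the CA(i-1)-CA(i) bond.
    Positions where an angle is undefined (chain ends) are set to NaN.
    """
    ca = np.asarray(ca, dtype=float)
    n = len(ca)
    theta = np.full(n, np.nan)
    tau = np.full(n, np.nan)

    for i in range(1, n - 1):
        u = ca[i - 1] - ca[i]
        v = ca[i + 1] - ca[i]
        cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        theta[i] = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

    for i in range(2, n - 1):
        b1 = ca[i - 1] - ca[i - 2]
        b2 = ca[i] - ca[i - 1]
        b3 = ca[i + 1] - ca[i]
        # Standard atan2 form of the dihedral angle, giving values in [-180, 180].
        y = np.linalg.norm(b2) * np.dot(b1, np.cross(b2, b3))
        x = np.dot(np.cross(b1, b2), np.cross(b2, b3))
        tau[i] = np.degrees(np.arctan2(y, x))

    return theta, tau
```

Because θ needs three consecutive Cα atoms and τ needs four, the first and last residues of a chain carry no θ value, and the first two and the last residue carry no τ value.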
Another important one-dimensional structural property is the solvent accessible surface area (ASA), which measures the exposure of amino acid residues to solvent and is important for understanding and predicting protein structure, function, and interactions [23, 24, 25, 26]. Earlier multistate prediction [23, 27, 28] has gradually given way to continuous real-value prediction [29, 30, 31, 32, 33].
In a recent study, we developed SPIDER2, an iterative deep-learning neural network, to predict all of the above-mentioned structural properties at the same time. The iterative and cross-learning method achieved 82 % accuracy for secondary structure prediction, 0.76 for the correlation coefficient between predicted and actual solvent accessible surface area, 19° and 30° for mean absolute errors of backbone φ and ψ angles, respectively, and 8° and 32° for mean absolute errors of Cα-based θ and τ angles, respectively, for an independent test dataset of 1199 proteins. The resulting method provides state-of-the-art, all-in-one accurate prediction of local structure and solvent accessible surface area.
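When comparing predicted and actual angles, the periodicity of angles has to be taken into account: 179° and −179° differ by 2°, not 358°. The sketch below shows one common way to compute a mean absolute error for periodic angles. It is an assumption made for illustration; the text above does not state how the reported errors were calculated, and the function name is hypothetical.

```python
import numpy as np

def angular_mae(predicted_deg, actual_deg):
    """Mean absolute error between two sets of angles (degrees), with wrap-around.

    Each difference is taken as the smaller of |d| and 360 - |d|, so the
    per-residue error never exceeds 180 degrees.
    """
    predicted = np.asarray(predicted_deg, dtype=float)
    actual = np.asarray(actual_deg, dtype=float)
    diff = np.abs(predicted - actual) % 360.0
    return float(np.mean(np.minimum(diff, 360.0 - diff)))

# Errors of 2 degrees (across the +/-180 boundary) and 10 degrees.
print(angular_mae([179.0, -60.0], [-179.0, -50.0]))   # 6.0
```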
The SPIDER2 server version was trained on a dataset of 5789 nonredundant (25 % cutoff), high-resolution (<2.0 Å) structures by employing three consecutive deep neural networks trained iteratively. In each iteration, we employed a deep neural network (DNN) consisting of three hidden layers with 150 hidden nodes in each layer. The weights were initialized by a stacked sparse auto-encoder and then refined by standard back-propagation through supervised fine-tuning [36, 37]. The learning rates for back-propagation were 1, 0.5, 0.2, and 0.05, applied in that order with 30 epochs at each learning rate. The input layer for the DNN in the first iteration consists of 459 features (27 features per residue for a sliding window of 17 residues centered at the query residue). These 27 features include seven representative physicochemical properties of the amino acids (steric parameter (graph shape index), hydrophobicity, volume, polarizability, isoelectric point, helix probability, and sheet probability) and 20 substitution probabilities obtained from three iterations of PSI-BLAST searching. All input features are normalized to the range of 0 to 1. For residues near the ends of a protein, the features of the terminal residue at that end were duplicated so that a full window could be used. The outputs are 12 predicted values: probabilities for the three secondary structure states, relative ASA, and the sine and cosine of the four angles θ, τ, φ, and ψ. The input layers for the DNN in the second and third iterations consist of the 12 values predicted in the previous iteration plus the 27 features described above for each residue, that is, 663 features [=(12 + 27) × 17].
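The window construction and the feature counts above can be checked with a minimal NumPy sketch. It assumes the per-residue features are already scaled to the range 0 to 1 and uses random numbers in place of real features. The function names are illustrative and are not taken from the SPIDER2 package, and the arctangent-based decoding of a sine/cosine pair is the usual way to recover an angle, shown here as an assumption about how the network outputs are converted.

```python
import numpy as np

WINDOW = 17           # residues per sliding window (from the text)
HALF = WINDOW // 2    # 8 residues on each side of the query residue

def window_features(per_residue):
    """Flatten a 17-residue window of features around every residue.

    per_residue: (L, F) array of per-residue features.
    Returns an (L, WINDOW * F) array. Window positions that fall beyond a
    terminus are filled by repeating the first or last residue's features.
    """
    x = np.asarray(per_residue, dtype=float)
    padded = np.pad(x, ((HALF, HALF), (0, 0)), mode="edge")
    return np.stack([padded[i:i + WINDOW].ravel() for i in range(len(x))])

def decode_angle(sin_value, cos_value):
    """Recover an angle in degrees (-180 to 180) from its sine and cosine."""
    return np.degrees(np.arctan2(sin_value, cos_value))

rng = np.random.default_rng(0)
first_iter = window_features(rng.random((50, 27)))        # 27 features per residue
later_iter = window_features(rng.random((50, 12 + 27)))   # plus 12 previous outputs
print(first_iter.shape, later_iter.shape)   # (50, 459) (50, 663)
```

Predicting the sine and cosine of each angle, rather than the angle itself, avoids the discontinuity at ±180°.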
3 Web Server
1. As shown in Fig. 1a, a protein sequence can be entered (or copy-pasted) in FASTA format into the text area. Only one protein sequence is allowed per submission. The sequence must contain only the 20 standard amino acids. The first comment line in the FASTA format (“>” followed by the protein name) is used to identify the query protein. Without this line, the protein name is set to “unknown” by default. The email address and target name on the webpage are optional. If you have a DNA/RNA sequence, you first need to convert it into a protein sequence (see Note 1).
2. After clicking the “submit” button, the job is sent to a queue, and the browser is directed to a new page, where “Click the link” points to the result file that will become available. This page is automatically refreshed every 60 s until the job is completed and the result is displayed.
3. Each prediction is usually completed within 10 min but may take up to a few hours, depending on how busy the server is and how long the protein chain is. If an email address is provided at submission, a link to the result webpage is sent to that address as soon as the prediction is finished. All prediction results are kept on the server for 1 month and then automatically deleted.
4. If users have their own Position Specific Substitution Matrix (PSSM) file for the query protein sequence, a SPIDER2 prediction can be made by submitting the PSSM file to the server. Using an external PSSM file skips the most time-consuming step, the generation of the evolutionary profile by PSI-BLAST, and reduces the execution time to a few seconds.
5. To save computing resources, please do not submit the same query sequence more than once. The status of your job can be checked by clicking the link “Check the current Queue to prevent DUPLICATE submission” on the server webpage.
6. Figure 1b shows an example of the output webpage. Aligned lines starting with “SEQ,” “SS,” and “rASA” represent the query sequence, predicted secondary structure, and predicted relative accessible surface area, respectively. For SS, predicted coil, sheet, and helix residues are represented by “−,” red “E,” and green “H,” respectively. For rASA, the relative ASA is represented by a digit from 0 to 9, with “0” for up to 10 % of the residue surface exposed and “9” for above 90 % exposed. Residues with rASA less than 20 % (buried residues) are labeled in blue. Here, rASA is normalized by a residue-specific reference value (the ASA of the residue in its fully exposed state when flanked by an ALA on each side). This output page does not contain the predicted secondary structure probabilities, the predicted angles, or the real values of ASA. The complete prediction file “pro1.spd3” (see Subheading 4, step 5 for an explanation of the file), together with other intermediate files such as the PSSM, can be downloaded by following the link on this output webpage.
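The single-digit rASA encoding on the output page can be sketched as follows. The handling of values that fall exactly on a bin boundary (for example, exactly 10 %) is an assumption, since the text only gives the two extreme bins, and the reference ASA passed to relative_asa is a placeholder value rather than one of the residue-specific reference values used by the server. All function names are illustrative.

```python
def relative_asa(asa, reference_asa):
    """Relative ASA: ASA divided by the residue-specific reference, clipped to 0..1."""
    return min(max(asa / reference_asa, 0.0), 1.0)

def rasa_digit(rasa):
    """Map a relative ASA (fraction, 0 to 1) to a digit 0-9.

    Assumed binning: digit d covers [d*10 %, (d+1)*10 %), with 100 % folded
    into the top bin so that '9' means above 90 % exposed.
    """
    return min(int(rasa * 10), 9)

def is_buried(rasa, cutoff=0.20):
    """Residues with rASA below 20 % are the ones shown in blue on the page."""
    return rasa < cutoff

def rasa_line(rasa_values):
    """Build the digit string aligned with the sequence, one digit per residue."""
    return "".join(str(rasa_digit(r)) for r in rasa_values)

print(rasa_line([0.05, 0.15, 0.55, 0.95, 1.0]))   # 01599
print(relative_asa(55.0, 110.0))                  # 0.5 (110.0 is a placeholder reference)
```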
4 Standalone Software
1. Download the software package from our homepage via the shortcut link http://sparks-lab.org/pmwiki/download/index.php?Download=yueyang/SPIDER2_local.tgz after entering your name and email address. This information is used only for notification of future updates. You can enter “none” if you prefer not to leave your information.
2. Unpack the package with the command “tar zxvf SPIDER2_local.tgz”, which creates the directory “SPIDER2_local” containing a “Readme” file and three subdirectories: “dat,” “ex,” and “misc.” The “dat” directory contains three npz files holding the trained parameters of the three iterative neural networks, and the “misc” directory contains the program and auxiliary script files.
3. If the BLAST or BLAST+ package is not installed on your computer, the software can be obtained from the NCBI website. The program further requires a correctly formatted nonredundant (NR) protein sequence database, which can be downloaded from NCBI at ftp://ftp.ncbi.nlm.nih.gov/blast/db (all files starting with “nr”). As of Oct 2015, the NR database comprised 40 files totaling 22 GB before uncompressing. Alternatively, you can use a database from which highly homologous sequences have been removed, e.g., UniRef90 (see Note 2). This will speed up the calculation without significantly changing the prediction accuracy. This step can be skipped if you have already prepared PSSM files (see Note 3).
4. SPIDER2 is called by the command “run_local.sh,” followed by all sequence files in FASTA format. Each input file may contain only one protein sequence (see Note 4).
5. Results are saved in an output file with the extension “spd3.” An example of the output is shown in Fig. 2. The output file contains 11 columns that represent the residue index, residue type, predicted secondary structure type, ASA, φ, ψ, θ, τ, and the probabilities of coil (C), sheet (E), and helix (H). The predicted secondary structure is the secondary structure type with the highest probability. The θ angle at residue index i is the angle between Cα i−1 − Cα i − Cα i+1, and τ is the dihedral angle formed by Cα i−2 − Cα i−1 − Cα i − Cα i+1. The three torsional angles φ, ψ, and τ range from −180° to 180°, and the angle θ mostly ranges between 70° and 180°.
6. In addition, the package includes a program, “pred_nopssm.py,” that makes predictions without using the PSSM from PSI-BLAST. Instead, the profile is replaced by the BLOSUM62 substitution matrix. This replacement allows a fast calculation at a lower accuracy (for example, a secondary structure accuracy of 68.9 %, compared to 81.8 % with the PSI-BLAST profile). This may be useful for large-scale calculations at the genome level. However, it should be noted that the parameters were not optimized for evolution-profile-free prediction, and the development of a dedicated sequence-only predictor is in progress.
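The 11-column “spd3” output described in step 5 can be read with a few lines of Python. The sketch below is not part of the SPIDER2 package. It assumes whitespace-separated columns in the order listed in step 5 and skips any line that starts with “#” or does not have 11 fields, which is an assumption about how header lines appear. The example rows are made-up values used only to exercise the parser.

```python
from typing import List, NamedTuple

class Spd3Row(NamedTuple):
    index: int
    residue: str
    ss: str          # predicted secondary structure: C, E, or H
    asa: float
    phi: float
    psi: float
    theta: float
    tau: float
    p_coil: float
    p_sheet: float
    p_helix: float

def parse_spd3(text: str) -> List[Spd3Row]:
    """Parse the text of an spd3 file into one Spd3Row per residue."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed: header/comment lines start with '#'; blank lines are ignored.
        if line.lstrip().startswith("#") or len(fields) != 11:
            continue
        numbers = [float(value) for value in fields[3:]]
        rows.append(Spd3Row(int(fields[0]), fields[1], fields[2], *numbers))
    return rows

def most_probable_state(row: Spd3Row) -> str:
    """Secondary structure state with the highest predicted probability."""
    probs = {"C": row.p_coil, "E": row.p_sheet, "H": row.p_helix}
    return max(probs, key=probs.get)

# Synthetic example rows (not real SPIDER2 output).
example = """\
# idx AA SS ASA Phi Psi Theta Tau P(C) P(E) P(H)
1 M C 120.5 -95.0 140.2 110.3 -160.4 0.90 0.06 0.04
2 K H 80.1 -62.5 -41.0 92.7 50.8 0.10 0.05 0.85
"""
rows = parse_spd3(example)
print(len(rows), rows[1].ss, most_probable_state(rows[1]))   # 2 H H
```

For a real run, the text can be obtained with open("pro1.spd3").read(), where the file name is whatever output file was produced for your query.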
5 Notes
1. The query sequence must be a protein sequence in FASTA format. A gene given as a DNA/RNA sequence has to be translated into an amino acid sequence first. This conversion can be made with http://web.expasy.org/translate or any other tool. Nonstandard amino acids (e.g., X) must be removed prior to the use of SPIDER2.
2. The package employs PSI-BLAST to generate the PSSM by scanning the NR database. Alternatively, you can use the UniRef90 sequence database, which can be downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz. This database can be converted to a BLAST-readable format by the command “gunzip -c uniref90.fasta.gz | ~/aspen/software/ncbi-blast-2.2.30+/bin/makeblastdb -in - -dbtype prot -parse_seqids -out uniref90 -title uniref90” (replace the makeblastdb path with that of your own BLAST+ installation). Piping the decompressed stream directly into makeblastdb avoids writing the large uncompressed database to disk.
3. Users with their own PSSM files can obtain predictions by running the script “pred_pssm.py” followed by the PSSM file names. This skips running PSI-BLAST, and the prediction finishes in a few seconds.
4. If your sequence file contains more than one protein sequence, you can use the script “splitseq.py” to split it into multiple files, each named according to the corresponding protein name in the FASTA file.
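The input requirements in Notes 1 and 4 (one sequence per file, only the 20 standard amino acids) can be checked before running a prediction. The sketch below is an independent illustration and not the package's “splitseq.py”. The output file naming, the “.fasta” extension, and the use of “unknown” for records without a name are assumptions.

```python
import re
from pathlib import Path

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids

def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None or chunks:
                yield (name or "unknown", "".join(chunks))
            header = line[1:].split()
            name, chunks = (header[0] if header else "unknown"), []
        else:
            chunks.append(line.upper())
    if name is not None or chunks:
        yield (name or "unknown", "".join(chunks))

def nonstandard_residues(sequence):
    """Return the sorted list of characters that are not standard amino acids."""
    return sorted(set(sequence) - STANDARD)

def split_fasta(text, out_dir):
    """Write one single-sequence FASTA file per record and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, seq in read_fasta(text):
        bad = nonstandard_residues(seq)
        if bad:
            raise ValueError(f"{name}: nonstandard residues {bad} must be removed")
        safe = re.sub(r"[^A-Za-z0-9_.-]", "_", name)   # keep file names portable
        path = out_dir / f"{safe}.fasta"               # naming is an assumption
        path.write_text(f">{name}\n{seq}\n")
        paths.append(path)
    return paths
```

For example, split_fasta(open("all.fasta").read(), "single_seqs") would write one file per record of a hypothetical multi-sequence file “all.fasta”, and the resulting files could then be passed to “run_local.sh”.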
This work was supported in part by National Health and Medical Research Council (1059775) of Australia and Australian Research Council’s Linkage Infrastructure, Equipment and Facilities funding scheme (project number LE150100161), the Taishan Scholars Program of Shandong province of China, National Natural Science Foundation of China (61540025) to Y.Z. and National Natural Science Foundation of China (61271378) to Y.Y. and J.W. We also gratefully acknowledge the support of the Griffith University eResearch Services Team and the use of the High Performance Computing Cluster “Gowonda” to complete this research. This research/project has also been undertaken with the aid of the research cloud resources provided by the Queensland Cyber Infrastructure Foundation (QCIF).
1. Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H (2014) Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 9(1)
8. Folkman L, Yang Y, Li Z, Stantic B, Sattar A, Mort M, Cooper DN, Liu Y, Zhou Y (2015) DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. Bioinformatics 31(10):1599–1606
10. Zhao H, Yang Y, Lin H, Zhang X, Mort M, Cooper DN, Liu Y, Zhou Y (2013) DDIG-in: discriminating between disease-associated and neutral non-frameshifting micro-indels. Genome Biol 14:R43
11. Lyons J, Dehzangi A, Heffernan R, Yang Y, Zhou Y, Sharma A, Paliwal K (2015) Advancing the accuracy of protein fold recognition by utilizing profiles from Hidden Markov models. IEEE Trans Nanobiosci 14:761–772
13. Bradley P, Chivian D, Meiler J, Misura KM, Rohl CA, Schief WR, Wedemeyer WJ, Schueler-Furman O, Murphy P, Schonbrun J, Strauss CE, Baker D (2003) Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins 53(Suppl 6):457–468. doi: 10.1002/prot.10552
15. Yang Y, Faraggi E, Zhao H, Zhou Y (2011) Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15):2076–2082. doi: 10.1093/bioinformatics/btr350
35. Bengio Y, Lamblin P, Popovici D, Larochelle H (2007) Greedy layer-wise training of deep networks. Adv Neural Inform Process Syst 19:153