Discovering MoRFs by trisecting intrinsically disordered protein sequence into terminals and middle regions
Abstract
Background
Molecular Recognition Features (MoRFs) are short protein regions present in intrinsically disordered protein (IDP) sequences. MoRFs interact with structured partner proteins and, upon interaction, undergo a disorder-to-order transition to perform various biological functions. Analyses of MoRFs are important for understanding their function.
Results
Performance is reported using the MoRF dataset that has previously been used to compare existing MoRF predictors. The performance obtained in this study is comparable to that of the benchmark OPAL predictor: OPAL achieved an AUC of 0.815, whereas the model in this study achieved an AUC of 0.819 on the TEST set.
Conclusion
Achieving comparable performance, the proposed method can be used as an alternative approach for MoRF prediction.
Abbreviations
- ASA: Accessible surface area
- AUC: Area under the curve
- FPR: False positive rate
- HSE: Half-sphere exposure
- IDPs: Intrinsically disordered proteins
- IDRs: Intrinsically disordered regions
- MoRFs: Molecular recognition features
- RBF: Radial basis function
- SS: Secondary structure
- SVM: Support vector machine
- TPR: True positive rate
Background
In the traditional view, the function of a protein depends critically on a well-defined three-dimensional structure. This concept implies that the protein sequence defines the structure, which in turn determines the protein function. However, recent studies have revealed that many proteins do not form a defined three-dimensional structure and yet are functional [1, 2, 3, 4]. These proteins are called intrinsically disordered proteins (IDPs), and such regions within proteins are called intrinsically disordered regions (IDRs). IDPs and IDRs lack the hydrophobic cores that make up structured domains. Thus, the functionality of these proteins arises in a different manner from that described by the protein structure-function paradigm.
IDPs contain functional sites that are associated with important cellular functions, such as transcriptional regulation and signal transduction [2, 3]. Molecular recognition features (MoRFs) are among the important functional sites that reside in IDPs, and they permit interaction with structured partner proteins [2, 5, 6]. Upon interaction, they undergo a disorder-to-order transition and adopt conformations such as α-helix (α-MoRFs), β-strand (β-MoRFs), γ-coil (γ-MoRFs), or mixtures of these (complex-MoRFs). To gain a deeper understanding of disordered proteins and MoRFs, several studies have been carried out and databases have been introduced [5, 6, 7, 8, 9, 10].
MoRFs can be analyzed using experimental methods; however, these experiments are time-consuming and expensive to perform. It is therefore prudent to identify MoRFs in disordered protein sequences computationally. Many machine learning methods for predicting MoRFs have been studied in this respect [8, 9, 11, 12, 13, 14, 15]. A detailed review of the available state-of-the-art methods is given in our previous work [15].
Analyzing the structural properties of MoRFs, their conformational behavior, and their interaction mechanisms with various binding regions helps in understanding MoRF properties. Disordered regions may fluctuate between several states, including coil-like states, localized secondary structure and more compact states. The structural characteristics and the individual conformational states are determined by the nature of the amino acids in the disordered sequences. To this end, we predicted the structural properties of the disordered regions using a structural predictor [16] and used them to identify MoRFs.
To classify the amino acid residues of a protein sequence as MoRF or non-MoRF, a learning algorithm requires information about the residue itself and about its neighboring residues. For the terminal residues of a disordered protein sequence, however, complete neighboring information is not available, which adds complexity to the learning task if a single model is trained to predict all amino acids of the sequence. We therefore expect that training separate models for the middle and terminal regions will improve performance, because the neighboring information of each residue is then incorporated appropriately.
In this paper, we present a MoRF prediction scheme that uses support vector machine (SVM) models to predict MoRFs in protein sequences. In the proposed scheme, separate SVM models are used to predict the terminal and middle regions of a protein sequence. To do this, we constructed two SVM models: the first is trained using the terminal regions of the training sequences and the second is trained using the middle regions. The presented scheme differs from the design of other state-of-the-art methods in that separate models are used to predict the terminal and middle regions. To complement the information present in the protein regions, we followed an approach similar to that of Malhis et al. [12, 13] and Sharma et al. [15], in which the scores of several MoRF prediction models are combined. We therefore selected the predictors MoRFpred-plus [14], PROMIS [15] and MoRFchibi [11], and combined their scores with the scores of the proposed model. The main aim of this amalgamation is to use the different sources of information encoded in the protein regions, as this has been shown to improve MoRF prediction accuracy. The proposed model uses structural information, MoRFpred-plus uses evolutionary profiles and physicochemical properties, MoRFchibi uses physicochemical properties, and PROMIS uses structural information; the predictors are also developed using different learning algorithms. The reported performance of the combined model in this study is close to that of the benchmark predictor.
Method
Benchmark dataset
Datasets used to train and test a MoRF predictor
Data set type | Data set | No. of sequences | Total residues | No. of MoRF residues | No. of non-MoRF residues |
---|---|---|---|---|---|
Training set | TRAIN | 421 | 245,984 | 5,396 | 240,588 |
Test set | TEST | 419 | 258,829 | 5,153 | 253,676 |
Test set | NEW | 45 | 37,533 | 626 | 36,907 |
Test set | TEST464 | 464 | 296,362 | 5,779 | 290,583 |
Test set | TEST266 | 266 | 154,399 | 3,305 | 151,094 |
Validation set | EXP53 | 53 | 25,186 | 2,432 | 22,754 |
Overview of the proposed method
Overview of the proposed method. Fuse score means that the model scores are combined to provide the whole sequence scores
There exist many tools to obtain structural information from a protein sequence. In this study, we used the SPIDER2 predictor [16] to predict structural attributes of the protein sequences, namely SS, ASA, HSE and backbone torsion angles. SS describes the structure of the protein sequence in a number of discrete states, such as helix, coil, and sheet. The SS output is a three-dimensional vector containing the transition probabilities to the three secondary structures. ASA represents the exposure level of the amino acids to solvent, and its output is a one-dimensional vector. The backbone angles are the backbone dihedral angles of the amino acids in the protein sequence: Phi, Psi, Theta (θ) and Tau (τ). HSE gives the number of Cα atoms in the upper and lower half-spheres around each amino acid. We used measures including HSE alpha and HSE beta along with the contact numbers for the amino acids.
Support vector machine
An SVM classifier with a radial basis function (RBF) kernel is used for MoRF prediction. To evaluate the proposed method, we used the same values of C and gamma (1000 and 0.0038, respectively) as in our previous study [15]. We selected these values because the datasets used and the features computed in both studies are similar, and because these values gave good results in our previous study [15].
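As a point of reference for the kernel parameters above, the RBF kernel measures the similarity of two feature vectors as exp(-gamma · ||x − y||²). The following is a minimal sketch in plain Python using the reported gamma; the function name and the toy vectors are illustrative assumptions and are not taken from the authors' code.

```python
import math

GAMMA = 0.0038  # kernel width reported in the text
C = 1000        # soft-margin penalty reported in the text (used by the SVM solver, not by the kernel)

def rbf_kernel(x, y, gamma=GAMMA):
    """Return exp(-gamma * squared Euclidean distance) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy feature vectors (hypothetical values, for illustration only)
x = [0.2, 0.5, 0.1]
y = [0.4, 0.1, 0.3]
print(rbf_kernel(x, x))  # identical vectors have similarity 1.0
print(rbf_kernel(x, y))  # distinct vectors have similarity strictly between 0 and 1
```

With a library such as scikit-learn, the same configuration could be written as `SVC(kernel="rbf", C=1000, gamma=0.0038)`; the text does not state which SVM implementation was used.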
Training
Let each protein sequence in the training set be defined as

$$P_i = (A_1, A_2, \ldots, A_{n_i}), \quad i = 1, 2, \ldots, T \qquad (1)$$

where Aj is the j-th amino acid in the sequence, T is the total number of protein sequences in the training set and ni is the length of protein sequence Pi. Before defining the positive and negative segments representing MoRFs and non-MoRFs, it is essential to select a suitable flank size (the number of neighboring residues), as this size determines the length of the terminal regions. We selected a flank size of 20 based on our previous study [15], in which this flank size gave good performance for MoRF prediction. Using a flank size of 20, the segments were extracted as follows. For a protein Pi, if the j-th amino acid is part of a MoRF region for 1 ≤ j ≤ 20 or ni − 20 < j ≤ ni, we extract the MoRF region plus flank regions of 20 amino acids upstream and downstream of the MoRF region (where they exist) as a positive segment for STENMoRF, the terminal-region model. If the j-th amino acid is part of a MoRF region for 20 < j ≤ ni − 20, we extract the MoRF region plus flanks of 20 amino acids upstream and downstream of the MoRF region as a positive segment for MIDMoRF, the middle-region model. In addition, a negative segment of the same size as the positive segment is extracted from a non-MoRF region in a similar way for STENMoRF and MIDMoRF, respectively.
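The segment extraction rule can be sketched as follows. This is an illustrative reading of the rule, not the authors' implementation: the function name is hypothetical, positions are 1-based and inclusive, and it is assumed that a MoRF is routed to the terminal model as soon as any of its residues lies within 20 residues of either end of the sequence.

```python
FLANK = 20  # flank size reported in the text

def extract_positive_segment(n, morf_start, morf_end, flank=FLANK):
    """Return (model_name, seg_start, seg_end) using 1-based inclusive positions.

    Assumption: a MoRF goes to the terminal model (STENMoRF) if any of its
    residues falls within `flank` residues of either sequence end; otherwise
    it goes to the middle model (MIDMoRF).
    """
    if not (1 <= morf_start <= morf_end <= n):
        raise ValueError("invalid MoRF coordinates")
    touches_terminal = morf_start <= flank or morf_end > n - flank
    model = "STENMoRF" if touches_terminal else "MIDMoRF"
    seg_start = max(1, morf_start - flank)  # upstream flank is truncated at the N-terminus
    seg_end = min(n, morf_end + flank)      # downstream flank is truncated at the C-terminus
    return model, seg_start, seg_end

print(extract_positive_segment(300, 5, 15))     # ('STENMoRF', 1, 35)
print(extract_positive_segment(300, 100, 112))  # ('MIDMoRF', 80, 132)
print(extract_positive_segment(300, 285, 295))  # ('STENMoRF', 265, 300)
```

Segments routed to MIDMoRF always carry two complete flanks, whereas STENMoRF segments may have one flank shortened by the sequence boundary.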
We extract an equal number of positive and negative samples following the steps of the StructMoRF method described in Sharma et al. [15]: a positive sample is extracted from a positive segment and a negative sample is extracted from a negative segment. The feature vector of each sample is computed from structural attributes. If u structural attributes are considered, the structural matrix M for a sample S of length l is given as:

$$M = \begin{bmatrix} M_{1,1} & M_{1,2} & \cdots & M_{1,u} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,u} \\ \vdots & \vdots & \ddots & \vdots \\ M_{l,1} & M_{l,2} & \cdots & M_{l,u} \end{bmatrix} \qquad (2)$$
where Mi,j is the element of matrix M for 1 ≤ i ≤ l and 1 ≤ j ≤ u. To extract features from matrix M, we use auto-covariance-based features for STENMoRF. The auto-covariance feature is computed from matrix M as follows:
where DF is the distance factor. The computed feature matrix ACk,j is of size DF × u and is reshaped into a vector of length DF × u. Based on the observed performance, DF was set to 10. To extract features for MIDMoRF, we use the feature extraction procedure of the StructMoRF method described in Sharma et al. [15].
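The auto-covariance computation can be illustrated with a short sketch. The exact expression is not reproduced in this text, so the code assumes one commonly used form, AC(k, j) = 1/(l − k) · Σ_{i=1}^{l−k} M(i, j) · M(i+k, j) for k = 1, ..., DF and j = 1, ..., u; this form, the function name and the toy matrix are assumptions for illustration only.

```python
def auto_covariance(M, DF=10):
    """Flattened auto-covariance features of a structural matrix.

    M  : list of l rows, each holding u attribute values for one residue.
    DF : distance factor (largest lag); the text reports DF = 10.
    Assumed form: AC[k][j] = (1 / (l - k)) * sum_i M[i][j] * M[i + k][j].
    Returns a list of length DF * u, ordered by lag k and then attribute j.
    """
    l = len(M)
    u = len(M[0])
    if l <= DF:
        raise ValueError("sample must be longer than the distance factor")
    features = []
    for k in range(1, DF + 1):
        for j in range(u):
            total = sum(M[i][j] * M[i + k][j] for i in range(l - k))
            features.append(total / (l - k))
    return features

# Toy matrix: 12 residues, 2 attributes (hypothetical values)
M = [[1.0, float(i)] for i in range(12)]
features = auto_covariance(M, DF=10)
print(len(features))  # 20, i.e. DF * u
print(features[:2])   # lag-1 features for the two attributes: [1.0, 40.0]
```

Because the feature length depends only on DF and u, samples of different lengths (as occur near the termini) map to vectors of the same size, which is one plausible reason for using this representation in the terminal model.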
Test
To score each residue in the query protein sequence, we extract a sample for each query residue using a window of size 41 (flank size × 2 + 1). Except for residues in the terminal regions, the sample length is 41 amino acids. For a query residue, sample Sj is defined as
where Aj is the query residue in the query sequence, j = 1, 2, ..., L, and L is the length of the query protein sequence. Samples for a query sequence of length L can be interpreted using eq. (4) as:
Schematic illustration of extracting samples to score a query sequence. Aj is the j-th amino acid in the query sequence and L refers to the length of the query protein sequence
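The sample extraction for a query sequence can be sketched as below. The sketch assumes that the window around residue j is simply truncated at the sequence boundaries and that residues within 20 positions of either end are treated as terminal; the function name and the example sequence are hypothetical.

```python
FLANK = 20  # flank size reported in the text; full window is 2 * FLANK + 1 = 41

def extract_sample(sequence, j, flank=FLANK):
    """Return (region, window) for the 1-based query position j.

    Assumption: the window is truncated at the sequence ends, and a residue is
    'terminal' if it lies within `flank` residues of either end.
    """
    L = len(sequence)
    if not 1 <= j <= L:
        raise IndexError("query position out of range")
    start = max(1, j - flank)
    end = min(L, j + flank)
    window = sequence[start - 1:end]  # convert 1-based inclusive to a Python slice
    region = "terminal" if (j <= flank or j > L - flank) else "middle"
    return region, window

query = "ACDEFGHIKLMNPQRSTVWY" * 3  # hypothetical 60-residue sequence
for j in (1, 21, 30, 40, 41, 60):
    region, window = extract_sample(query, j)
    print(j, region, len(window))
```

Under these assumptions every middle residue receives a full 41-residue sample, while terminal residues receive shorter samples.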
Performance measure
We use the performance measures AUC, true positive rate (TPR) and false positive rate (FPR) to evaluate the models in this study, where AUC is defined as the area under the receiver operating characteristic (ROC) curve.
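These measures can be computed from per-residue labels and scores as in the following sketch. The AUC is obtained from its rank interpretation (the probability that a randomly chosen MoRF residue scores higher than a randomly chosen non-MoRF residue), which equals the area under the ROC curve; the function names and the toy data are illustrative assumptions.

```python
def tpr_fpr(labels, scores, threshold):
    """TPR and FPR when residues with score >= threshold are called MoRF (label 1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return tp / pos, fp / neg

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    This quadratic-time version is meant for illustration on small inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical labels and propensity scores)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]
print(tpr_fpr(labels, scores, 0.65))  # TPR = 2/3, FPR = 1/3
print(auc(labels, scores))            # 8/9
```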
Combined model
Combined model. MoRFpred-plus and PROMIS are our own predictors, whereas the MoRFchibi predictor was downloaded and integrated with the proposed model
Results
The performance in this study is reported using the same datasets that were used to analyze MoRF predictors such as MoRFchibi, MoRFpred, MoRFpred-plus, MoRFchibi-web, and OPAL. In this section, we present the model tuning scheme followed by the performance comparison.
Model tuning
Feature selection is crucial for machine learning algorithms, as it reduces computational complexity by reducing the feature dimension and selects the features that best represent the data. In this study, we used a successive feature selection scheme in the forward direction [17] to choose structural attributes for each of the models. When the scheme was evaluated using the structural attributes, the proposed models achieved good performance (AUCs) with attributes from the half-sphere exposure (HSE) α and β groups. HSE is a measure of the solvent exposure of a residue and gives the number of Cα atoms in the upper and lower half-spheres [18]. As more structural attributes were concatenated by the scheme, the performance deteriorated. Therefore, we used the attribute HSEu from the HSE α group to extract features for the proposed models.
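A forward selection loop of this kind can be sketched as follows. This is a generic greedy procedure and not necessarily the exact scheme of [17]; the function names are hypothetical, and the scores in the toy evaluator are invented solely to exercise the loop (they are not results from this study).

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection.

    attributes : list of attribute names to choose from.
    evaluate   : callable taking a list of names and returning a score
                 (for example, a validation AUC); higher is better.
    Adds one attribute at a time and stops when no addition improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(attributes)
    while remaining:
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, attr = max(scored)
        if score <= best_score:
            break  # adding any further attribute does not help
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score

# Toy evaluator with INVENTED scores: the best single attribute wins and each
# extra attribute costs a small penalty, mimicking "more attributes, worse AUC".
toy_auc = {"HSEu": 0.75, "ASA": 0.68, "SS": 0.70}

def toy_evaluate(subset):
    return max(toy_auc[a] for a in subset) - 0.02 * (len(subset) - 1)

print(forward_selection(["HSEu", "ASA", "SS"], toy_evaluate))  # (['HSEu'], 0.75)
```

In practice, `evaluate` would train the model on the features derived from the chosen attributes and return its AUC on held-out data.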
AUCs for the proposed model with varying window flank size values to process the output scores
Performance comparison
AUCs using the test sets
Predictors/models | TEST | TEST464 | TEST266 | EXP53 ALL | EXP53 LONG | EXP53 SHORT |
---|---|---|---|---|---|---|
ANCHOR | 0.6 | 0.605 | 0.599 | 0.615 | 0.586 | 0.683 |
MoRFpred | 0.673 | 0.675 | 0.651 | 0.62 | 0.598 | 0.673 |
MoRFchibi | 0.74 | 0.743 | 0.709 | 0.712 | 0.679 | 0.79 |
MoRFpred-plus | 0.755 | 0.724 | 0.740 | 0.712 | 0.67 | 0.821 |
MoRFchibi-light | 0.775 | 0.777 | 0.762 | 0.799 | 0.77 | 0.869 |
PROMIS | 0.791 | 0.788 | 0.770 | 0.818 | 0.815 | 0.823 |
MoRFchibi-web | 0.8 | 0.805 | 0.785 | 0.797 | 0.758 | 0.886 |
OPAL | 0.815 | 0.816 | 0.795 | 0.836 | 0.823 | 0.870 |
Proposed Model | 0.760 | 0.757 | 0.729 | 0.787 | 0.754 | 0.864 |
Combined Model | 0.819 | 0.818 | 0.797 | 0.838 | 0.819 | 0.881 |
We further evaluated the performance of the proposed model against the benchmark OPAL predictor. For comparison, we plotted the propensity scores of proteins P15337, P26645, P02686, P42768 and Q99967 from the EXP53 set. Additional file 1: Figures S1 to S5 show the propensity scores for each of these proteins. In particular, we observe that where OPAL performs poorly, the proposed model raises the scores of the verified MoRF regions. The analysis also showed that for some non-MoRF residues, the propensity scores of the proposed model are lower than those of OPAL.
In detail, comparing the combined model of the proposed method with MoRFchibi-web and OPAL, we obtained absolute improvements in AUC of 1.9% and 0.4% on the TEST set, 1.3% and 0.2% on the TEST464 set, 1.2% and 0.2% on the TEST266 set, and 4.1% and 0.2% on the EXP53 ALL set, respectively. Furthermore, we observe that OPAL performed better in predicting long MoRFs, whereas MoRFchibi-web performed well in scoring short MoRFs. For short MoRFs (EXP53 SHORT), the proposed method improved the AUC by 1.1% compared with OPAL.
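The quoted improvements are absolute AUC differences between the Combined Model row and the MoRFchibi-web and OPAL rows of the AUC table, expressed in percent. A short sketch that reproduces the arithmetic from the tabulated values (the dictionary layout and function name are illustrative):

```python
# AUC values copied from the table of AUCs on the test sets
auc_table = {
    "MoRFchibi-web":  {"TEST": 0.800, "TEST464": 0.805, "TEST266": 0.785,
                       "EXP53 ALL": 0.797, "EXP53 SHORT": 0.886},
    "OPAL":           {"TEST": 0.815, "TEST464": 0.816, "TEST266": 0.795,
                       "EXP53 ALL": 0.836, "EXP53 SHORT": 0.870},
    "Combined Model": {"TEST": 0.819, "TEST464": 0.818, "TEST266": 0.797,
                       "EXP53 ALL": 0.838, "EXP53 SHORT": 0.881},
}

def gain(dataset, baseline):
    """Absolute AUC gain of the combined model over a baseline, in percent."""
    diff = auc_table["Combined Model"][dataset] - auc_table[baseline][dataset]
    return round(diff * 100, 1)

for dataset in ("TEST", "TEST464", "TEST266", "EXP53 ALL"):
    print(dataset, gain(dataset, "MoRFchibi-web"), gain(dataset, "OPAL"))
print("EXP53 SHORT vs OPAL:", gain("EXP53 SHORT", "OPAL"))
```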
Discussion
Percentage of MoRFs present in terminal and middle regions
The sequences in the TRAIN set contain MoRFs of variable size, from 5 to 25 residues, with a single MoRF per sequence. This raises the issue of imbalanced data, as the number of non-MoRF residues is much larger than the number of MoRF residues. To overcome this issue, during the training step we selected positive samples from MoRFs and extracted the same number of negative samples from non-MoRFs.
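The degree of imbalance follows directly from the TRAIN set counts in the dataset table, and the balancing step can be illustrated with a simple sketch. The random undersampling shown here is a simplification chosen for illustration (the method described above draws negative segments matched to the positive segments), and the function name and sample identifiers are hypothetical.

```python
import random

# Residue counts of the TRAIN set, taken from the dataset table
morf_residues = 5396
total_residues = 245984
print(f"MoRF fraction in TRAIN: {morf_residues / total_residues:.2%}")  # about 2.19%

def balanced_sample(positives, negatives, seed=0):
    """Return equally sized lists of positive and negative samples.

    Simplified illustration: the larger class is randomly undersampled.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k), rng.sample(negatives, k)

# Hypothetical sample identifiers
positives = [f"pos_{i}" for i in range(5)]
negatives = [f"neg_{i}" for i in range(200)]
pos_sel, neg_sel = balanced_sample(positives, negatives)
print(len(pos_sel), len(neg_sel))  # 5 5
```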
Percentage of MoRFs per respective length for the TRAIN, TEST464 and EXP53 sets
- (1) use of different sources of information about disordered regions, such as structural attributes, evolutionary profiles, and physicochemical attributes.
- (2) use of different learning algorithms, obtained by combining the scores of the proposed model with the scores of MoRFpred-plus, PROMIS and MoRFchibi.
- (3) selection of an equal number of positive and negative training samples from the imbalanced MoRF and non-MoRF regions.
- (4) processing of the output scores, which provides extra information on whether the neighboring residues have scores high enough to form a MoRF region.
FPR for a given TPR value for the combined model and OPAL using EXP53 SHORT
TPR | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|---|---|---|
OPAL | 0.0113 | 0.0158 | 0.0414 | 0.0691 | 0.0902 | 0.1144 | 0.216 | 0.334 |
Combined model | 0.0118 | 0.0175 | 0.0323 | 0.0593 | 0.0889 | 0.1150 | 0.1852 | 0.2913 |
Conclusion
In this study, disordered protein sequences are trisected into terminal and middle regions for MoRF prediction. By incorporating structural, evolutionary and physicochemical information about disordered proteins, the method achieves performance comparable to that of the state-of-the-art MoRF predictors. Thus, the proposed method can be used as an alternative approach for MoRF prediction.
Notes
Funding
The publication charge for this article was funded by RIKEN, Center for Integrative Medical Sciences, Japan and CREST, JST, Yokohama 230–0045, Japan.
Availability of data and materials
The data and materials are available at https://github.com/roneshsharma/BMC_Models2018/wiki.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 13, 2018: 17th International Conference on Bioinformatics (InCoB 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-13.
Authors’ contributions
RS performed the analysis and wrote the manuscript under the guidance of AS and AP. TT provided computational resources. AS helped in method development. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
- 1. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208.
- 2. Lee RVD, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones DT, et al. Classification of intrinsically disordered regions and proteins. Chem Rev. 2014;114:6589–631.
- 3. Uversky V. Introduction to intrinsically disordered proteins (IDPs). Chem Rev. 2014;114:6557–60.
- 4. Wright PE, Dyson HJ. Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol. 2015;16:18–29.
- 5. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Cortese MS, Uversky VN, Dunker AK. Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res. 2007;6(6):2351–66.
- 6. Mohan A, Oldfield CJ, Radivojac P, Vacic V, Cortese MS, Dunker AK, Uversky VN. Analysis of molecular recognition features (MoRFs). J Mol Biol. 2006;362(5):1043–59.
- 7. Liu J, Perumal NB, Oldfield CJ, Su EW, Uversky VN, Dunker AK. Intrinsic disorder in transcription factors. Biochemistry. 2006;45(22):6873–88.
- 8. Disfani FM, Hsu WL, Mizianty MJ, Oldfield CJ, Xue B, Dunker AK, Uversky VN, Kurgan L. MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics. 2012;28:i75–83.
- 9. Dosztányi Z, Mészáros B, Simon I. ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics. 2009;25(20):2745–6.
- 10. Gypas F, Tsaousis GN, Hamodrakas SJ. mpMoRFsDB: a database of molecular recognition features in membrane proteins. Bioinformatics. 2013;29(19):2517–8.
- 11. Malhis N, Gsponer J. Computational identification of MoRFs in protein sequences. Bioinformatics. 2015;31(11):1738–44.
- 12. Malhis N, Jacobson M, Gsponer J. MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res. 2016;44(Web Server issue):W488–93.
- 13. Malhis N, Wong ETC, Nassar R, Gsponer J. Computational identification of MoRFs in protein sequences using hierarchical application of Bayes rule. PLoS One. 2015;10(10):e0141603.
- 14. Sharma R, Bayarjargal M, Tsunoda T, Patil A, Sharma A. MoRFPred-plus: computational identification of MoRFs in protein sequences using physicochemical properties and HMM profiles. J Theor Biol. 2018;437(Supplement C):9–16.
- 15. Sharma R, Raicar G, Tsunoda T, Patil A, Sharma A. OPAL: prediction of MoRF regions in intrinsically disordered protein sequences. Bioinformatics. 2018;34(11):1850–8.
- 16. Yang Y, Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, Sattar A, Zhou Y. SPIDER2: a package to predict secondary structure, accessible surface area and main-chain torsional angles by deep neural networks. Methods Mol Biol. 2017;1484:55–63.
- 17. Sharma A, Paliwal KK, Dehzangi A, Lyons J, Imoto S, Miyano S. A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition. BMC Bioinformatics. 2013;14(233):1–11.
- 18. Hamelryck T. An amino acid has two sides: a new 2D measure provides a different view of solvent exposure. Proteins. 2005;59(1):38–48.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.