Abstract
Characterizing and classifying regularities in protein structure is an important element in uncovering the mechanisms that regulate protein structure, function and evolution. Recent research concentrates on the analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, the accuracy of protein secondary structure prediction based on multiple sequence alignment drops significantly when low-homology (twilight zone) sequences are considered. To this end, this paper addresses the problem of providing an alternative sequence representation that would improve the ability to distinguish secondary structures for twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix and coil fragments. Classification of SFs allows us to design novel sequence representations and to investigate which other factors and prediction algorithms may result in improved discrimination. Comprehensive experimental results show that a statistically significant improvement in classification accuracy can be achieved by (1) improving sequence representations and (2) removing possible noise on the terminal residues of the SFs. Combining these two approaches reduces the error rate on average by 15% when compared to classification using the standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms.
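To make the notion of a structural fragment concrete, the following is a minimal sketch of how uniform helix, strand and coil fragments might be cut from a per-residue secondary structure assignment, with optional trimming of the terminal residues. It is an illustration only, not the authors' procedure: the three-letter state alphabet (`H`, `E`, `C`), the function name `structural_fragments`, and the `trim` and `min_len` parameters are all assumptions, and the abstract does not say how many terminal residues are removed.

```python
from itertools import groupby


def structural_fragments(sequence, states, trim=0, min_len=1):
    """Split a protein into uniform secondary-structure fragments.

    sequence -- amino-acid string (one letter per residue)
    states   -- per-residue state string of the same length; a reduced
                three-state alphabet H/E/C is ASSUMED here
    trim     -- hypothetical number of residues dropped from each end of
                every fragment, standing in for "noisy" terminal residues
    min_len  -- hypothetical minimum fragment length kept after trimming
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")

    fragments = []
    start = 0
    # groupby yields one run per maximal stretch of identical states,
    # which is what "uniform" fragment is taken to mean in this sketch.
    for state, run in groupby(states):
        end = start + len(list(run))
        lo, hi = start + trim, end - trim
        if hi - lo >= min_len:
            fragments.append((state, sequence[lo:hi]))
        start = end
    return fragments


if __name__ == "__main__":
    seq = "MKTAYIAKQR"      # made-up sequence, for illustration only
    sst = "CCHHHHEEEC"      # made-up state assignment
    print(structural_fragments(seq, sst))
    print(structural_fragments(seq, sst, trim=1))
```

Each returned pair (state label, residue string) would then be one labelled example for a classifier; how the residue string is encoded into features is exactly the sequence representation question the paper studies and is not covered by this sketch.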
Abbreviations
- SF: structural fragment
- DSSP: dictionary of secondary structures of proteins
- PDB: protein data bank
- AA: amino acid
- MLP: multiple layer perceptron neural network
- RIP: RIPPER
- SLI: SLIPPER
- NB: Naïve Bayes
Acknowledgments
The authors would like to thank Dr. Ruan for fruitful comments and discussions. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Cite this article
Kurgan, L., Kedarisetti, K.D. Sequence Representation and Prediction of Protein Secondary Structure for Structural Motifs in Twilight Zone Proteins. Protein J 25, 463–474 (2006). https://doi.org/10.1007/s10930-006-9029-0