Sequence Representation and Prediction of Protein Secondary Structure for Structural Motifs in Twilight Zone Proteins

The Protein Journal

Abstract

Characterizing and classifying regularities in protein structure is an important step toward uncovering the mechanisms that govern protein structure, function, and evolution. Recent research concentrates on the analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, the accuracy of protein secondary structure prediction based on multiple sequence alignment drops significantly when low-homology (twilight zone) sequences are considered. This paper therefore addresses the problem of providing an alternative sequence representation that improves the ability to distinguish secondary structures for twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix, and coil fragments. Classifying SFs allows us to design novel sequence representations and to investigate which other factors and prediction algorithms may improve discrimination. Comprehensive experimental results show that a statistically significant improvement in classification accuracy can be achieved by (1) improving the sequence representation and (2) removing possible noise at the terminal residues of the SFs. Combining these two approaches reduces the error rate by 15% on average compared with classification that uses the standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms.
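To make the notion of a structural fragment concrete, the following Python sketch splits a sequence into uniform helix, strand, and coil runs and optionally trims the terminal residues of each run. It is an illustration only, not the authors' implementation: the three-state reduction of DSSP codes is one common convention, the function names (`to_three_state`, `structural_fragments`, `composition`) and the `trim` and `min_len` parameters are hypothetical, and the amino acid composition vector stands in for a simple baseline representation rather than the representation proposed in the paper.

```python
from collections import Counter
from itertools import groupby

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed reduction of DSSP's eight states to helix (H), strand (E), coil (C).
# Other reductions exist; this mapping is an illustrative choice.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_state(dssp):
    """Map every DSSP code to H, E, or C (anything unlisted becomes coil)."""
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp)


def structural_fragments(sequence, dssp, trim=0, min_len=1):
    """Return (fragment, label) pairs, one per uniform secondary structure run.

    trim    -- number of residues dropped from each end of a run
               (a hypothetical stand-in for removing noisy terminal residues)
    min_len -- runs shorter than this after trimming are skipped
    """
    if len(sequence) != len(dssp):
        raise ValueError("sequence and DSSP string must have equal length")
    fragments = []
    pos = 0
    for label, run in groupby(to_three_state(dssp)):
        length = len(list(run))
        start, end = pos + trim, pos + length - trim
        pos += length
        if end - start >= min_len:
            fragments.append((sequence[start:end], label))
    return fragments


def composition(fragment):
    """Baseline representation: relative frequency of each of the 20 AAs."""
    if not fragment:
        raise ValueError("fragment must not be empty")
    counts = Counter(fragment)
    return [counts[aa] / len(fragment) for aa in AMINO_ACIDS]
```

Calling `structural_fragments(seq, dssp, trim=1)` yields the trimmed fragments, and `composition` turns each one into a fixed-length feature vector that any of the classifiers named in the abstract could consume. Whether trimming a single residue per end matches the noise removal evaluated in the paper is not stated in the abstract, so treat `trim` as a tunable assumption.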

Abbreviations

SF: structural fragment

DSSP: dictionary of secondary structures of proteins

PDB: protein data bank

AA: amino acid

MLP: multiple layer perceptron neural network

RIP: RIPPER

SLI: SLIPPER

NB: Naïve Bayes

Acknowledgments

The authors would like to thank Dr. Ruan for fruitful comments and discussions. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Corresponding author

Correspondence to Lukasz Kurgan.

About this article

Cite this article

Kurgan, L., Kedarisetti, K.D. Sequence Representation and Prediction of Protein Secondary Structure for Structural Motifs in Twilight Zone Proteins. Protein J 25, 463–474 (2006). https://doi.org/10.1007/s10930-006-9029-0

  • DOI: https://doi.org/10.1007/s10930-006-9029-0
