Sequence Representation and Prediction of Protein Secondary Structure for Structural Motifs in Twilight Zone Proteins

Kurgan, Lukasz; Kedarisetti, Kanaka Durga

doi:10.1007/s10930-006-9029-0

Published: 11 November 2006

Volume 25, pages 463–474, (2006)

The Protein Journal

Abstract

Characterizing and classifying regularities in protein structure is an important element in uncovering the mechanisms that regulate protein structure, function and evolution. Recent research concentrates on the analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, the accuracy of protein secondary structure prediction based on multiple sequence alignment drops significantly when low-homology (twilight zone) sequences are considered. To this end, this paper addresses the problem of providing an alternative sequence representation that would improve the ability to distinguish secondary structures for twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix and coil fragments. Classification of SFs makes it possible to design novel sequence representations and to investigate which other factors and prediction algorithms may result in improved discrimination. Comprehensive experimental results show that a statistically significant improvement in classification accuracy can be achieved by (1) improving sequence representations and (2) removing possible noise on the terminal residues in the SFs. Combining these two approaches reduces the error rate by 15% on average when compared to classification using the standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms.
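
To make the notion of a structural fragment concrete, the following minimal Python sketch splits a sequence into runs of a single secondary-structure state and optionally drops terminal residues from each run. It is an illustration only, not the authors' procedure: the three-letter state alphabet (H, E, C), the `trim` and `min_len` parameters, the function names, and the use of a plain amino-acid composition vector as a sequence representation are all assumptions made for this example.

```python
from itertools import groupby

# The 20 standard amino acids in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def structural_fragments(sequence, states, trim=0, min_len=1):
    """Split a sequence into fragments with a uniform secondary-structure state.

    sequence -- amino-acid string
    states   -- string of the same length, one state letter per residue
                (assumed here to be H = helix, E = strand, C = coil)
    trim     -- hypothetical knob: residues dropped from each end of a fragment
    min_len  -- fragments shorter than this after trimming are discarded
    Returns a list of (state, fragment_sequence) pairs.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    fragments = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))
        end = start + length
        # Keep the fragment only if something is left after trimming both ends.
        if length - 2 * trim >= max(min_len, 1):
            fragments.append((state, sequence[start + trim:end - trim]))
        start = end
    return fragments


def composition(fragment):
    """Return the 20-element amino-acid composition vector of a fragment.

    This is one simple alignment-free representation, assumed for
    illustration; the abstract does not specify the representations used.
    """
    if not fragment:
        return [0.0] * len(AMINO_ACIDS)
    return [fragment.count(aa) / len(fragment) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    seq = "MKTAYIAKQRG"
    sst = "CCHHHHHEEEC"
    print(structural_fragments(seq, sst))
    print(structural_fragments(seq, sst, trim=1))
```

In this sketch each (state, fragment) pair would become one labelled example for a classifier, with the composition vector as its feature vector; trimming mimics the idea of discarding possibly noisy terminal residues.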

Abbreviations

SF: structural fragment
DSSP: dictionary of secondary structures of proteins
PDB: protein data bank
AA: amino acid
MLP: multiple layer perceptron neural network
RIP: RIPPER
SLI: SLIPPER
NB: Naïve Bayes

Acknowledgments

The authors would like to thank Dr. Ruan for fruitful comments and discussions. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Electrical and Computer Engineering Department, University of Alberta, Edmonton, Alberta, Canada, T6G 2V4
Lukasz Kurgan & Kanaka Durga Kedarisetti

Authors

Lukasz Kurgan
Kanaka Durga Kedarisetti

Corresponding author

Correspondence to Lukasz Kurgan.

About this article

Cite this article

Kurgan, L., Kedarisetti, K.D. Sequence Representation and Prediction of Protein Secondary Structure for Structural Motifs in Twilight Zone Proteins. Protein J 25, 463–474 (2006). https://doi.org/10.1007/s10930-006-9029-0

Published: 11 November 2006
Issue Date: December 2006
DOI: https://doi.org/10.1007/s10930-006-9029-0
