Abstract
A novel method for predicting the secondary structures of proteins from amino acid sequences is presented. Protein secondary structure seqlets, which are analogous to words in natural language, are extracted. These seqlets capture the relationship between amino acid sequence and protein secondary structure, and together they form a protein secondary structure dictionary. The dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice represents the candidate word segmentations, and a maximum entropy model calculates the probability that a seqlet is tagged as a given secondary structure type. The method is Markovian in the seqlets, which permits efficient exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentation and its tags are taken as the predicted secondary structure. The method is applied to predict the secondary structures of proteins from four organisms and is compared with the PHD method. The results show that it exceeds PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining the method with locally similar protein sequences obtained by BLAST gives better prediction. The method is also tested on the 50 CASP5 target proteins, with a Q3 accuracy of 78.9% and an SOV accuracy of 77.1%. A web server for protein secondary structure prediction has been constructed and is available at http://www.insun.hit.edu.cn:81/demos/biology/index.html.
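The segmentation-and-tagging formulation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `viterbi_segment`, the toy dictionary `seqlet_probs`, the transition table `trans_probs`, and the fallback value `default_trans` are all hypothetical, and the fixed probabilities stand in for the outputs of a trained maximum entropy model. The sketch runs Viterbi over a lattice of candidate seqlets and returns the highest-scoring sequence of (seqlet, tag) pairs.

```python
import math


def viterbi_segment(sequence, seqlet_probs, trans_probs, max_len=4,
                    default_trans=1e-6):
    """Best (seqlet, tag) segmentation of `sequence`, or None if none exists.

    seqlet_probs: seqlet -> {tag: probability > 0}; a stand-in for the
                  maximum entropy model's tag probabilities.
    trans_probs:  (previous tag, tag) -> probability; pairs that are
                  missing fall back to `default_trans`.
    """
    n = len(sequence)
    # best[i][tag] = (log score, (start, previous tag)) for the best path
    # whose last seqlet ends at position i and carries `tag`.
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            dist = seqlet_probs.get(sequence[start:end])
            if dist is None or not best[start]:
                continue  # not a dictionary seqlet, or start is unreachable
            for tag, p in dist.items():
                for prev_tag, (prev_score, _) in best[start].items():
                    t = trans_probs.get((prev_tag, tag), default_trans)
                    score = prev_score + math.log(p) + math.log(t)
                    if tag not in best[end] or score > best[end][tag][0]:
                        best[end][tag] = (score, (start, prev_tag))

    if not best[n]:
        return None  # the dictionary cannot cover the whole sequence

    # Trace the back-pointers from the best final tag to the start.
    tag = max(best[n], key=lambda k: best[n][k][0])
    pos, path = n, []
    while pos > 0:
        _, (start, prev_tag) = best[pos][tag]
        path.append((sequence[start:pos], tag))
        pos, tag = start, prev_tag
    path.reverse()
    return path


# Toy dictionary: two multi-residue seqlets plus single-residue fallbacks.
toy_probs = {
    "AEA": {"H": 0.9}, "KV": {"E": 0.8},
    "A": {"C": 1.0}, "E": {"C": 1.0}, "K": {"C": 1.0}, "V": {"C": 1.0},
}
segmentation = viterbi_segment("AEAKV", toy_probs, trans_probs={})
print(segmentation)  # [('AEA', 'H'), ('KV', 'E')]
```

Because every seqlet boundary pays a transition cost, longer dictionary seqlets are preferred over runs of single residues whenever they cover the sequence. A per-residue prediction follows by repeating each tag over the length of its seqlet.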
Cite this article
Dong, Q., Wang, X., Lin, L. et al. A seqlet-based maximum entropy Markov approach for protein secondary structure prediction. Sci. China Ser. C.-Life Sci. 48, 394–405 (2005). https://doi.org/10.1360/062004-53
DOI: https://doi.org/10.1360/062004-53