Predicting Protein Secondary Structure Using Stochastic Tree Grammars

Abe, Naoki; Mamitsuka, Hiroshi

doi:10.1023/A:1007477814995

Published: November 1997

Machine Learning, volume 29, pages 275–301 (1997)

Abstract

We propose a new method for predicting the secondary structure of a protein from its amino acid sequence, based on a training algorithm for the probability parameters of a stochastic tree grammar. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, which we call Stochastic Ranked Node Rewriting Grammars. These grammars are powerful enough to capture the type of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. The training algorithm we use is an extension of the ‘inside-outside’ algorithm for stochastic context-free grammars, but with a number of significant modifications. We applied our method to real data obtained from the HSSP database (Homology-derived Secondary Structure of Proteins, Ver. 1.0), and the results were encouraging: our method was able to predict roughly 75 percent of the β-strands correctly in a systematic evaluation experiment in which the test sequences not only have less than 25 percent identity to the training sequences, but are totally unrelated to them. This figure compares favorably to the predictive accuracy of state-of-the-art prediction methods in the field, even though our experiment was on a restricted type of β-sheet structure and the test was done on a relatively small data set. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with conventional methods for secondary structure prediction. Extended abstracts of parts of the work presented in this paper have appeared in (Abe & Mamitsuka, 1994) and (Mamitsuka & Abe, 1994).
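
For readers unfamiliar with the ‘inside-outside’ algorithm that the training procedure extends, the following is a minimal Python sketch of the inside pass alone, for an ordinary stochastic context-free grammar in Chomsky normal form. It is not the authors' algorithm for Stochastic Ranked Node Rewriting Grammars; the function name `inside_probability`, the rule dictionaries, and the one-symbol toy grammar are all assumptions made for illustration.

```python
from collections import defaultdict


def inside_probability(sequence, unary_rules, binary_rules, start="S"):
    """Return P(sequence | grammar) for an SCFG in Chomsky normal form.

    unary_rules:  {(A, terminal): prob} for rules A -> terminal
    binary_rules: {(A, B, C): prob}     for rules A -> B C
    (Both dictionaries are a hypothetical encoding chosen for this sketch.)
    """
    n = len(sequence)
    if n == 0:
        return 0.0

    # inside[i][j][A] = probability that A derives sequence[i..j] (inclusive)
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]

    # Base case: spans of length 1 are produced by terminal rules.
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), prob in unary_rules.items():
            if terminal == symbol:
                inside[i][i][lhs] += prob

    # Longer spans: sum over every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                left, right = inside[i][k], inside[k + 1][j]
                for (lhs, b, c), prob in binary_rules.items():
                    if b in left and c in right:
                        inside[i][j][lhs] += prob * left[b] * right[c]

    return inside[0][n - 1].get(start, 0.0)


# Toy grammar (illustrative only): S -> S S with 0.4, S -> 'a' with 0.6.
UNARY = {("S", "a"): 0.6}
BINARY = {("S", "S", "S"): 0.4}

if __name__ == "__main__":
    print(inside_probability("aaa", UNARY, BINARY))
```

The inside pass sums the probabilities of all parse trees of a sequence; the outside pass and the re-estimation step that complete the training loop are omitted here. The cubic-time dynamic program above applies to context-free grammars only, and the tree grammars described in the abstract need a more general treatment.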

References

Abe, N. (1988). Feasible learnability of formal grammars and the theory of natural language acquisition. Proceedings of the 12th International Conference on Computational Linguistics (pp. 1–6). Budapest, Hungary.
Abe, N., & Mamitsuka, H. (1994). A new method for predicting protein secondary structures based on stochastic tree grammars. Proceedings of the Eleventh International Conference on Machine Learning (pp. 3–11). New Brunswick, NJ. Morgan Kaufmann.
Baldi, P., Chauvin, Y., Hunkapiller, T., & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences, 91, 1059–1063.
Barton, G. J. (1995). Protein secondary structure prediction: Review article. Current Opinion in Structural Biology, 5, 372–376.
Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
Cost, S., & Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, 57–78.
Doolittle, R. F., Feng, D. F., Johnson, M. S., & McClure, M. A. (1986). Relationships of human protein sequences to those of other organisms. The Cold Spring Harbor Symposium on Quantitative Biology, 51, 447–455.
Eddy, S. R., & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Research, 22, 2079–2088.
Fauchere, J., & Pliska, V. (1983). Hydrophobic parameters of amino acid side chains from the partitioning of N-acetyl-amino acid amides. European Journal of Medicinal Chemistry: Chimica Therapeutica, 18, 369–375.
Hobohm, U., Scharf, M., Schneider, R., & Sander, C. (1992). Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science, 1, 409–417.
Jelinek, F., Lafferty, J., & Mercer, R. (1990). Basic methods of probabilistic context free grammars. IBM Research Report RC16374 (#72684). Yorktown Heights, NY: IBM, Thomas J. Watson Research Center.
Joshi, A. K., Levy, L., & Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10, 136–163.
Kneller, D., Cohen, F., & Langridge, R. (1990). Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214, 171–182.
Krogh, A., Brown, M., Mian, I. S., Sjölander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501–1531.
Levinson, S. E., Rabiner, L. R., & Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62, 1035–1074.
Mamitsuka, H., & Abe, N. (1994). Predicting location and structure of beta-sheet regions using stochastic tree grammars. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (pp. 276–284). Palo Alto, CA: The AAAI Press.
Mamitsuka, H., & Yamanishi, K. (1995). α-Helix region prediction with stochastic rule learning. Computer Applications in the Biosciences, 11, 399–411.
May, A. C. W., & Blundell, T. L. (1994). Automated comparative modelling of protein structures. Current Opinion in Biotechnology, 5, 355–360.
Muggleton, S., King, R. D., & Sternberg, M. J. E. (1992). Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5, 647–657.
Paz, A. (1971). Introduction to probabilistic automata. New York: Academic Press.
Qian, N., & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.
Riis, S., & Krogh, A. (1996). Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. Journal of Computational Biology, 3, 163–183.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14, 1080–1100.
Rost, B., & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584–599.
Rounds, W. C. (1969). Context-free grammars on trees. Conference Record of ACM Symposium on Theory of Computing (pp. 143–148). Marina del Rey, CA.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjölander, K., Underwood, R. C., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22, 5112–5120.
Sander, C., & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56–68.
Schabes, Y. (1992). Stochastic lexicalized tree adjoining grammars. Proceedings of the 14th International Conference on Computational Linguistics (pp. 426–432). Nantes, France.
Searls, D. B. (1993). The computational linguistics of biological sequences. In L. Hunter (Ed.), Artificial intelligence and molecular biology, Menlo Park, CA: AAAI Press.
Stolcke, A., & Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian model merging. Proceedings of the Second International Colloquium on Grammatical Inference and Applications (pp. 106–118). Alicante, Spain: Springer-Verlag.
Taylor, W. R. (1986). The classification of amino acid conservation. Journal of Theoretical Biology, 119, 205–218.
Vijay-Shanker, K., & Joshi, A. K. (1985). Some computational properties of tree adjoining grammars. Proceedings of 23rd Meeting of the Association for Computational Linguistics (pp. 82–93). Chicago, IL.
Wodak, S. J., & Rooman, M. J. (1993). Generating and testing protein folds. Current Opinion in Structural Biology, 3, 247–259.

Author information

Authors and Affiliations

Theory NEC Laboratory, Real World Computing Partnership C & C Media Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki, 216, Japan
Naoki Abe & Hiroshi Mamitsuka

About this article

Cite this article

Abe, N., Mamitsuka, H. Predicting Protein Secondary Structure Using Stochastic Tree Grammars. Machine Learning 29, 275–301 (1997). https://doi.org/10.1023/A:1007477814995

Issue Date: November 1997
DOI: https://doi.org/10.1023/A:1007477814995
