Abstract
We propose a new method for predicting the secondary structure of a protein from its amino acid sequence, based on a training algorithm for the probability parameters of a stochastic tree grammar. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, which we call Stochastic Ranked Node Rewriting Grammars. These grammars are powerful enough to capture the types of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. The training algorithm we use is an extension of the ‘inside-outside’ algorithm for stochastic context-free grammars, but with a number of significant modifications. We applied our method to real data obtained from the HSSP database (Homology-derived Secondary Structure of Proteins, Ver 1.0), and the results were encouraging: our method was able to predict roughly 75 percent of the β-strands correctly in a systematic evaluation experiment in which the test sequences not only have less than 25 percent identity to the training sequences but are totally unrelated to them. This figure compares favorably with the predictive accuracy of state-of-the-art prediction methods in the field, even though our experiment was on a restricted type of β-sheet structure and the test was done on a relatively small data set. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with conventional methods for secondary structure prediction. Extended abstracts of parts of the work presented in this paper have appeared in (Abe & Mamitsuka, 1994) and (Mamitsuka & Abe, 1994).
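For readers unfamiliar with the ‘inside-outside’ algorithm mentioned above, the following is a minimal sketch of its inside pass for an ordinary stochastic context-free grammar. It is not the authors' training algorithm for Stochastic Ranked Node Rewriting Grammars, which the abstract describes as an extension with significant modifications. The toy grammar, its probabilities, and the names `BINARY`, `LEXICAL`, and `inside_probability` are assumptions made for illustration; the grammar generates even-length palindromes over {a, b}, a simple stand-in for nested, anti-parallel-like pairings between sequence positions.

```python
from collections import defaultdict

# Hypothetical toy SCFG in Chomsky normal form (illustration only).
# It encodes S -> a S a | b S b | a a | b b, i.e. nested pairings.
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "A", "Xa"): 0.3,
    ("S", "B", "Xb"): 0.3,
    ("S", "A", "A"): 0.2,
    ("S", "B", "B"): 0.2,
    ("Xa", "S", "A"): 1.0,
    ("Xb", "S", "B"): 1.0,
}
LEXICAL = {("A", "a"): 1.0, ("B", "b"): 1.0}  # (nonterminal, terminal) -> prob


def inside_probability(seq, start="S"):
    """Return the total probability that `start` derives `seq`."""
    n = len(seq)
    if n == 0:
        return 0.0
    # inside[(i, j, X)] = probability that nonterminal X derives seq[i:j]
    inside = defaultdict(float)
    # Base case: spans of length 1 come from lexical rules.
    for i, symbol in enumerate(seq):
        for (nonterminal, terminal), prob in LEXICAL.items():
            if terminal == symbol:
                inside[(i, i + 1, nonterminal)] += prob
    # Recursive case: combine two adjacent shorter spans at split point k.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), prob in BINARY.items():
                    left_prob = inside.get((i, k, left), 0.0)
                    right_prob = inside.get((k, j, right), 0.0)
                    if left_prob and right_prob:
                        inside[(i, j, parent)] += prob * left_prob * right_prob
    return inside.get((0, n, start), 0.0)


if __name__ == "__main__":
    for s in ["abba", "aa", "abab"]:
        print(s, inside_probability(s))
```

Under this toy grammar, `abba` has a single derivation with probability 0.3 × 0.2 = 0.06, while `abab` has none and receives probability zero. A context-free grammar can express nested pairings like these but not crossing ones, which is consistent with the abstract's motivation for a more powerful grammar family that also covers ‘parallel’ dependencies.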
References
Abe, N. (1988). Feasible learnability of formal grammars and the theory of natural language acquisition. Proceedings of the 12th International Conference on Computational Linguistics (pp. 1–6). Budapest, Hungary.
Abe, N., & Mamitsuka, H. (1994). A new method for predicting protein secondary structures based on stochastic tree grammars. Proceedings of the Eleventh International Conference on Machine Learning (pp. 3–11). New Brunswick, NJ. Morgan Kaufmann.
Baldi, P., Chauvin, Y., Hunkapiller, T., & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences, 91, 1059–1063.
Barton, G. J. (1995). Protein secondary structure prediction: Review article. Current Opinion in Structural Biology, 5, 372–376.
Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
Cost, S., & Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, 57–78.
Doolittle, R. F., Feng, D. F., Johnson, M. S., & McClure, M. A. (1986). Relationships of human protein sequences to those of other organisms. The Cold Spring Harbor Symposium on Quantitative Biology, 51, 447–455.
Eddy, S. R., & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Research, 22, 2079–2088.
Fauchere, J., & Pliska, V. (1983). Hydrophobic parameters of amino acid side chains from the partitioning of N-acetyl-amino acid amides. European Journal of Medicinal Chemistry: Chimica Therapeutica, 18, 369–375.
Hobohm, U., Scharf, M., Schneider, R., & Sander, C. (1992). Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science, 1, 409–417.
Jelinek, F., Lafferty, J., & Mercer, R. (1990). Basic methods of probabilistic context free grammars. IBM Research Report RC16374 (#72684). Yorktown Heights, NY: IBM, Thomas J. Watson Research Center.
Joshi, A. K., Levy, L., & Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10, 136–163.
Kneller, D., Cohen, F., & Langridge, R. (1990). Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214, 171–182.
Krogh, A., Brown, M., Mian, I. S., Sjölander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501–1531.
Levinson, S. E., Rabiner, L. R., & Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62, 1035–1074.
Mamitsuka, H., & Abe, N. (1994). Predicting location and structure of beta-sheet regions using stochastic tree grammars. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (pp. 276–284). Palo Alto, CA: The AAAI Press.
Mamitsuka, H., & Yamanishi, K. (1995). α-Helix region prediction with stochastic rule learning. Computer Applications in the Biosciences, 11, 399–411.
May, A. C. W., & Blundell, T. L. (1994). Automated comparative modelling of protein structures. Current Opinion in Biotechnology, 5, 355–360.
Muggleton, S., King, R. D., & Sternberg, M. J. E. (1992). Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5, 647–657.
Paz, A. (1971). Introduction to probabilistic automata. New York: Academic Press.
Qian, N., & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.
Riis, S., & Krogh, A. (1996). Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. Journal of Computational Biology, 3, 163–183.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14, 1080–1100.
Rost, B., & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584–599.
Rounds, W. C. (1969). Context-free grammars on trees. Conference Record of ACM Symposium on Theory of Computing (pp. 143–148). Marina del Rey, CA.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjölander, K., Underwood, R. C., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22, 5112–5120.
Sander, C., & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56–68.
Schabes, Y. (1992). Stochastic lexicalized tree adjoining grammars. Proceedings of the 14th International Conference on Computational Linguistics (pp. 426–432). Nantes, France.
Searls, D. B. (1993). The computational linguistics of biological sequences. In L. Hunter (Ed.), Artificial intelligence and molecular biology. Menlo Park, CA: AAAI Press.
Stolcke, A., & Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian model merging. Proceedings of the Second International Colloquium on Grammatical Inference and Applications (pp. 106–118). Alicante, Spain: Springer-Verlag.
Taylor, W. R. (1986). The classification of amino acid conservation. Journal of Theoretical Biology, 119, 205–218.
Vijay-Shanker, K., & Joshi, A. K. (1985). Some computational properties of tree adjoining grammars. Proceedings of the 23rd Meeting of the Association for Computational Linguistics (pp. 82–93). Chicago, IL.
Wodak, S. J., & Rooman, M. J. (1993). Generating and testing protein folds. Current Opinion in Structural Biology, 3, 247–259.
Cite this article
Abe, N., Mamitsuka, H. Predicting Protein Secondary Structure Using Stochastic Tree Grammars. Machine Learning 29, 275–301 (1997). https://doi.org/10.1023/A:1007477814995
DOI: https://doi.org/10.1023/A:1007477814995