Machine Learning, Volume 29, Issue 2–3, pp 275–301

Predicting Protein Secondary Structure Using Stochastic Tree Grammars

  • Naoki Abe
  • Hiroshi Mamitsuka


Abstract

We propose a new method for predicting the secondary structure of a protein from its amino acid sequence, based on a training algorithm for the probability parameters of a stochastic tree grammar. In particular, we concentrate on the problem of predicting β-sheet regions, which has previously been considered difficult because of the unbounded dependencies exhibited by sequences corresponding to β-sheets. To cope with this difficulty, we use a new family of stochastic tree grammars, which we call Stochastic Ranked Node Rewriting Grammars. These grammars are powerful enough to capture the type of dependencies exhibited by the sequences of β-sheet regions, such as the ‘parallel’ and ‘anti-parallel’ dependencies and their combinations. The training algorithm we use is an extension of the ‘inside-outside’ algorithm for stochastic context-free grammars, but with a number of significant modifications. We applied our method to real data obtained from the HSSP database (Homology-derived Secondary Structure of Proteins, Ver 1.0), and the results were encouraging: our method was able to predict roughly 75 percent of the β-strands correctly in a systematic evaluation experiment in which the test sequences not only have less than 25 percent identity to the training sequences but are totally unrelated to them. This figure compares favorably to the predictive accuracy of the state-of-the-art prediction methods in the field, even though our experiment was on a restricted type of β-sheet structure and the test was done on a relatively small data set. We also stress that our method can predict the structure as well as the location of β-sheet regions, which was not possible with conventional methods for secondary structure prediction. Extended abstracts of parts of the work presented in this paper have appeared in (Abe & Mamitsuka, 1994) and (Mamitsuka & Abe, 1994).
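As background for the ‘inside-outside’ algorithm mentioned above, the following is a minimal sketch of the standard inside pass for a stochastic context-free grammar in Chomsky normal form. It is not the authors' training algorithm for Stochastic Ranked Node Rewriting Grammars, which the abstract describes only as an extension with significant modifications. The function name, the rule encoding, and the toy grammar are all assumptions made for illustration.

```python
from collections import defaultdict


def inside_probability(symbols, lexical, binary, start="S"):
    """Probability that a stochastic CFG in Chomsky normal form generates `symbols`.

    lexical: {(A, terminal): prob} for rules A -> terminal   (hypothetical encoding)
    binary:  {(A, B, C): prob}     for rules A -> B C        (hypothetical encoding)
    """
    n = len(symbols)
    # inside[i][j][A] = probability that nonterminal A derives symbols[i:j]
    inside = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: spans of length 1 are covered by lexical rules.
    for i, sym in enumerate(symbols):
        for (a, terminal), p in lexical.items():
            if terminal == sym:
                inside[i][i + 1][a] += p

    # Recursive case: combine two adjacent sub-spans with a binary rule,
    # summing over every split point and every applicable rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    left = inside[i][k].get(b, 0.0)
                    right = inside[k][j].get(c, 0.0)
                    if left and right:
                        inside[i][j][a] += p * left * right

    return inside[0][n].get(start, 0.0)


# Toy grammar (an assumption, unrelated to protein data):
#   S -> S S  with probability 0.4
#   S -> 'a'  with probability 0.6
toy_lexical = {("S", "a"): 0.6}
toy_binary = {("S", "S", "S"): 0.4}

if __name__ == "__main__":
    # Sums the probabilities of both parse trees of "aaa".
    print(inside_probability("aaa", toy_lexical, toy_binary))
```

The inside pass sums over all parse trees of the input. A full inside-outside training loop would add the matching outside pass and re-estimate the rule probabilities from expected rule counts. Both steps are omitted here.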

Keywords

stochastic tree grammars, protein secondary structure prediction, beta-sheets, maximum likelihood estimation, minimum description length principle, unsupervised learning


References

  1. Abe, N. (1988). Feasible learnability of formal grammars and the theory of natural language acquisition. Proceedings of the 12th International Conference on Computational Linguistics (pp. 1–6). Budapest, Hungary.
  2. Abe, N., & Mamitsuka, H. (1994). A new method for predicting protein secondary structures based on stochastic tree grammars. Proceedings of the Eleventh International Conference on Machine Learning (pp. 3–11). New Brunswick, NJ: Morgan Kaufmann.
  3. Baldi, P., Chauvin, Y., Hunkapiller, T., & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences, 91, 1059–1063.
  4. Barton, G. J. (1995). Protein secondary structure prediction: Review article. Current Opinion in Structural Biology, 5, 372–376.
  5. Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
  6. Cost, S., & Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, 57–78.
  7. Doolittle, R. F., Feng, D. F., Johnson, M. S., & McClure, M. A. (1986). Relationships of human protein sequences to those of other organisms. The Cold Spring Harbor Symposium on Quantitative Biology, 51, 447–455.
  8. Eddy, S. R., & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Research, 22, 2079–2088.
  9. Fauchere, J., & Pliska, V. (1983). Hydrophobic parameters of amino acid side chains from the partitioning of N-acetyl-amino acid amides. European Journal of Medicinal Chemistry: Chimica Therapeutica, 18, 369–375.
  10. Hobohm, U., Scharf, M., Schneider, R., & Sander, C. (1992). Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science, 1, 409–417.
  11. Jelinek, F., Lafferty, J. D., & Mercer, R. L. (1990). Basic methods of probabilistic context free grammars. IBM Research Report RC16374 (#72684). Yorktown Heights, NY: IBM, Thomas J. Watson Research Center.
  12. Joshi, A. K., Levy, L., & Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10, 136–163.
  13. Kneller, D., Cohen, F., & Langridge, R. (1990). Improvements in protein secondary structure prediction by an enhanced neural network. Journal of Molecular Biology, 214, 171–182.
  14. Krogh, A., Brown, M., Mian, I. S., Sjölander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501–1531.
  15. Levinson, S. E., Rabiner, L. R., & Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal, 62, 1035–1074.
  16. Mamitsuka, H., & Abe, N. (1994). Predicting location and structure of beta-sheet regions using stochastic tree grammars. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (pp. 276–284). Palo Alto, CA: The AAAI Press.
  17. Mamitsuka, H., & Yamanishi, K. (1995). α-Helix region prediction with stochastic rule learning. Computer Applications in the Biosciences, 11, 399–411.
  18. May, A. C. W., & Blundell, T. L. (1994). Automated comparative modelling of protein structures. Current Opinion in Biotechnology, 5, 355–360.
  19. Muggleton, S., King, R. D., & Sternberg, M. J. E. (1992). Protein secondary structure prediction using logic-based machine learning. Protein Engineering, 5, 647–657.
  20. Paz, A. (1971). Introduction to probabilistic automata. New York: Academic Press.
  21. Qian, N., & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.
  22. Riis, S., & Krogh, A. (1996). Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. Journal of Computational Biology, 3, 163–183.
  23. Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14, 1080–1100.
  24. Rost, B., & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584–599.
  25. Rounds, W. C. (1969). Context-free grammars on trees. Conference Record of ACM Symposium on Theory of Computing (pp. 143–148). Marina del Rey, CA.
  26. Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjölander, K., Underwood, R. C., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22, 5112–5120.
  27. Sander, C., & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56–68.
  28. Schabes, Y. (1992). Stochastic lexicalized tree adjoining grammars. Proceedings of the 14th International Conference on Computational Linguistics (pp. 426–432). Nantes, France.
  29. Searls, D. B. (1993). The computational linguistics of biological sequences. In L. Hunter (Ed.), Artificial intelligence and molecular biology. Menlo Park, CA: AAAI Press.
  30. Stolcke, A., & Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian model merging. Proceedings of the Second International Colloquium on Grammatical Inference and Applications (pp. 106–118). Alicante, Spain: Springer-Verlag.
  31. Taylor, W. R. (1986). The classification of amino acid conservation. Journal of Theoretical Biology, 119, 205–218.
  32. Vijay-Shanker, K., & Joshi, A. K. (1985). Some computational properties of tree adjoining grammars. Proceedings of 23rd Meeting of the Association for Computational Linguistics (pp. 82–93). Chicago, IL.
  33. Wodak, S. J., & Rooman, M. J. (1993). Generating and testing protein folds. Current Opinion in Structural Biology, 3, 247–259.

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Naoki Abe (1)
  • Hiroshi Mamitsuka (1)
  1. Theory NEC Laboratory, Real World Computing Partnership, C & C Media Research Laboratories, NEC Corporation, Miyamae-ku, Kawasaki, Japan
