Abstract
Stochastic context-free grammars (SCFGs) can be applied to the problems of folding, aligning and modeling families of homologous RNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. This paper discusses our new algorithm, Tree-Grammar EM, for deducing SCFG parameters automatically from unaligned, unfolded training sequences. Tree-Grammar EM, a generalization of the HMM forward-backward algorithm, is based on tree grammars and is faster than the previously proposed inside-outside SCFG training algorithm. Independently, Sean Eddy and Richard Durbin have introduced a trainable “covariance model” (CM) to perform similar tasks. We compare and contrast our methods with theirs.
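As a rough illustration of how an SCFG assigns a probability to an RNA sequence while accounting for base pairing, the sketch below computes an inside probability for a toy grammar. It is not the Tree-Grammar EM algorithm described in the paper. The grammar (a single nonterminal S with rules S → x S y, S → x S and S → x), the probability values, and the names `PAIR`, `LEFT`, `END` and `inside_probability` are all assumptions made for this example.

```python
from functools import lru_cache

# Hypothetical toy SCFG with one nonterminal S. The rule probabilities are
# illustrative assumptions; they sum to 1 over all rules for S.
PAIR = {("a", "u"): 0.15, ("u", "a"): 0.15,
        ("c", "g"): 0.15, ("g", "c"): 0.15}   # S -> x S y (base pair)
LEFT = {n: 0.05 for n in "acgu"}              # S -> x S   (unpaired base)
END = {n: 0.05 for n in "acgu"}               # S -> x     (last base)


def inside_probability(seq):
    """Sum the probabilities of all derivations of seq from S."""

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Probability that S derives the subsequence seq[i:j].
        n = j - i
        total = 0.0
        if n == 1:
            total += END[seq[i]]
        if n >= 2:
            total += LEFT[seq[i]] * inside(i + 1, j)
        if n >= 3:
            pair_prob = PAIR.get((seq[i], seq[j - 1]), 0.0)
            total += pair_prob * inside(i + 1, j - 1)
        return total

    if not seq:
        return 0.0  # this toy grammar cannot derive the empty string
    return inside(0, len(seq))


if __name__ == "__main__":
    # "gac" can pair its outer g and c; "gaa" cannot, so it scores lower.
    print(inside_probability("gac"), inside_probability("gaa"))
```

Under these assumed probabilities, a sequence whose ends can form a Watson-Crick pair scores higher than one whose ends cannot, which is the sense in which a grammar of this kind captures secondary as well as primary structure.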
We thank Anders Krogh, Harry Noller and Bryn Weiser for discussions and assistance, and Michael Waterman and David Searls for discussions. This work was supported by NSF grants CDA-9115268 and IRI-9123692 and NIH grant number GM17129. This material is based upon work supported under a National Science Foundation Graduate Research Fellowship.
References
A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling, Vol. I: Parsing. Prentice Hall, Englewood Cliffs, N.J., 1972.
J. K. Baker. Trainable grammars for speech recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550, 1979.
J. W. Brown, E. S. Haas, B. D. James, D. A. Hunt, J. S. Liu, and N. R. Pace. Phylogenetic analysis and evolution of RNase P RNA in proteobacteria. Journal of Bacteriology, 173:3855–3863, 1991.
M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjölander, and D. Haussler. Dirichlet mixture priors for HMMs. In preparation, 1993.
M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjölander, and D. Haussler. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In L. Hunter, D. Searls, and J. Shavlik, editors, Proc. of First Int. Conf. on Intelligent Systems for Molecular Biology, pages 47–55, Menlo Park, CA, July 1993. AAAI/MIT Press.
S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models. Submitted to Nucleic Acids Research, 1994.
J. Engelfriet and G. Rozenberg. Graph grammars based on node rewriting: An introduction to NLC graph grammars. In H. Ehrig, H. J. Kreowski, and G. Rozenberg, editors, Lecture Notes in Computer Science, volume 532, pages 12–23. Springer-Verlag, 1991.
K. S. Fu. Syntactic pattern recognition and applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
G. E. Fox and C. R. Woese. 5S RNA secondary structure. Nature, 256:505–507, 1975.
M. Gouy. Secondary structure prediction of RNA. In M. J. Bishop and C. R. Rawlings, editors, Nucleic acid and protein sequence analysis, a practical approach, pages 259–284. IRL Press, Oxford, England, 1987.
C. Guthrie and B. Patterson. Spliceosomal snRNAs. Annual Review of Genetics, 22:387–419, 1988.
R. R. Gutell, A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Research, 20:5785–5795, 1992.
D. Haussler, A. Krogh, I. S. Mian, and K. Sjölander. Protein modeling using hidden Markov models: Analysis of globins. In Proceedings of the Hawaii International Conference on System Sciences, volume 1, pages 792–802, Los Alamitos, CA, 1993. IEEE Computer Society Press.
T. Klinger and D. Brutlag. Detection of correlations in tRNA sequences with structural implications. In Lawrence Hunter, David Searls, and Jude Shavlik, editors, First International Conference on Intelligent Systems for Molecular Biology, Menlo Park, 1993. AAAI Press.
A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, Feb. 1994.
Allan Lapedes. Private communication, 1992.
N. Larsen, G. J. Olsen, B. L. Maidak, M. J. McCaughey, R. Overbeek, T. J. Macke, T. L. Marsh, and C. R. Woese. The ribosomal database project. Nucleic Acids Research, 21:3021–3023, 1993.
R. H. Lathrop and T. F. Smith. A branch-and-bound algorithm for optimal protein threading with pairwise (contact potential) amino acid interactions. In Proceedings of the 27th Hawaii International Conference on System Sciences, Los Alamitos, CA, 1994. IEEE Computer Society Press.
K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.
F. Michel, A. D. Ellington, S. Couture, and J. W. Szostak. Phylogenetic and genetic evidence for base-triples in the catalytic domain of group I introns. Nature, 347:578–580, 1990.
F. Michel, K. Umesono, and H. Ozeki. Comparative and functional anatomy of group II catalytic introns: a review. Gene, 82:5–30, 1989.
F. Michel and E. Westhof. Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. Journal of Molecular Biology, 216:585–610, 1990.
R. Nussinov, G. Pieczenik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68–82, 1978.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE, 77(2):257–286, 1989.
W. Saenger. Principles of nucleic acid structure. Springer Advanced Texts in Chemistry. Springer-Verlag, New York, 1984.
Y. Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97:23–60, 1992.
D. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45:810–825, 1985.
Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. Underwood, and D. Haussler. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. Submitted for publication, 1993.
Y. Sakakibara, M. Brown, I. S. Mian, R. Underwood, and D. Haussler. Stochastic context-free grammars for modeling RNA. In Proceedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA, 1994. IEEE Computer Society Press.
Y. Sakakibara, M. Brown, R. Underwood, I. S. Mian, and D. Haussler. Stochastic context-free grammars for modeling RNA. Technical Report UCSC-CRL-93-16, UC Santa Cruz, Computer and Information Sciences Dept., Santa Cruz, CA 95064, 1993.
T. J. Santner and D. E. Duffy. The Statistical Analysis of Discrete Data. Springer Verlag, New York, 1989.
D. B. Searls and S. Dong. A syntactic pattern recognition system for DNA sequences. In Proc. 2nd Int. Conf. on Bioinformatics, Supercomputing and complex genome analysis. World Scientific, 1993. In press.
David B. Searls. The linguistics of DNA. American Scientist, 80:579–591, November–December 1992.
D. B. Searls. The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology, chapter 2, pages 47–120. AAAI Press, 1993.
D. B. Searls. String variable grammar: a logic grammar formalism for DNA sequences, 1993. Unpublished.
B. A. Shapiro and K. Zhang. Comparing multiple RNA secondary structures using tree comparisons. CABIOS, 6(4):309–318, 1990.
A. J. Tranguch and D. R. Engelke. Comparative structural analysis of nuclear RNase P RNAs from yeast. Journal of Biological Chemistry, 268:14045–14055, 1993.
D. H. Turner, N. Sugimoto, and S. M. Freier. RNA structure prediction. Annual Review of Biophysics and Biophysical Chemistry, 17:167–192, 1988.
I. Tinoco Jr., O. C. Uhlenbeck, and M. D. Levine. Estimation of secondary structure in ribonucleic acids. Nature, 230:363–367, 1971.
J. W. Thatcher and J. B. Wright. Generalized finite automata theory with an application to a decision problem of second-order logic. Mathematical Systems Theory, 2:57–81, 1968.
M. S. Waterman. Computer analysis of nucleic acid sequences. Methods in Enzymology, 164:765–792, 1988.
M. S. Waterman. Consensus methods for folding single-stranded nucleic acids. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, chapter 8. CRC Press, 1989.
C. R. Woese, R. R. Gutell, R. Gupta, and H. F. Noller. Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiological Reviews, 47(4):621–669, 1983.
S. Winker, R. Overbeek, C. R. Woese, G. J. Olsen, and N. Pfluger. Structure detection through automated covariance search. Computer Applications in the Biosciences, 6:365–371, 1990.
J. R. Wyatt, J. D. Puglisi, and I. Tinoco Jr. RNA folding: pseudoknots, loops and bulges. BioEssays, 11(4):100–106, 1989.
M. Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244:48–52, 1989.
C. Zwieb. Structure and function of signal recognition particle RNA. Progress in Nucleic Acid Research and Molecular Biology, 37:207–234, 1989.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sakakibara, Y. et al. (1994). Recent methods for RNA modeling using stochastic context-free grammars. In: Crochemore, M., Gusfield, D. (eds) Combinatorial Pattern Matching. CPM 1994. Lecture Notes in Computer Science, vol 807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58094-8_25
DOI: https://doi.org/10.1007/3-540-58094-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58094-2
Online ISBN: 978-3-540-48450-9
eBook Packages: Springer Book Archive