Computing similarity between RNA strings

  • Vineet Bafna
  • S. Muthukrishnan
  • R. Ravi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 937)


Ribonucleic acid (RNA) strings are strings over the four-letter alphabet {A, C, G, U} with a secondary structure of base-pairing between A-U and C-G pairs in the string. Edges are drawn between two bases that are paired in the secondary structure and these edges have traditionally been assumed to be noncrossing. The noncrossing base-pairing naturally leads to a tree-like representation of the secondary structure of RNA strings.
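
As a minimal sketch of this tree-like representation, the Python snippet below checks that a set of base pairs is noncrossing and nests them into a forest. The dot-bracket input notation and the helper names `parse_pairs`, `is_noncrossing`, and `build_tree` are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch only: dot-bracket notation and all function names
# here are assumptions, not the paper's own representation.

def parse_pairs(structure):
    """Return sorted (i, j) base pairs from a dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected symbol: " + ch)
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def is_noncrossing(pairs):
    """True if no two pairs (i, j), (k, l) interleave as i < k < j < l."""
    return not any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


def build_tree(sequence, pairs):
    """Nest noncrossing pairs into a forest of (label, children) nodes.

    A paired node is labelled with its two bases; an unpaired base is a leaf.
    """
    partner = {i: j for i, j in pairs}  # opening position -> closing position

    def build(lo, hi):
        nodes, pos = [], lo
        while pos < hi:
            if pos in partner:
                j = partner[pos]
                nodes.append((sequence[pos] + sequence[j], build(pos + 1, j)))
                pos = j + 1
            else:
                nodes.append((sequence[pos], []))
                pos += 1
        return nodes

    return build(0, len(sequence))


pairs = parse_pairs("((..))")
forest = build_tree("GGAACC", pairs)
```

Here the two nested G-C pairs become a chain of two internal nodes, with the two unpaired A bases as leaves of the inner node.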

In this paper, we address several notions of similarity between two RNA strings that take into account both the primary sequence and secondary base-pairing structure of the strings. We present efficient algorithms for exact matching and approximate matching between two RNA strings. We define a notion of alignment between two RNA strings and devise algorithms based on dynamic programming. We then present a method for optimally aligning a given RNA string with unknown secondary structure to one with known sequence and structure, thus attacking the structure prediction problem in the case when the structure of a closely related sequence is known. The techniques employed to prove our results include reductions to well-known string matching problems allowing wild cards and ranges, and speeding up dynamic programming by using the tree structures implicit in the secondary structure of RNA strings.
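
The abstract does not give the recurrences behind these dynamic-programming algorithms. Purely as a point of reference, the sketch below runs the classic edit-distance recurrence over positions annotated with both a base and a structure symbol. It is a simplified baseline, not the paper's alignment algorithm: it does not require the two ends of a base pair to be aligned consistently, and the dot-bracket notation, the cost parameters, and the name `annotated_edit_distance` are all hypothetical.

```python
# Baseline sketch only: NOT the algorithm from the paper. The dot-bracket
# notation, the unit costs, and the function name are assumptions.

def annotated_edit_distance(seq_a, struct_a, seq_b, struct_b,
                            base_cost=1, struct_cost=1, gap_cost=1):
    """Edit distance over positions labelled (base, structure symbol)."""
    if len(seq_a) != len(struct_a) or len(seq_b) != len(struct_b):
        raise ValueError("sequence and structure lengths must match")
    a = list(zip(seq_a, struct_a))
    b = list(zip(seq_b, struct_b))
    n, m = len(a), len(b)

    # dp[i][j] = cheapest alignment of the first i positions of a
    # with the first j positions of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substitution pays separately for a base mismatch and for a
            # structure-symbol mismatch.
            sub = (base_cost * (a[i - 1][0] != b[j - 1][0])
                   + struct_cost * (a[i - 1][1] != b[j - 1][1]))
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # align the two positions
                           dp[i - 1][j] + gap_cost,  # gap in b
                           dp[i][j - 1] + gap_cost)  # gap in a
    return dp[n][m]


distance = annotated_edit_distance("GGAACC", "((..))", "GGAUCC", "((..))")
```

The table has (n+1)(m+1) entries, so this baseline takes time proportional to the product of the two string lengths.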


Keywords: RNA structure, edit distances, approximate matching, string algorithms, trees





Copyright information

© Springer-Verlag Berlin Heidelberg 1995

Authors and Affiliations

  • Vineet Bafna (1)
  • S. Muthukrishnan (2)
  • R. Ravi (3)
  1. DIMACS Center, Piscataway
  2. DIMACS Center, Piscataway
  3. DIMACS, Department of Computer Science, Princeton University
