Abstract
Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for substrings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M+N) + N^2 log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.
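For orientation, the "simpler problem of aligning two fixed sequences" mentioned above can be sketched as a row-by-row dynamic program. This is not the paper's regular-expression algorithm. It is a minimal illustration of the baseline, assuming a minimum-cost formulation with hypothetical parameters `sub` and `indel` (real-valued substitution and insertion/deletion costs) and a hypothetical function name `alignment_cost`. It runs in O(MN) time and keeps a single row, so it needs only O(N) space when just the score is wanted.

```python
def alignment_cost(a: str, b: str, sub: float = 1.0, indel: float = 1.0) -> float:
    """Minimum-cost global alignment of two fixed sequences using one DP row.

    Assumed cost model (illustrative only): a match costs 0, a substitution
    costs `sub`, and each inserted or deleted symbol costs `indel`.
    """
    n = len(b)
    # row[j] holds the cost of aligning the processed prefix of a with b[:j].
    row = [j * indel for j in range(n + 1)]
    for ch in a:
        diag = row[0]      # previous row, column j - 1
        row[0] += indel    # delete ch against the empty prefix of b
        for j in range(1, n + 1):
            up = row[j]    # previous row, column j
            row[j] = min(
                diag + (0.0 if ch == b[j - 1] else sub),  # match or substitute
                up + indel,                                # delete ch from a
                row[j - 1] + indel,                        # insert b[j - 1]
            )
            diag = up
    return row[n]
```

With unit costs this reduces to the Levenshtein distance, so `alignment_cost("kitten", "sitting")` gives 3.0. Recovering the alignment itself, rather than only its score, needs additional bookkeeping that this sketch omits.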
Literature
Abarbanel, R. M., P. R. Wieneke, E. Mansfield, D. A. Jaffe and D. L. Brutlag. 1984. “Rapid Searches for Complex Patterns in Biological Molecules.” Nucleic Acids Res. 12, 263–280.
Aho, A. 1980. “Pattern Matching in Strings.” In Formal Language Theory, R. Book (Ed.). New York: Academic Press.
—, J. E. Hopcroft and J. D. Ullman. 1983. Data Structures and Algorithms, pp. 203–208. Reading, MA: Addison-Wesley.
Cohen, F. E., R. M. Abarbanel, I. D. Kuntz and R. J. Fletterick. 1986. “Turn Prediction in Proteins Using a Pattern-Matching Approach.” Biochemistry 25, 266–275.
Fitch, W. M. and T. F. Smith. 1983. “Optimal Sequence Alignments.” Proc. Natn. Acad. Sci. U.S.A. 80, 1382–1386.
Gotoh, O. 1982. “An Improved Algorithm for Matching Biological Sequences.” J. Molec. Biol. 162, 705–708.
Hecht, M. S. 1977. Flow Analysis of Computer Programs. Amsterdam: North-Holland.
— and J. D. Ullman. 1975. “A Simple Algorithm for Global Data Flow Analysis Programs.” SIAM J. Computing 4, 519–532.
Hopcroft, J. E. and J. D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.
Kennedy, K. 1975. “Node Listing Techniques Applied to Data Flow Analysis.” Proceedings of the 2nd ACM Conference on Principles of Programming Languages, 10–21.
Levenshtein, V. I. 1966. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Cybernetics Control Theory 10, 707–710.
Miller, W. 1987. A Software Tools Sampler. New Jersey: Prentice-Hall.
— and E. W. Myers. 1988a. “A Simple Row-Replacement Method.” Software-Practice and Experience 18, 597–611.
— and —. 1988b. “Sequence Comparison with Concave Weighting Functions.” Bull. Math. Biol. 50, 97–120.
Myers, E. W. and W. Miller. 1988a. “Row Replacement Algorithms for Screen Editors.” ACM Trans. Prog. Lang. Systems (to be published).
— and —. 1988b. “Optimal Alignments in Linear Space.” CABIOS 4, 11–17.
Pennello, T. J. 1986. “Very Fast LR Parsing.” Proceedings of the SIGPLAN'86 Symposium on Compiler Construction. ACM SIGPLAN Notices 21, 145–150.
Sankoff, D. and J. B. Kruskal. 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley.
Sellers, P. H. 1980. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, 359–373.
—. 1984. “Pattern Recognition in Genetic Sequences by Mismatch Density.” Bull. Math. Biol. 46, 501–514.
Thompson, K. 1968. “Regular Expression Search Algorithm.” Comm. ACM 11, 419–422.
Wagner, R. A. 1974. “Order-n Correction of Regular Languages.” Comm. ACM 17, 265–268.
— and J. I. Seiferas. 1978. “Correcting Counter-Automaton-Recognizable Languages.” SIAM J. Computing 7, 357–375.
Waterman, M. S. 1984. “General Methods for Sequence Comparison.” Bull. Math. Biol. 46, 473–500.
—, T. F. Smith and W. A. Beyer. 1976. “Some Biological Sequence Metrics.” Adv. Maths 20, 367–387.
Cite this article
Myers, E.W., Miller, W. Approximate matching of regular expressions. Bulletin of Mathematical Biology 51, 5–37 (1989). https://doi.org/10.1007/BF02458834
DOI: https://doi.org/10.1007/BF02458834