Approximate matching of regular expressions

Bulletin of Mathematical Biology

Abstract

Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for substrings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M+N) + N^2 log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.
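To make the problem statement concrete, the following is a minimal sketch in Python of approximate matching against a regular expression. It is not the O(MN) algorithm of the paper: it assumes unit edit costs (so it minimizes a distance rather than maximizing a score), takes the regular expression as a hand-built nondeterministic automaton instead of parsing it, and handles the cycles introduced by Kleene closure with a naive relaxation to a fixpoint. The function name `approx_regex_distance`, the edge-list encoding, and the example automata are all hypothetical choices made for illustration.

```python
import math

def approx_regex_distance(a, edges, start, accept, n_states):
    """Minimum unit-cost edit distance between string `a` and any string
    accepted by the automaton. `edges` is a list of (src, symbol, dst)
    triples; symbol None marks an epsilon edge. (Illustrative encoding.)"""

    def close_row(row):
        # Relax moves that consume no character of `a`: epsilon edges cost 0,
        # symbol edges taken without reading `a` cost 1 (an insertion).
        # Loops from Kleene closure make the graph cyclic, so repeat until
        # no value improves. This is a naive fixpoint, not the paper's method.
        changed = True
        while changed:
            changed = False
            for src, sym, dst in edges:
                cand = row[src] + (0 if sym is None else 1)
                if cand < row[dst]:
                    row[dst] = cand
                    changed = True
        return row

    # Only one row of the dynamic-programming table is kept at a time,
    # so the space used is proportional to the number of states.
    prev = [math.inf] * n_states
    prev[start] = 0
    close_row(prev)
    for ch in a:
        # Option 1: delete `ch`, staying in the same state at cost 1.
        cur = [d + 1 for d in prev]
        # Option 2: follow a symbol edge, matching (0) or substituting (1).
        for src, sym, dst in edges:
            if sym is not None:
                cand = prev[src] + (0 if sym == ch else 1)
                if cand < cur[dst]:
                    cur[dst] = cand
        prev = close_row(cur)
    return prev[accept]

# Hand-built automaton for the expression ab*c (states 0, 1, 2).
AB_STAR_C = [(0, "a", 1), (1, "b", 1), (1, "c", 2)]
# Hand-built automaton for (ab)* with an epsilon edge back to the start.
AB_LOOP = [(0, "a", 1), (1, "b", 2), (2, None, 0)]

if __name__ == "__main__":
    print(approx_regex_distance("abxc", AB_STAR_C, 0, 2, 3))
    print(approx_regex_distance("aba", AB_LOOP, 0, 0, 3))
```

Each row costs at most (number of states) passes over the edge list, so this sketch is slower than the O(MN) bound stated in the abstract. It does show the shape of the recurrence: one row of values per character of A, with one entry per automaton state.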

Literature

  • Abarbanel, R. M., P. R. Wieneke, E. Mansfield, D. A. Jaffe and D. L. Brutlag. 1984. “Rapid Searches for Complex Patterns in Biological Molecules.” Nucleic Acids Res. 12, 263–280.

  • Aho, A. 1980. “Pattern Matching in Strings.” In Formal Language Theory, R. Book (Ed.). New York: Academic Press.

  • Aho, A., J. E. Hopcroft and J. D. Ullman. 1983. Data Structures and Algorithms, pp. 203–208. Reading, MA: Addison-Wesley.

  • Cohen, F. E., R. M. Abarbanel, I. D. Kuntz and R. J. Fletterick. 1986. “Turn Prediction in Proteins Using a Pattern-Matching Approach.” Biochemistry 25, 266–275.

  • Fitch, W. M. and T. F. Smith. 1983. “Optimal Sequence Alignments.” Proc. Natn. Acad. Sci. U.S.A. 80, 1382–1386.

  • Gotoh, O. 1982. “An Improved Algorithm for Matching Biological Sequences.” J. Molec. Biol. 162, 705–708.

  • Hecht, M. S. 1977. Flow Analysis of Computer Programs. Amsterdam: North-Holland.

  • Hecht, M. S. and J. D. Ullman. 1975. “A Simple Algorithm for Global Data Flow Analysis Programs.” SIAM J. Computing 4, 519–532.

  • Hopcroft, J. E. and J. D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.

  • Kennedy, K. 1975. “Node Listing Techniques Applied to Data Flow Analysis.” Proceedings of the 2nd ACM Conference on Principles of Programming Languages, 10–21.

  • Levenshtein, V. I. 1966. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Cybernetics Control Theory 10, 707–710.

  • Miller, W. 1987. A Software Tools Sampler. New Jersey: Prentice-Hall.

  • Miller, W. and E. W. Myers. 1988a. “A Simple Row-Replacement Method.” Software-Practice and Experience 18, 597–611.

  • Miller, W. and E. W. Myers. 1988b. “Sequence Comparison with Concave Weighting Functions.” Bull. Math. Biol. 50, 97–120.

  • Myers, E. W. and W. Miller. 1988a. “Row replacement Algorithms for Screen Editors.” ACM Trans. Prog. Lang. Systems (to be published).

  • Myers, E. W. and W. Miller. 1988b. “Optimal Alignments in Linear Space.” CABIOS 4, 11–17.

  • Pennello, T. J. 1986. “Very Fast LR Parsing.” Proceedings of the SIGPLAN'86 Symposium on Compiler Construction. ACM SIGPLAN Notices 21, 145–150.

  • Sankoff, D. and J. B. Kruskal. 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley.

  • Sellers, P. H. 1980. “The Theory and Computation of Evolutionary Distances: Pattern Recognition.” J. Algorithms 1, 359–373.

  • Sellers, P. H. 1984. “Pattern Recognition in Genetic Sequences by Mismatch Density.” Bull. Math. Biol. 46, 501–514.

  • Thompson, K. 1968. “Regular Expression Search Algorithm.” Comm. ACM 11, 419–422.

  • Wagner, R. A. 1974. “Order-n Correction of Regular Languages.” Comm. ACM 17, 265–268.

  • Wagner, R. A. and J. I. Seiferas. 1978. “Correcting Counter-Automaton-Recognizable Languages.” SIAM J. Computing 7, 357–375.

  • Waterman, M. S. 1984. “General Methods for Sequence Comparison.” Bull. Math. Biol. 46, 473–500.

  • Waterman, M. S., T. F. Smith and W. A. Beyer. 1976. “Some Biological Sequence Metrics.” Adv. Maths 20, 367–387.


Cite this article

Myers, E.W., Miller, W. Approximate matching of regular expressions. Bltn Mathcal Biology 51, 5–37 (1989). https://doi.org/10.1007/BF02458834

  • DOI: https://doi.org/10.1007/BF02458834
