Fast identification of approximately matching substrings
Let two strings S, T over a finite alphabet Σ be given, and let M be an arbitrary relation on Σ×Σ. Two length-m subwords (substrings) x ⊑ S, y ⊑ T form an approximate match (x,y) when M(x_i, y_i) holds for all 1 ≤ i ≤ m. A match implies all the local alignments (without insertions and deletions) that pair specific occurrences of x and y. A match (x,y) is maximal if there exists no longer match (u,v) such that every local alignment implied by (x,y) is contained in a local alignment implied by (u,v). We give an efficient algorithm for finding all maximal matches between S and T. The algorithm runs in time bounded by the sum of the lengths of the maximal matches, at worst O(|Σ|² n²). The main application is identifying homologous regions of protein sequences.
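To make the definitions concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It scans every diagonal of the S×T comparison and reports each maximal gap-free local alignment under a relation M, taking O(|S|·|T|) time. It does not use a suffix tree and does not group alignments by subword pair, so it illustrates only the notion of a local alignment that cannot be extended. The names `maximal_local_alignments`, `related`, and `min_len`, as well as the example relation on the residues I, L, V, are assumptions made for illustration.

```python
from typing import Callable, Iterator, Tuple


def maximal_local_alignments(
    s: str,
    t: str,
    related: Callable[[str, str], bool],
    min_len: int = 1,
) -> Iterator[Tuple[int, int, int]]:
    """Yield (i, j, length) for each maximal gap-free run on which
    related(s[i + k], t[j + k]) holds for all 0 <= k < length.

    Brute-force diagonal scan; illustrative only.
    """
    n, m = len(s), len(t)
    # Each diagonal is identified by its offset d = j - i.
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run = 0  # length of the current run of related symbol pairs
        while i < n and j < m:
            if related(s[i], t[j]):
                run += 1
            else:
                # The run ended just before (i, j); report it if long enough.
                if run >= min_len:
                    yield (i - run, j - run, run)
                run = 0
            i += 1
            j += 1
        # Report a run that reaches the end of the diagonal.
        if run >= min_len:
            yield (i - run, j - run, run)


def hydrophobic_or_equal(a: str, b: str) -> bool:
    """Hypothetical relation M: equal symbols, or both among I, L, V."""
    return a == b or (a in "ILV" and b in "ILV")


if __name__ == "__main__":
    for hit in maximal_local_alignments("KIL", "KVV", hydrophobic_or_equal, min_len=2):
        print(hit)
```

Each reported triple gives 0-based start positions in S and T and the run length. Because M is passed in as a function, any relation on Σ×Σ can be substituted, including one that is neither symmetric nor transitive.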
Keywords: Linear Time, Local Alignment, Maximal Match, Suffix Tree, Finite Alphabet