The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed, where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D^2) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N log N + D^2) time variation.
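The greedy idea behind the O(ND) bound can be sketched as follows. This is a minimal illustration rather than the paper's own listing: the function names `shortest_edit_length` and `lcs_length` are hypothetical, the sketch computes only the length D of a shortest edit script (insertions and deletions only) and not the script itself, and it keeps one furthest-reaching x coordinate per diagonal k = x - y of the edit graph.

```python
def shortest_edit_length(a, b):
    """Return D, the minimum number of insertions and deletions turning a into b.

    Sketch of the greedy furthest-reaching search on the edit graph:
    v[k] holds the largest x reached so far on diagonal k = x - y.
    """
    n, m = len(a), len(b)
    v = {1: 0}  # sentinel so that the first step starts at x = 0 on diagonal 0
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Choose whether to extend from diagonal k+1 (a vertical move,
            # i.e. an insertion) or from diagonal k-1 (a horizontal move,
            # i.e. a deletion), preferring the one that reaches further.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the "snake": free diagonal moves over matching symbols.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: d = n + m always suffices


def lcs_length(a, b):
    """Use the duality D = N - 2L, with N = len(a) + len(b)."""
    return (len(a) + len(b) - shortest_edit_length(a, b)) // 2
```

For example, `shortest_edit_length("ABCABBA", "CBABAC")` returns 5 and `lcs_length("ABCABBA", "CBABAC")` returns 4. The outer loop runs at most D + 1 times and each pass touches at most D + 1 diagonals, which is why the work stays small when the two sequences are similar.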
Key words: Longest common subsequence, Shortest edit script, Edit graph, File comparison