Abstract
This paper concerns the longest common subsequence (LCS) shared by two sequences (or strings) of length N whose elements are chosen at random from a finite alphabet. The expected length E(N) of the LCS of two random strings is known to lie within certain bounds, but its exact value and the exact distribution of the LCS length remain open problems in applied probability. In this paper, we calculate the length of the LCS for all possible pairs of binary sequences of length N = 1 to 14. The length of the LCS and the Hamming distance are represented in color on two all-against-all arrays. We then introduce an iterative approach in which we determine the pairs of sequences whose LCS length increases by one when one letter is added to each sequence. The pairs whose score increased are marked on a black-and-white array, which has an interesting fractal-like structure. As the sequence length increases, R(N) (the proportion of sequence pairs whose score increased) approaches the Chvátal–Sankoff constant a_c (the proportionality constant for the linear growth of the expected length of the LCS with sequence length). We show that R(N) converges more rapidly to a_c than E(N)/N does.
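The quantities named in the abstract can be illustrated for small N with a brute-force sketch. This is not the authors' code: the function names lcs_length, mean_lcs and increase_fraction are hypothetical, the LCS routine is the textbook dynamic program, and the indexing of R(N) is an assumption (here increase_fraction(n) compares each pair of length n+1 with its length-n prefixes, which may differ from the paper's convention).

```python
from itertools import product


def lcs_length(x, y):
    """Textbook dynamic-programming LCS length, keeping one row at a time."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def mean_lcs(n, alphabet="01"):
    """E(n): LCS length averaged over all ordered pairs of length-n sequences."""
    seqs = list(product(alphabet, repeat=n))
    total = sum(lcs_length(x, y) for x in seqs for y in seqs)
    return total / len(seqs) ** 2


def increase_fraction(n, alphabet="01"):
    """Assumed reading of R: the fraction of pairs of length n+1 whose LCS
    is exactly one longer than the LCS of their length-n prefixes."""
    seqs = list(product(alphabet, repeat=n + 1))
    increased = sum(
        lcs_length(x, y) == lcs_length(x[:-1], y[:-1]) + 1
        for x in seqs
        for y in seqs
    )
    return increased / len(seqs) ** 2


if __name__ == "__main__":
    # Exhaustive enumeration costs 4**n pairs, so keep n small here.
    for n in range(1, 6):
        print(n, mean_lcs(n) / n, increase_fraction(n))
```

Under this reading, appending one letter to each sequence raises the LCS length by either 0 or 1, so increase_fraction(n) equals mean_lcs(n + 1) - mean_lcs(n). The enumeration grows as 4^N, which is why the sketch stops well short of the N = 14 reached in the paper.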
Cite this article
Booth, H.S., MacNamara, S.F., Nielsen, O.M. et al. An Iterative Approach to Determining the Length of the Longest Common Subsequence of Two Strings. Methodology and Computing in Applied Probability 6, 401–421 (2004). https://doi.org/10.1023/B:MCAP.0000045088.88240.3a