An Iterative Approach to Determining the Length of the Longest Common Subsequence of Two Strings

Methodology and Computing in Applied Probability

Abstract

This paper concerns the longest common subsequence (LCS) shared by two sequences (or strings) of length N whose elements are chosen at random from a finite alphabet. The exact distribution and the expected value of the length of the LCS, k say, of two random sequences remain open problems in applied probability: the expected value E(N) is known to lie within certain limits, but its exact value and the exact distribution of k are unknown. In this paper, we calculate the length of the LCS for all possible pairs of binary sequences for N = 1 to 14. The length of the LCS and the Hamming distance are represented in color on two all-against-all arrays. An iterative approach is then introduced in which we determine the pairs of sequences whose LCS length increases by one upon the addition of one letter to each sequence. The pairs whose score increased are marked on a black-and-white array, which has an interesting fractal-like structure. As the sequence length increases, R(N) (the proportion of sequence pairs whose score increased) approaches the Chvátal–Sankoff constant a_c (the proportionality constant for the linear growth of the expected length of the LCS with sequence length). We show that R(N) converges to a_c more rapidly than E(N)/N does.
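The exhaustive calculation described in the abstract can be illustrated for small N with the following minimal sketch. It is not the authors' code: the function names `lcs_length` and `lcs_statistics` are hypothetical, the standard dynamic-programming recurrence stands in for whatever implementation the paper uses, and the indexing of R(N) (pairs of length N compared with their length N-1 prefixes) is an assumption. In this sketch the first returned value is the mean LCS length over all pairs, so E(N)/N is obtained by dividing it by N.

```python
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_statistics(n, alphabet="01"):
    """Enumerate all ordered pairs of length-n strings over the alphabet.

    Returns (mean LCS length, fraction of pairs whose LCS is longer than
    the LCS of their length n-1 prefixes). The second value plays the
    role of R(N) under the indexing assumed in this sketch.
    """
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    total = 0
    increased = 0
    for x in words:
        for y in words:
            k = lcs_length(x, y)
            total += k
            # Appending one letter to each string raises the LCS by 0 or 1.
            if k > lcs_length(x[:-1], y[:-1]):
                increased += 1
    pairs = len(words) ** 2
    return total / pairs, increased / pairs


if __name__ == "__main__":
    for n in range(1, 7):
        mean_lcs, ratio = lcs_statistics(n)
        print(n, mean_lcs / n, ratio)
```

Because the LCS length grows by either 0 or 1 when one letter is appended to each string, the second value equals the difference between successive mean LCS lengths. The enumeration visits 4^N pairs for a binary alphabet, so this brute-force version is only practical for small N.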




About this article

Cite this article

Booth, H.S., MacNamara, S.F., Nielsen, O.M. et al. An Iterative Approach to Determining the Length of the Longest Common Subsequence of Two Strings. Methodology and Computing in Applied Probability 6, 401–421 (2004). https://doi.org/10.1023/B:MCAP.0000045088.88240.3a

  • DOI: https://doi.org/10.1023/B:MCAP.0000045088.88240.3a
