An Iterative Approach to Determining the Length of the Longest Common Subsequence of Two Strings

Booth, Hilary S.; MacNamara, Shevarl F.; Nielsen, Ole M.; Wilson, Susan R.

doi:10.1023/B:MCAP.0000045088.88240.3a

Published: December 2004

Methodology and Computing in Applied Probability, Volume 6, pages 401–421 (2004)

Hilary S. Booth¹, Shevarl F. MacNamara¹, Ole M. Nielsen² & Susan R. Wilson¹

Abstract

This paper concerns the longest common subsequence (LCS) shared by two sequences (or strings) of length N, whose elements are chosen at random from a finite alphabet. The exact distribution and the expected value of the length of the LCS, k say, of two random sequences remain open problems in applied probability. While the expected value E(N) of the length of the LCS of two random strings is known to lie within certain limits, the exact value of E(N) and the exact distribution are unknown. In this paper, we calculate the length of the LCS for all possible pairs of binary sequences of length N = 1 to 14. The length of the LCS and the Hamming distance are represented in color on two all-against-all arrays. An iterative approach is then introduced in which we determine the pairs of sequences whose LCS length increases by one upon the addition of one letter to each sequence. The pairs whose score increases are shown in black and white on an array, which has an interesting fractal-like structure. As the sequence length increases, R(N) (the proportion of sequence pairs whose score increased) approaches the Chvátal–Sankoff constant a_c (the proportionality constant for the linear growth of the expected length of the LCS with sequence length). We show that R(N) converges to a_c more rapidly than E(N)/N does.
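The quantities named in the abstract can be illustrated with a small brute-force sketch. This is not the authors' code: the function names (`lcs_length`, `hamming`, `mean_lcs`, `increase_ratio`) are hypothetical, and the reading of R(N) used here (the fraction of all pairs of length-N sequences whose LCS is one longer than the LCS of their length N-1 prefixes) is an assumption, since the abstract does not give the precise definition. The LCS length itself is computed with the standard dynamic-programming recurrence.

```python
from itertools import product


def lcs_length(a, b):
    """Length of the longest common subsequence (standard row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_lcs(n, alphabet="01"):
    """E(N): mean LCS length over all ordered pairs of length-n sequences."""
    seqs = list(product(alphabet, repeat=n))
    total = sum(lcs_length(a, b) for a in seqs for b in seqs)
    return total / len(seqs) ** 2


def increase_ratio(n, alphabet="01"):
    """Assumed reading of R(N): fraction of pairs of length-n sequences whose
    LCS is exactly one longer than the LCS of their length-(n-1) prefixes."""
    seqs = list(product(alphabet, repeat=n))
    grew = sum(
        lcs_length(a, b) == lcs_length(a[:-1], b[:-1]) + 1
        for a in seqs
        for b in seqs
    )
    return grew / len(seqs) ** 2


if __name__ == "__main__":
    # Exhaustive enumeration costs 4**n pairs for a binary alphabet,
    # so keep n small when running this sketch.
    for n in range(1, 7):
        print(n, mean_lcs(n) / n, increase_ratio(n))
```

Under this reading, appending one letter to each sequence raises the LCS length by either zero or one, so `increase_ratio(n)` equals `mean_lcs(n) - mean_lcs(n - 1)`. In other words, the sketch compares the increment of E(N) with the running average E(N)/N.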


References

K. S. Alexander, “The rate of convergence of the mean length of the longest common subsequence,” Ann. Appl. Probab. vol. 4 pp. 1074–1082, 1994.
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” J. Mol. Biol. vol. 215 pp. 403–410, 1990.
S. F. Altschul, R. Bundschuh, R. Olsen, and T. Hwa, “The estimation of statistical parameters for local alignment score distributions,” Nucl. Acids Res. vol. 29 pp. 351–361, 2001.
R. Arratia and M. S. Waterman, “A phase transition for the score in matching random sequences allowing deletions,” Ann. Appl. Probab. vol. 4 pp. 200–225, 1994.
R. Bundschuh and T. Hwa, “An analytic study of the phase transition line in local sequence alignment with gaps,” Disc. Appl. Math. vol. 104 pp. 113–142, 2000.
R. Bundschuh, “High precision simulations of the longest common subsequence,” Eur. Phys. J. B vol. 22 pp. 533–541, 2001a.
R. Bundschuh, “Rapid significance estimation in local sequence alignment with gaps,” In Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, edited by T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner, and M. Waterman, ACM Press: New York, NY, pp. 77–85, 2001b.
R. Bundschuh, “Asymmetric exclusion process and extremal statistics of random sequences,” Phys. Rev. E vol. 65 p. 031911, 2002.
V. Chvátal and D. Sankoff, “Longest common subsequences of two random sequences,” J. Appl. Probab. vol. 12 pp. 306–315, 1975.
T. Jiang and M. Li, “On the approximation of shortest common supersequences and longest common subsequences,” SIAM J. Comput. vol. 24(5) pp. 1122–1139, 1995.
S. Karlin and S. F. Altschul, “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes,” Proc. Natl. Acad. Sci. USA vol. 87 pp. 2264–2268, 1990.
R. Mott, “Accurate formula for P-values of gapped local sequence and profile alignments,” J. Mol. Biol. vol. 300 pp. 649–659, 2000.
S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Mol. Biol. vol. 48 pp. 443–453, 1970.
R. Olsen, R. Bundschuh, and T. Hwa, “Rapid assessment of extremal statistics for gapped local alignment,” In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, edited by T. Lengauer, R. Schneider, P. Bork, D. Brutlag, J. Glasgow, H. W. Mewes, and R. Zimmer, AAAI Press: Menlo Park, CA, pp. 211–222, 1999.
D. Sankoff, “Matching sequences under deletion/insertion constraints,” Proc. Natl. Acad. Sci. USA vol. 69 pp. 4–6, 1972.
D. Sankoff and R. J. Cedergren, “A test for nucleotide sequence homology,” J. Mol. Biol. vol. 77 pp. 159–164, 1973.
A. A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul, “Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements,” Nucl. Acids Res. vol. 29 pp. 2994–3005, 2001.
T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” J. Mol. Biol. vol. 147 pp. 195–197, 1981.
M. S. Waterman and M. Vingron, “Sequence comparison significance and Poisson approximation,” Statistical Science vol. 9 pp. 367–381, 1994.
M. S. Waterman, Introduction to Computational Biology, Chapman & Hall: London, 1995.

Author information

Authors and Affiliations

Centre for Bioinformation Science, Mathematical Sciences Institute, John Curtin School of Medical Research, Australian National University, ACT 0200, Australia
Hilary S. Booth, Shevarl F. MacNamara & Susan R. Wilson
Australian Partnership for Advanced Computing, Mathematical Sciences Institute, Australian National University, ACT 0200, Australia
Ole M. Nielsen

About this article

Cite this article

Booth, H.S., MacNamara, S.F., Nielsen, O.M. et al. An Iterative Approach to Determining the Length of the Longest Common Subsequence of Two Strings. Methodology and Computing in Applied Probability 6, 401–421 (2004). https://doi.org/10.1023/B:MCAP.0000045088.88240.3a

Issue Date: December 2004
DOI: https://doi.org/10.1023/B:MCAP.0000045088.88240.3a
