On the approximation of shortest common supersequences and longest common subsequences

Jiang, Tao; Li, Ming

doi:10.1007/3-540-58201-0_68

Tao Jiang¹ & Ming Li²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 820))

Included in the following conference series:

International Colloquium on Automata, Languages, and Programming

Abstract

Finding shortest common supersequences (SCS) and longest common subsequences (LCS) for a given set of sequences are two well-known NP-hard problems. They have important applications in many areas, including computational molecular biology (e.g., sequence alignment), data compression, planning, and text editing (e.g., the diff function in UNIX) [1, 6, 7, 8, 10, 17, 19, 22, 23, 24, 26, 27]. The question of approximating SCS and LCS was raised 15 years ago in [19]. A lot of fruitless effort has been spent searching for such approximation algorithms.

We attack the question by proving: (i) SCS does not have a polynomial-time linear approximation algorithm, unless P = NP; (ii) there exists a constant δ>0 such that, if SCS has a polynomial-time approximation algorithm with ratio log^δ n, where n is the number of input sequences, then NP is contained in DTIME(2^{polylog n}); (iii) there exists a constant δ>0 such that, if LCS has a polynomial-time approximation algorithm with performance ratio n^δ, then P = NP. Item (iii) follows directly from the recent breakthrough results in [3]. However, items (i) and (ii) require new ideas and techniques.

In the second part of the paper, we introduce a powerful new method for analyzing the average performance of algorithms. Despite our non-approximability results (for the worst case), we show that there is a simple greedy algorithm which produces a common supersequence (or a common subsequence) of length OPT + O(OPT^{0.707}) (or OPT − O(OPT^{1/2+ε}) for any ε>0, resp.), on the average, where OPT denotes the optimal length.
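
The abstract does not spell out the greedy algorithm. As an illustration only, the sketch below uses the majority-merge heuristic for common supersequences: at each step it emits the symbol that appears most often at the front of the remaining sequences and strips that symbol from those fronts. The function names `majority_merge` and `is_subsequence`, and the tie-breaking rule, are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def majority_merge(sequences):
    """Greedy common supersequence (illustrative majority-merge sketch).

    Assumption: ties between equally frequent leading symbols are broken
    by choosing the smallest symbol, purely to keep the output deterministic.
    """
    positions = [0] * len(sequences)  # next unread index in each sequence
    out = []
    while True:
        # Count the current leading symbol of every unfinished sequence.
        heads = Counter(
            seq[pos] for seq, pos in zip(sequences, positions) if pos < len(seq)
        )
        if not heads:
            break  # every sequence has been fully consumed
        symbol = min(heads, key=lambda s: (-heads[s], s))
        out.append(symbol)
        # Advance exactly those sequences whose leading symbol was emitted.
        positions = [
            pos + 1 if pos < len(seq) and seq[pos] == symbol else pos
            for seq, pos in zip(sequences, positions)
        ]
    return "".join(out)


def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting symbols."""
    it = iter(long)
    return all(ch in it for ch in short)


if __name__ == "__main__":
    seqs = ["abc", "bca", "cab"]
    merged = majority_merge(seqs)
    print(merged, all(is_subsequence(s, merged) for s in seqs))
```

Every step consumes at least one symbol from some input, so the output is always a common supersequence of length at most the total input length. The sketch makes no claim about the average-case bounds stated above.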

Incidentally, our analysis also provides tight upper and lower bounds on the expected LCS and SCS length for n random sequences, solving a generalization of another well-known open question on the expected LCS length for two random sequences [2, 5, 22].

Supported in part by NSERC Operating Grant OGP0046613.

Supported in part by the NSERC Operating Grant OGP0046506.

References

[1] A. Aho, J. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[2] K. Alexander. The rate of convergence of the mean length of the longest common subsequence. Manuscript, Univ. Southern Cal., 1992.
[3] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof Verification and Hardness of Approximation Problems. Proc. 33rd IEEE Symp. Found. Comp. Sci., 1992, 14–23.
[4] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. Proc. 23rd ACM Symp. on Theory of Computing, 1991, 328–336. Also to appear in J. ACM.
[5] V. Chvátal and D. Sankoff. Longest common subsequences of two random sequences. J. Appl. Probab. 12(1975), 306–315.
[6] M.O. Dayhoff. Computer analysis of protein evolution. Scientific American 221:1(1969).
[7] D.E. Foulser. On random strings and sequence comparisons. Ph.D. Thesis, Stanford, 1986.
[8] D.E. Foulser, M. Li, and Q. Yang. Theory and algorithms for plan merging. Artificial Intelligence, 57(1992), 143–181.
[9] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.
[10] C.C. Hayes. A model of planning for plan efficiency: Taking advantage of operator overlap. Proc. 11th IJCAI, Detroit, Michigan, 1989, 949–953.
[11] D.S. Hirschberg. The longest common subsequence problem. Ph.D. Thesis, Princeton, 1975.
[12] W.J. Hsu and M.W. Du. Computing a longest common subsequence for a set of strings. BIT 24(1984), 45–59.
[13] R.W. Irving and C.B. Fraser. Two algorithms for the longest common subsequence of three (or more) strings. Proc. Symp. on Combinatorial Pattern Matching, Tucson, 1992.
[14] D. Karger, R. Motwani, and G.D.S. Ramkumar. On approximating the longest path in a graph. Manuscript, Stanford, 1992.
[15] M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.
[16] M. Li and P.M.B. Vitányi. Combinatorial properties of finite sequences with high Kolmogorov complexity. To appear in Math. Syst. Theory.
[17] S.Y. Lu and K.S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst., Man, Cybern. SMC-8(5), 1978, 381–389.
[18] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Proc. ACM STOC'93.
[19] D. Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(1978), 322–336.
[20] C.H. Papadimitriou and M. Yannakakis. Optimization, Approximation, and Complexity Classes. J. Comput. Syst. Sci. 43(1991), 425–440.
[21] K. Raiha and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science 16(1981), 187–198.
[22] D. Sankoff and J. Kruskal (Eds.). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.
[23] T. Sellis. Multiple query optimization. ACM Trans. Database Systems, 13:1(1988), 23–52.
[24] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1981), 195–197.
[25] J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Ann. Stat. 14(1986), 753–758.
[26] J. Storer. Data Compression: Methods and Theory. Computer Science Press, 1988.
[27] V.G. Timkovskii. Complexity of common subsequence and supersequence problems and related problems. English translation from Kibernetika, 5(1989), 1–13.
[28] R.A. Wagner and M.J. Fischer. The string-to-string correction problem. J. ACM, 21:1(1974).

Author information

Authors and Affiliations

Department of Computer Science, McMaster University, L8S 4K1, Hamilton, Ont., Canada
Tao Jiang
Department of Computer Science, University of Waterloo, N3L 3G1, Waterloo, Ont., Canada
Ming Li

Editor information

Serge Abiteboul Eli Shamir

About this paper

Cite this paper

Jiang, T., Li, M. (1994). On the approximation of shortest common supersequences and longest common subsequences. In: Abiteboul, S., Shamir, E. (eds) Automata, Languages and Programming. ICALP 1994. Lecture Notes in Computer Science, vol 820. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58201-0_68

DOI: https://doi.org/10.1007/3-540-58201-0_68
Published: 29 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58201-4
Online ISBN: 978-3-540-48566-7
eBook Packages: Springer Book Archive
