Abstract
Finding shortest common supersequences (SCS) and longest common subsequences (LCS) for a given set of sequences are two well-known NP-hard problems. They have important applications in many areas, including computational molecular biology (e.g., sequence alignment), data compression, planning, and text editing (e.g., the diff utility in UNIX) [1, 6, 7, 8, 10, 17, 19, 22, 23, 24, 26, 27]. The question of approximating SCS and LCS was raised 15 years ago in [19]. A lot of fruitless effort has been spent in searching for such approximation algorithms.
We will attack the question by proving: (i) SCS does not have a polynomial-time linear approximation algorithm, unless P = NP; (ii) there exists a constant $\delta > 0$ such that, if SCS has a polynomial-time approximation algorithm with ratio $\log^{\delta} n$, where $n$ is the number of input sequences, then NP is contained in $\mathrm{DTIME}(2^{\mathrm{polylog}\, n})$; (iii) there exists a constant $\delta > 0$ such that, if LCS has a polynomial-time approximation algorithm with performance ratio $n^{\delta}$, then P = NP. Item (iii) is straightforward using recent breakthrough results in [3]. However, items (i) and (ii) require new ideas and techniques.
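The abstract does not define "ratio", so the following is only one standard way to read items (i) to (iii), stated as an assumption rather than as the authors' definition. Here $A$ is a hypothetical polynomial-time approximation algorithm, $S$ is an input set of sequences, $|A(S)|$ is the length of the sequence it outputs, and $r$ is the claimed ratio; none of these symbols appear in the original text.

```latex
% Assumed standard definitions; A, S and r are illustrative symbols.
% SCS is a minimization problem, so the output may be longer than optimal:
\[
  \frac{|A(S)|}{\mathrm{OPT}(S)} \le r \qquad \text{for every input } S .
\]
% LCS is a maximization problem, so the output may be shorter than optimal:
\[
  \frac{\mathrm{OPT}(S)}{|A(S)|} \le r \qquad \text{for every input } S .
\]
% Under this reading, (i) concerns constant r (output length O(OPT)),
% (ii) concerns r = \log^{\delta} n, and (iii) concerns r = n^{\delta}.
```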
In the second part of the paper, we introduce a new powerful method for analyzing the average performance of algorithms. Despite our non-approximability results (for the worst case), we show that there is a simple greedy algorithm which produces a common supersequence (or a common subsequence) of length $\mathrm{OPT} + O(\mathrm{OPT}^{0.707})$ (or $\mathrm{OPT} - O(\mathrm{OPT}^{1/2+\varepsilon})$ for any $\varepsilon > 0$, resp.), on average, where OPT denotes the optimal length.
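The abstract does not say which greedy algorithm is meant. As an illustration only, the sketch below implements a majority-merge style heuristic for common supersequences: at each step it emits the symbol that currently starts the largest number of unfinished sequences and advances those sequences. The function names `majority_merge` and `is_subsequence`, and the tie-breaking rule, are assumptions made for this example.

```python
from collections import Counter


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    # 'ch in it' consumes the iterator up to the first match, preserving order.
    return all(ch in it for ch in s)


def majority_merge(seqs):
    """Greedy common supersequence (illustrative sketch, not the paper's code).

    Repeatedly emit the symbol that is the current front symbol of the
    most sequences, then advance every sequence whose front matches it.
    """
    pos = [0] * len(seqs)  # next unread index in each sequence
    out = []
    while True:
        fronts = Counter(s[p] for s, p in zip(seqs, pos) if p < len(s))
        if not fronts:
            break  # every sequence has been fully consumed
        # Assumed tie-break: most frequent first, then smallest symbol.
        sym = min(fronts, key=lambda c: (-fronts[c], c))
        out.append(sym)
        pos = [p + 1 if p < len(s) and s[p] == sym else p
               for s, p in zip(seqs, pos)]
    return "".join(out)


if __name__ == "__main__":
    inputs = ["abc", "bca", "cab"]
    merged = majority_merge(inputs)
    print(merged, all(is_subsequence(s, merged) for s in inputs))
```

The output is always a common supersequence, because each emitted symbol advances at least one sequence and every sequence is consumed in order; nothing in this sketch guarantees optimal length.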
Incidentally, our analysis also provides tight upper and lower bounds on the expected LCS and SCS length for n random sequences, solving a generalization of another well-known open question on the expected LCS length for two random sequences [2, 5, 22].
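For the classical two-sequence case mentioned above, the expected LCS length can be explored empirically. The sketch below uses the textbook dynamic program for the LCS length of two sequences and averages it over random pairs; it is a simulation aid, not the analysis used in the paper. The names `lcs_length` and `mean_lcs_ratio` and the defaults (binary alphabet, 20 trials, fixed seed) are assumptions for this example.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous prefix of a
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def mean_lcs_ratio(length, alphabet="01", trials=20, seed=0):
    """Average of LCS(a, b) / length over random pairs of equal length."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(length)]
        b = [rng.choice(alphabet) for _ in range(length)]
        total += lcs_length(a, b)
    return total / (trials * length)


if __name__ == "__main__":
    for m in (50, 100, 200):
        print(m, round(mean_lcs_ratio(m), 3))
```

The dynamic program takes time proportional to the product of the two lengths, so this approach is practical only for two sequences of moderate length; it does not extend efficiently to n sequences.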
Supported in part by NSERC Operating Grant OGP0046613.
Supported in part by the NSERC Operating Grant OGP0046506.
References
[1] A. Aho, J. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[2] K. Alexander. The rate of convergence of the mean length of the longest common subsequence. Manuscript, Univ. Southern Cal. 1992.
[3] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof Verification and Hardness of Approximation Problems. Proc. 33rd IEEE Symp. Found. Comp. Sci., 1992, 14–23.
[4] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. Proc. 23rd ACM Symp. on Theory of Computing, 1991, 328–336. Also to appear in J. ACM.
[5] V. Chvátal and D. Sankoff. Longest common subsequences of two random sequences. J. Appl. Probab. 12(1975), 306–315.
[6] M.O. Dayhoff. Computer analysis of protein evolution. Scientific American 221:1(1969).
[7] D.E. Foulser. On random strings and sequence comparisons. Ph.D. Thesis, Stanford, 1986.
[8] D.E. Foulser, M. Li, and Q. Yang. Theory and algorithms for plan merging. Artificial Intelligence, 57(1992), 143–181.
[9] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.
[10] C.C. Hayes. A model of planning for plan efficiency: Taking advantage of operator overlap. Proc. 11th IJCAI, Detroit, Michigan. (1989), 949–953.
[11] D.S. Hirschberg. The longest common subsequence problem. Ph.D. Thesis, Princeton, 1975.
[12] W.J. Hsu and M.W. Du. Computing a longest common subsequence for a set of strings. BIT 24(1984) 45–59.
[13] R.W. Irving and C.B. Fraser. Two algorithms for the longest common subsequence of three (or more) strings. Proc. Symp. on Combinatorial Pattern Matching, Tucson, 1992.
[14] D. Karger, R. Motwani, and G.D.S. Ramkumar. On approximating the longest path in a graph. Manuscript, Stanford, 1992.
[15] M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.
[16] M. Li and P.M.B. Vitányi. Combinatorial properties of finite sequences with high Kolmogorov complexity. To appear in Math. Syst. Theory.
[17] S.Y. Lu and K.S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst., Man, Cybern. Vol. SMC-8(5), 1978, 381–389.
[18] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Proc. ACM STOC'93.
[19] D. Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(1978), 322–336.
[20] C.H. Papadimitriou and M. Yannakakis. Optimization, Approximation, and Complexity Classes. J. Comput. Syst. Sci. 43(1991), 425–440.
[21] K.-J. Räihä and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science 16, 187–198, 1981.
[22] D. Sankoff and J. Kruskal (Eds.) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA., 1983.
[23] T. Sellis. Multiple query optimization. ACM Trans. Database Systems, 13:1(1988), 23–52.
[24] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1981), 195–197.
[25] J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Ann. Stat. 14(1986) 753–758.
[26] J. Storer. Data compression: methods and theory. Computer Science Press, 1988.
[27] V.G. Timkovskii. Complexity of common subsequence and supersequence problems and related problems. English Translation from Kibernetika, 5(1989), 1–13.
[28] R.A. Wagner and M.J. Fischer. The string-to-string correction problem. J. ACM, 21:1(1974).
© 1994 Springer-Verlag Berlin Heidelberg
Cite this paper
Jiang, T., Li, M. (1994). On the approximation of shortest common supersequences and longest common subsequences. In: Abiteboul, S., Shamir, E. (eds) Automata, Languages and Programming. ICALP 1994. Lecture Notes in Computer Science, vol 820. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58201-0_68
DOI: https://doi.org/10.1007/3-540-58201-0_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-58201-4
Online ISBN: 978-3-540-48566-7
eBook Packages: Springer Book Archive