On the approximation of shortest common supersequences and longest common subsequences

Extended Abstract

  • Conference paper
Automata, Languages and Programming (ICALP 1994)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 820)

Abstract

Finding shortest common supersequences (SCS) and longest common subsequences (LCS) for a given set of sequences are two well-known NP-hard problems. They have important applications in many areas, including computational molecular biology (e.g., sequence alignment), data compression, planning, and text editing (e.g., the diff function in UNIX) [1, 6, 7, 8, 10, 17, 19, 22, 23, 24, 26, 27]. The question of approximating SCS and LCS was raised 15 years ago in [19]. Since then, much effort has been spent searching for such approximation algorithms, without success.
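To make the two problem statements concrete, the following is a minimal sketch of the underlying definitions: a sequence s is a subsequence of t if s can be obtained from t by deleting symbols, and a common supersequence (subsequence) must stand in that relation to every input. The function names are illustrative and do not come from the paper.

```python
from typing import Sequence


def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting zero or more symbols."""
    it = iter(t)
    # `ch in it` advances the iterator until ch is found, so order is respected.
    return all(ch in it for ch in s)


def is_common_supersequence(candidate: str, seqs: Sequence[str]) -> bool:
    """Every input sequence must be a subsequence of the candidate."""
    return all(is_subsequence(s, candidate) for s in seqs)


def is_common_subsequence(candidate: str, seqs: Sequence[str]) -> bool:
    """The candidate must be a subsequence of every input sequence."""
    return all(is_subsequence(candidate, s) for s in seqs)
```

Checking a candidate is easy in this way; the hardness discussed here lies in finding a shortest supersequence or a longest subsequence among all candidates.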

We will attack the question by proving: (i) SCS does not have a polynomial-time linear approximation algorithm, unless P = NP; (ii) There exists a constant δ > 0 such that, if SCS has a polynomial-time approximation algorithm with ratio log^δ n, where n is the number of input sequences, then NP is contained in DTIME(2^{polylog n}); (iii) There exists a constant δ > 0 such that, if LCS has a polynomial-time approximation algorithm with performance ratio n^δ, then P = NP. Item (iii) is straightforward using recent breakthrough results in [3]. However, items (i) and (ii) require new ideas and techniques.

In the second part of the paper, we introduce a new powerful method for analyzing average performance of algorithms. Despite our non-approximability results (for the worst case), we show that there is a simple greedy algorithm which produces a common supersequence (or a common subsequence) of length OPT + O(OPT^{0.707}) (or OPT − O(OPT^{1/2+ε}) for any ε > 0, resp.), on the average, where OPT denotes the optimal length.
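The abstract does not spell out the greedy algorithms. As an illustration only, the sketch below assumes a majority-merge style heuristic for supersequences (repeatedly emit the symbol that leads the most remaining sequences) and a single-symbol run for subsequences (the longest block a^m that every input contains). Both function names and the tie-breaking rule are assumptions made for this example, not details confirmed by the text above.

```python
from collections import Counter
from typing import List, Sequence


def majority_merge(seqs: Sequence[str]) -> str:
    """Greedy common supersequence (assumed heuristic, not taken from the abstract)."""
    remaining: List[str] = [s for s in seqs if s]
    out: List[str] = []
    while remaining:
        # Count how often each symbol is the first symbol of a remaining sequence.
        counts = Counter(s[0] for s in remaining)
        # Most frequent leading symbol; ties broken by symbol order for determinism.
        sym = min(counts, key=lambda c: (-counts[c], c))
        out.append(sym)
        # Strip the chosen symbol from every sequence that starts with it.
        remaining = [s[1:] if s[0] == sym else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(out)


def long_run(seqs: Sequence[str]) -> str:
    """Greedy common subsequence of the form a^m (assumed heuristic)."""
    if not seqs:
        return ""
    best = ""
    for sym in sorted(set(seqs[0])):
        # a^m is a common subsequence iff every sequence has at least m copies of a.
        run = min(s.count(sym) for s in seqs)
        if run > len(best):
            best = sym * run
    return best
```

For the inputs "abc", "bca", "cab", `majority_merge` returns "abcab", which contains each input as a subsequence. Every loop iteration removes at least one symbol from the inputs, so the output is never longer than the total input length.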

Incidentally, our analysis also provides tight upper and lower bounds on the expected LCS and SCS length for n random sequences, solving a generalization of another well-known open question on the expected LCS length for two random sequences [2, 5, 22].
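The open question mentioned here concerns the expected LCS length of two random sequences. A minimal sketch of how one might estimate it empirically is given below, using the standard dynamic program for the two-sequence case; the paper's result concerns n sequences, which this example does not cover. The function names, the binary alphabet, and the trial count are assumptions chosen for illustration.

```python
import random
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of a longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def mean_lcs_ratio(length: int, alphabet: str = "01",
                   trials: int = 50, seed: int = 0) -> float:
    """Monte Carlo estimate of E[LCS] / length for two uniform random sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(length)]
        b = [rng.choice(alphabet) for _ in range(length)]
        total += lcs_length(a, b)
    return total / (trials * length)
```

Calling `mean_lcs_ratio(200)` gives a rough empirical value of the ratio for binary sequences of length 200; larger `length` and `trials` values trade running time for a steadier estimate.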

Supported in part by NSERC Operating Grant OGP0046613.

Supported in part by the NSERC Operating Grant OGP0046506.

References

  1. A. Aho, J. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.

  2. K. Alexander. The rate of convergence of the mean length of the longest common subsequence. Manuscript, Univ. Southern Cal. 1992.

  3. S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof Verification and Hardness of Approximation Problems. Proc. 33rd IEEE Symp. Found. Comp. Sci., 1992, 14–23.

  4. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. Proc. 23rd ACM Symp. on Theory of Computing, 1991, 328–336. Also to appear in J. ACM.

  5. V. Chvátal and D. Sankoff. Longest common subsequences of two random sequences. J. Appl. Probab. 12(1975), 306–315.

  6. M.O. Dayhoff. Computer analysis of protein evolution. Scientific American 221:1(1969).

  7. D.E. Foulser. On random strings and sequence comparisons. Ph.D. Thesis, Stanford, 1986.

  8. D.E. Foulser, M. Li, and Q. Yang. Theory and algorithms for plan merging. Artificial Intelligence, 57(1992), 143–181.

  9. M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.

  10. C.C. Hayes. A model of planning for plan efficiency: Taking advantage of operator overlap. Proc. 11th IJCAI, Detroit, Michigan, 1989, 949–953.

  11. D.S. Hirschberg. The longest common subsequence problem. Ph.D. Thesis, Princeton, 1975.

  12. W.J. Hsu and M.W. Du. Computing a longest common subsequence for a set of strings. BIT 24(1984) 45–59.

  13. R.W. Irving and C.B. Fraser. Two algorithms for the longest common subsequence of three (or more) strings. Proc. Symp. on Combinatorial Pattern Matching, Tucson, 1992.

  14. D. Karger, R. Motwani, and G.D.S. Ramkumar. On approximating the longest path in a graph. Manuscript, Stanford, 1992.

  15. M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.

  16. M. Li and P.M.B. Vitányi. Combinatorial properties of finite sequences with high Kolmogorov complexity. To appear in Math. Syst. Theory.

  17. S.Y. Lu and K.S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst., Man, Cybern. Vol. SMC-8(5), 1978, 381–389.

  18. C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Proc. ACM STOC'93.

  19. D. Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(1978), 322–336.

  20. C.H. Papadimitriou and M. Yannakakis. Optimization, Approximation, and Complexity Classes. J. Comput. Syst. Sci. 43(1991), 425–440.

  21. K. Raiha and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science 16, 187–198, 1981.

  22. D. Sankoff and J. Kruskal (Eds.) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.

  23. T. Sellis. Multiple query optimization. ACM Trans. Database Systems, 13:1(1988), 23–52.

  24. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1981), 195–197.

  25. J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Ann. Stat. 14(1986) 753–758.

  26. J. Storer. Data compression: methods and theory. Computer Science Press, 1988.

  27. V.G. Timkovskii. Complexity of common subsequence and supersequence problems and related problems. English Translation from Kibernetika, 5(1989), 1–13.

  28. R.A. Wagner and M.J. Fischer. The string-to-string correction problem. J. ACM, 21:1(1974).

Editor information

Serge Abiteboul Eli Shamir

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jiang, T., Li, M. (1994). On the approximation of shortest common supersequences and longest common subsequences. In: Abiteboul, S., Shamir, E. (eds) Automata, Languages and Programming. ICALP 1994. Lecture Notes in Computer Science, vol 820. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58201-0_68

  • DOI: https://doi.org/10.1007/3-540-58201-0_68

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58201-4

  • Online ISBN: 978-3-540-48566-7

  • eBook Packages: Springer Book Archive