
A linear-time algorithm for finding approximate shortest common superstrings

Algorithmica

Abstract

Approximate shortest common superstrings for a given set R of strings can be constructed by applying the greedy heuristics for finding a longest Hamiltonian path in the weighted graph that represents the pairwise overlaps between the strings in R. We develop an efficient implementation of this idea using a modified Aho-Corasick string-matching automaton. The resulting common superstring algorithm runs in time O(n) or in time O(n min(log m, log |Σ|)), depending on whether or not the goto transitions of the Aho-Corasick automaton can be implemented by direct indexing over the alphabet Σ. Here n is the total length of the strings in R and m is the number of such strings. The best previously known method requires time O(n log m) or O(n log n), depending on the availability of direct indexing.
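The greedy idea in the abstract can be illustrated with a small sketch. This is a naive version that repeatedly merges the pair of strings with the largest overlap; it takes far more than linear time and does not use the modified Aho-Corasick automaton that the paper develops. The function names `overlap` and `greedy_superstring` are chosen here for illustration and do not come from the paper, and discarding strings contained in other strings is an assumed preprocessing step.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b,
    restricted to lengths shorter than both strings."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Naive greedy heuristic: merge the best-overlapping pair until
    a single string remains. Illustrative only, not the linear-time method."""
    # Assumed preprocessing: drop duplicates and strings that occur
    # inside another input string, since any superstring covers them anyway.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Scan all ordered pairs for the largest overlap (quadratic per round).
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, writing the overlapping part only once.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints "abcde"
```

Each merge corresponds to choosing one edge of the overlap graph, so the sequence of merges traces out a Hamiltonian path chosen greedily by edge weight. The result is a common superstring of the input, though not necessarily a shortest one.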




Additional information

Communicated by Robert Sedgewick.

This work was supported by the Academy of Finland.


About this article

Cite this article

Ukkonen, E. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica 5, 313–323 (1990). https://doi.org/10.1007/BF01840391


  • DOI: https://doi.org/10.1007/BF01840391
