
A linear-time algorithm for finding approximate shortest common superstrings

Abstract

Approximate shortest common superstrings for a given set R of strings can be constructed by applying the greedy heuristics for finding a longest Hamiltonian path in the weighted graph that represents the pairwise overlaps between the strings in R. We develop an efficient implementation of this idea using a modified Aho-Corasick string-matching automaton. The resulting common superstring algorithm runs in time O(n) or in time O(n min(log m, log |Σ|)) depending on whether or not the goto transitions of the Aho-Corasick automaton can be implemented by direct indexing over the alphabet Σ. Here n is the total length of the strings in R and m is the number of such strings. The best previously known method requires time O(n log m) or O(n log n) depending on the availability of direct indexing.
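The greedy idea in the abstract can be illustrated with a deliberately naive sketch: repeatedly merge the two strings with the largest pairwise overlap until one string remains. This is not the linear-time method of the paper; it uses no Aho-Corasick automaton and recomputes overlaps by brute force, so its running time is polynomial rather than O(n). The function names `overlap` and `greedy_superstring` are hypothetical, chosen only for this illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Naive greedy merge (illustrative only, not the paper's algorithm)."""
    # Remove duplicates, then strings that occur inside another string:
    # they are covered automatically by any superstring of the rest.
    uniq = list(dict.fromkeys(strings))
    r = [s for s in uniq if not any(s != t and s in t for t in uniq)]

    while len(r) > 1:
        # Pick the ordered pair (i, j) with the largest overlap,
        # i.e. the heaviest remaining edge of the overlap graph.
        best = (-1, 0, 1)
        for i, a in enumerate(r):
            for j, b in enumerate(r):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        # Merge the pair, writing the shared part only once.
        merged = r[i] + r[j][k:]
        r = [s for idx, s in enumerate(r) if idx not in (i, j)] + [merged]

    return r[0] if r else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints "abcde"
```

Each merge corresponds to selecting one edge of the Hamiltonian path in the overlap graph; the automaton-based implementation described in the abstract obtains the same kind of selection without comparing every pair of strings explicitly.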



Additional information

This work was supported by the Academy of Finland.

Communicated by Robert Sedgewick.

About this article

Cite this article

Ukkonen, E. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica 5, 313–323 (1990). https://doi.org/10.1007/BF01840391


  • DOI: https://doi.org/10.1007/BF01840391

Key words

  • Shortest common superstring
  • Approximation algorithm
  • Linear-time algorithm
  • Greedy heuristics
  • Hamiltonian path