Abstract

Computational genomics involves comparing sequences based on “similarity” to detect evolutionary and functional relationships. Until very recently, the available portions of the human genome sequence (and those of other species) were fairly short and sparse. Most sequencing effort was focused on genes and other short units, and similarity between such sequences was measured by character-level differences. However, with the advent of whole genome sequencing technology, there is an emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome.

In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result is a simple O(1)-factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor of Ω(log n) off from the optimal. Our algorithm works in linear time and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to each other, with bounds that are tight up to constant factors.
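The abstract does not spell out the algorithm, so the following is only an illustrative sketch of what a one-pass, segment-based comparison could look like, not the authors' method. It assumes a hypothetical helper, `greedy_block_cover`, that scans the target left to right and greedily cuts it into the longest pieces that also occur somewhere in the source; the number of pieces then serves as a rough, rearrangement-tolerant dissimilarity score. This naive version relies on repeated substring search and is not linear time, and it carries no approximation guarantee.

```python
def greedy_block_cover(source: str, target: str) -> list[str]:
    """Cut `target`, left to right, into the longest pieces found in `source`.

    Hypothetical illustration only: a character of `target` that never
    occurs in `source` becomes a piece of length one.
    """
    pieces = []
    i = 0
    while i < len(target):
        length = 1
        # Extend the current piece while the longer candidate still
        # occurs somewhere in the source string.
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        pieces.append(target[i:i + length])
        i += length
    return pieces


# A target that is a rearrangement of two large source segments
# is covered by few pieces.
rearranged = greedy_block_cover("abcdef", "defabc")

# A target with a foreign character needs an extra single-character piece.
with_insert = greedy_block_cover("abcdef", "abxab")
```

Fewer pieces suggest that the target can be assembled from a small number of source segments, which is the intuition behind similarity measures that allow segment rearrangements rather than only character-level edits.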

Funda Ergun’s research supported in part by NSF grant CCR 0311548. Muthukrishnan’s research supported in part by NSF EIA 0087022, NSF ITR 0220280 and NSF EIA 02-05116. Sahinalp’s research supported in part by NSF Career Award, Charles B. Wang foundation and a DIMACS visit.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ergun, F., Muthukrishnan, S., Sahinalp, S.C. (2003). Comparing Sequences with Segment Rearrangements. In: Pandya, P.K., Radhakrishnan, J. (eds) FST TCS 2003: Foundations of Software Technology and Theoretical Computer Science. FSTTCS 2003. Lecture Notes in Computer Science, vol 2914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24597-1_16

  • DOI: https://doi.org/10.1007/978-3-540-24597-1_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20680-4

  • Online ISBN: 978-3-540-24597-1

  • eBook Packages: Springer Book Archive
