Comparing Sequences with Segment Rearrangements

Ergun, Funda; Muthukrishnan, S.; Sahinalp, S. Cenk

doi:10.1007/978-3-540-24597-1_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2914)

Included in the following conference series:

International Conference on Foundations of Software Technology and Theoretical Computer Science

Abstract

Computational genomics involves comparing sequences based on “similarity” to detect evolutionary and functional relationships. Until very recently, the available portions of the human genome sequence (and those of other species) were fairly short and sparse. Most sequencing effort was focused on genes and other short units, and similarity between such sequences was measured by character-level differences. However, with the advent of whole genome sequencing technology, there is an emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome.

In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result is a simple O(1)-factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor of Ω(log n) off from the optimum. Our algorithm runs in linear time and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to each other, with bounds that are tight up to constant factors.
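
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of what a segment-based similarity measure can look like, not the method of the paper. It assumes a simplified model in which a target string is built left to right from pieces, each piece being either a substring copied from a source string or a single inserted character, and it counts the pieces chosen greedily. The function names `greedy_copy_cover` and `copy_cover_cost` are hypothetical, and the naive substring search used here is not linear time.

```python
def greedy_copy_cover(source: str, target: str) -> list[tuple[str, str]]:
    """Parse `target` left to right into pieces (illustrative model only).

    Each piece is either ("copy", s), where s is the longest prefix of the
    unparsed part of `target` that occurs somewhere in `source`, or
    ("insert", c) for a single character c that does not occur in `source`.
    """
    ops: list[tuple[str, str]] = []
    i = 0
    while i < len(target):
        # Extend the current piece while it still occurs in source.
        # Naive search: simple to read, but not a linear-time method.
        length = 0
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        if length == 0:
            # The next character never appears in source: insert it.
            ops.append(("insert", target[i]))
            i += 1
        else:
            ops.append(("copy", target[i:i + length]))
            i += length
    return ops


def copy_cover_cost(source: str, target: str) -> int:
    """Number of pieces in the greedy parse; fewer pieces means more similar."""
    return len(greedy_copy_cover(source, target))
```

Under this model, swapping two large segments is cheap: `copy_cover_cost("abcdefgh", "efghabcd")` is 2, whereas a character-level measure would charge for many individual edits.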

Funda Ergun’s research was supported in part by NSF grant CCR 0311548. Muthukrishnan’s research was supported in part by NSF EIA 0087022, NSF ITR 0220280 and NSF EIA 02-05116. Sahinalp’s research was supported in part by an NSF Career Award, the Charles B. Wang Foundation and a DIMACS visit.

Author information

Authors and Affiliations

Department of EECS, CWRU
Funda Ergun
Rutgers Univ. and AT&T Research
S. Muthukrishnan
Depts of EECS, Genetics and Center for Comp. Genomics, CWRU
S. Cenk Sahinalp

Editor information

Editors and Affiliations

Tata Institute of Fundamental Research, India
Paritosh K. Pandya
Tata Institute of Fundamental Research, School of Technology and Computer Science, Mumbai, India
Jaikumar Radhakrishnan

About this paper

Cite this paper

Ergun, F., Muthukrishnan, S., Sahinalp, S.C. (2003). Comparing Sequences with Segment Rearrangements. In: Pandya, P.K., Radhakrishnan, J. (eds) FST TCS 2003: Foundations of Software Technology and Theoretical Computer Science. FSTTCS 2003. Lecture Notes in Computer Science, vol 2914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24597-1_16

DOI: https://doi.org/10.1007/978-3-540-24597-1_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20680-4
Online ISBN: 978-3-540-24597-1
eBook Packages: Springer Book Archive
