Abstract
Computational genomics involves comparing sequences based on “similarity” to detect evolutionary and functional relationships. Until very recently, the available portions of the human genome sequence (and those of other species) were fairly short and sparse. Most sequencing effort focused on genes and other short units, and similarity between such sequences was measured by character-level differences. However, with the advent of whole-genome sequencing technology, there is an emerging consensus that a measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome.
In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result is a simple O(1)-factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor of Ω(log n) off from the optimal. Our algorithm runs in linear time and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to one another, with bounds that are tight up to constant factors.
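To make the idea of segment-level similarity concrete, the following is a minimal illustrative sketch and not the algorithm of this paper. It assumes a simplified model in which a target string is built only from copies of substrings of a source string plus single-character inserts, and it parses the target greedily from left to right. The function name `greedy_copy_cover` and the operation labels are hypothetical, and the naive substring search is quadratic or worse rather than linear-time.

```python
def greedy_copy_cover(source: str, target: str) -> list[tuple[str, str]]:
    """Parse `target` left to right into segments copied from `source`.

    Illustrative only: each step takes the longest prefix of the remaining
    target that occurs somewhere in `source`; a character that never occurs
    in `source` becomes a single-character insert.
    """
    ops: list[tuple[str, str]] = []
    i = 0
    while i < len(target):
        # Grow the candidate segment while it is still a substring of source.
        length = 0
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        if length == 0:
            # The character target[i] does not occur in source at all.
            ops.append(("insert", target[i]))
            i += 1
        else:
            ops.append(("copy", target[i:i + length]))
            i += length
    return ops


# Two halves of the source swapped: many character-level differences,
# but only two segment copies in this simplified model.
swapped = greedy_copy_cover("ACGTTGCA", "TGCAACGT")
mixed = greedy_copy_cover("AAAA", "AAXA")
```

In this toy model the number of operations in the returned list serves as a crude dissimilarity score: a rearrangement of large segments costs a few copies, whereas a character-level measure would charge for every shifted position.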
Funda Ergun’s research supported in part by NSF grant CCR 0311548. Muthukrishnan’s research supported in part by NSF EIA 0087022, NSF ITR 0220280 and NSF EIA 02-05116. Sahinalp’s research supported in part by NSF Career Award, Charles B. Wang foundation and a DIMACS visit.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ergun, F., Muthukrishnan, S., Sahinalp, S.C. (2003). Comparing Sequences with Segment Rearrangements. In: Pandya, P.K., Radhakrishnan, J. (eds) FST TCS 2003: Foundations of Software Technology and Theoretical Computer Science. FSTTCS 2003. Lecture Notes in Computer Science, vol 2914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24597-1_16
DOI: https://doi.org/10.1007/978-3-540-24597-1_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20680-4
Online ISBN: 978-3-540-24597-1
eBook Packages: Springer Book Archive