Statistical Identification of Uniformly Mutated Segments within Repeats

Ṣahinalp, S. Cenk; Eichler, Evan; Goldberg, Paul; Berenbrink, Petra; Friedetzky, Tom; Ergun, Funda

doi:10.1007/3-540-45452-7_21

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2373)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Given a long string of characters from a constant-size (w.l.o.g. binary) alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible k-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the k coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins, where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a statistical test procedure which, for any given S, determines whether the a posteriori probability for k = 1 is higher than for any other k > 1. Our algorithm runs in time O(l⁴ log l), where l is the length of S, through a dynamic programming approach which exploits the convexity of the a posteriori probability for k.
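
The k-coin model described above can be illustrated with a minimal Python sketch. It is not the authors' O(l⁴ log l) dynamic program and it does not compute a posteriori probabilities. As a simplified stand-in, it generates a bit string from a "sticky" k-coin source and then measures how much the maximized log-likelihood improves when the string is explained by two coins separated by one change point rather than by a single coin. The function names, the uniform choice of the next coin, and the parameter switch_prob are assumptions made for illustration only.

```python
import math
import random


def generate_k_coin(length, biases, switch_prob, seed=0):
    """Generate bits from a hypothetical k-coin model.

    biases[i] is the probability that coin i shows 1. Before each toss the
    current coin is swapped for a different one with probability switch_prob
    (assumed small), otherwise the same coin is tossed again.
    """
    rng = random.Random(seed)
    coin = rng.randrange(len(biases))
    bits = []
    for _ in range(length):
        if len(biases) > 1 and rng.random() < switch_prob:
            # Assumption: the next coin is chosen uniformly among the others.
            coin = rng.choice([c for c in range(len(biases)) if c != coin])
        bits.append(1 if rng.random() < biases[coin] else 0)
    return bits


def bernoulli_loglik(ones, n):
    """Maximized log-likelihood of n tosses containing `ones` 1s under one coin."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0  # the estimated bias is 0 or 1, so the likelihood is 1
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def best_single_split(bits):
    """Return (gain, position) for the best two-segment explanation of bits.

    gain is the log-likelihood improvement of two coins with one change point
    over a single coin; position is the index where the second segment starts,
    or None if no split improves on the single-coin model.
    """
    n = len(bits)
    total = sum(bits)
    base = bernoulli_loglik(total, n)
    best_gain, best_pos = 0.0, None
    left = 0
    for i in range(1, n):
        left += bits[i - 1]  # number of 1s in bits[:i]
        gain = (bernoulli_loglik(left, i)
                + bernoulli_loglik(total - left, n - i)
                - base)
        if gain > best_gain:
            best_gain, best_pos = gain, i
    return best_gain, best_pos


if __name__ == "__main__":
    s = generate_k_coin(400, biases=[0.1, 0.6], switch_prob=0.005, seed=1)
    gain, pos = best_single_split(s)
    print("log-likelihood gain:", round(gain, 2), "at position", pos)
```

A raw likelihood gain is never negative, so on its own it always favors the richer model. A practical decision rule would have to penalize the extra parameters, for example with a prior over k as in the a posteriori comparison that the abstract describes.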

The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e. that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as non-functional ones (e.g. introns). Functional regions of critical importance exhibit much lower mutation rates than non-functional DNA (or DNA

Supported in part by an NSF Career Award and by Charles B. Wang Foundation.

Partially supported by the IST Programme of the EU under contract number IST-1999-14186 (ALCOM-FT).

Author information

Authors and Affiliations

Dept of EECS, Dept of Genetics and Center for Computational Genomics, CWRU, USA
S. Cenk Ṣahinalp
Dept of Genetics and Center for Computational Genomics, CWRU, USA
Evan Eichler
Dept of Computer Science, University of Warwick, UK
Paul Goldberg
School of Computing, Simon Fraser University, Canada
Petra Berenbrink
Pacific Institute of Mathematics, Simon Fraser University, Canada
Tom Friedetzky
NEC Research Institute and Dept of EECS, CWRU, USA
Funda Ergun

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, University of Padova, Via Gradenigo 6/A, 35131, Padova, Italy
Alberto Apostolico
Department of Informatics, Kyushu University, Fukuoka 812-8581, Japan
Masayuki Takeda

Cite this paper

Ṣahinalp, S.C., Eichler, E., Goldberg, P., Berenbrink, P., Friedetzky, T., Ergun, F. (2002). Statistical Identification of Uniformly Mutated Segments within Repeats. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_21

DOI: https://doi.org/10.1007/3-540-45452-7_21
Published: 21 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive
