Statistical Identification of Uniformly Mutated Segments within Repeats

  • S. Cenk Ṣahinalp
  • Evan Eichler
  • Paul Goldberg
  • Petra Berenbrink
  • Tom Friedetzky
  • Funda Ergun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2373)


Given a long string of characters from a constant size (w.l.o.g. binary) alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible k-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the k coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a statistical test procedure which, for any given S, determines whether the a posteriori probability for k = 1 is higher than for any other k > 1. Our algorithm runs in time O(l^4 log l), where l is the length of S, through a dynamic programming approach which exploits the convexity of the a posteriori probability for k.
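
To make the k-coin model concrete, the following is a minimal Python sketch, not the procedure of the paper. It samples a binary string from a k-coin model and then compares the best one-coin fit against the best fit with a single change point, using maximised log-likelihoods. The function names, the uniform choice of the next coin, and the single-change-point comparison are all assumptions made for illustration; the paper's test works with a posteriori probabilities over all k and uses a dynamic program that is not reproduced here.

```python
import math
import random


def generate_k_coin(biases, switch_prob, length, seed=0):
    """Sample bits from a hypothetical k-coin model.

    biases[i] is the probability that coin i shows 1. Before each toss the
    current coin is kept with probability 1 - switch_prob; otherwise another
    coin is picked uniformly at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    coin = rng.randrange(len(biases))
    bits = []
    for _ in range(length):
        if len(biases) > 1 and rng.random() < switch_prob:
            coin = rng.choice([c for c in range(len(biases)) if c != coin])
        bits.append(1 if rng.random() < biases[coin] else 0)
    return bits


def bernoulli_loglik(ones, n):
    """Maximised log-likelihood of n tosses containing `ones` ones under one coin."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0  # the fitted bias is 0 or 1, so the likelihood is 1
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def best_single_change(bits):
    """Return (gain, position) for the best split into two segments.

    gain is the increase in maximised log-likelihood over the one-coin fit;
    position is None when no split improves on a single coin.
    """
    n = len(bits)
    total = sum(bits)
    base = bernoulli_loglik(total, n)
    best_gain, best_pos = 0.0, None
    ones_left = 0
    for i in range(1, n):
        ones_left += bits[i - 1]
        split = (bernoulli_loglik(ones_left, i)
                 + bernoulli_loglik(total - ones_left, n - i))
        if split - base > best_gain:
            best_gain, best_pos = split - base, i
    return best_gain, best_pos
```

A large gain suggests that a single i.i.d. source explains the string poorly. Since a two-segment fit can never do worse than a one-coin fit, the gain would have to be weighed against a penalty or prior for the extra parameters before drawing a conclusion, which is the role the a posteriori probability plays in the abstract.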

The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e. that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as nonfunctional ones (e.g. introns). Functional regions with critical importance exhibit much lower mutation rates than non-functional DNA (or DNA


Keywords: Genome Segment; Posteriori Probability; Locality Sensitive Hash; Random Source; High Similarity Score
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • S. Cenk Ṣahinalp (1)
  • Evan Eichler (2)
  • Paul Goldberg (3)
  • Petra Berenbrink (4)
  • Tom Friedetzky (5)
  • Funda Ergun (6)

  1. Dept of EECS, Dept of Genetics and Center for Computational Genomics, CWRU, USA
  2. Dept of Genetics and Center for Computational Genomics, CWRU, USA
  3. Dept of Computer Science, University of Warwick, UK
  4. School of Computing, Simon Fraser University, Canada
  5. Pacific Institute of Mathematics, Simon Fraser University, Canada
  6. NEC Research Institute and Dept of EECS, CWRU, USA
