Statistical Identification of Uniformly Mutated Segments within Repeats

  • Conference paper
Combinatorial Pattern Matching (CPM 2002)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2373))

Abstract

Given a long string of characters from a constant size (w.l.o.g. binary) alphabet we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible k-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the k coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a statistical test procedure which, for any given S, determines whether the a posteriori probability for k = 1 is higher than for any other k > 1. Our algorithm runs in time O(l^4 log l), where l is the length of S, through a dynamic programming approach which exploits the convexity of the a posteriori probability for k.
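The k-coin model described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's test procedure: the function and parameter names (generate_k_coin_string, biases, switch_prob, best_single_change_gain) are hypothetical, the random walk is assumed to jump uniformly to one of the other coins, and the second function merely compares the maximised likelihood of one coin against the best split into two segments rather than computing the a posteriori probability for each k.

```python
import math
import random


def generate_k_coin_string(biases, switch_prob, length, seed=0):
    """Sample a binary string from a k-coin model (illustrative only).

    biases: probability of a 1 for each coin (hypothetical parameter).
    switch_prob: probability of changing coin before a toss; assumed
        small, so the same coin is usually reused.
    """
    rng = random.Random(seed)
    coin = rng.randrange(len(biases))
    bits = []
    for _ in range(length):
        if len(biases) > 1 and rng.random() < switch_prob:
            # Assumed random-walk step: jump uniformly to another coin.
            others = [c for c in range(len(biases)) if c != coin]
            coin = rng.choice(others)
        bits.append(1 if rng.random() < biases[coin] else 0)
    return bits


def bernoulli_log_likelihood(ones, n):
    """Maximised log-likelihood of n tosses containing `ones` 1s, one coin."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def best_single_change_gain(bits):
    """Largest log-likelihood gain from splitting bits into two segments.

    Returns (gain, position); position is None if no split improves on
    the single-coin fit.
    """
    n = len(bits)
    total = sum(bits)
    base = bernoulli_log_likelihood(total, n)
    best_gain, best_pos, left = 0.0, None, 0
    for i in range(1, n):
        left += bits[i - 1]  # number of 1s in bits[:i]
        split = (bernoulli_log_likelihood(left, i)
                 + bernoulli_log_likelihood(total - left, n - i))
        if split - base > best_gain:
            best_gain, best_pos = split - base, i
    return best_gain, best_pos
```

Because the two-segment fit nests the single-coin fit, the raw gain is never negative; a real test has to weigh that gain against the extra model complexity, which is the role the a posteriori probability plays in the abstract.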

The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e. that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as non-functional ones (e.g. introns). Functional regions with critical importance exhibit much lower mutation rates than non-functional DNA (or DNA

Supported in part by an NSF Career Award and by Charles B. Wang Foundation.

Partially supported by the IST Programme of the EU under contract number IST-1999-14186 (ALCOM-FT).




Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ṣahinalp, S.C., Eichler, E., Goldberg, P., Berenbrink, P., Friedetzky, T., Ergun, F. (2002). Statistical Identification of Uniformly Mutated Segments within Repeats. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_21

  • DOI: https://doi.org/10.1007/3-540-45452-7_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43862-5

  • Online ISBN: 978-3-540-45452-6

  • eBook Packages: Springer Book Archive
