Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection

  • Conference paper
String Processing and Information Retrieval (SPIRE 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Abstract

Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
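
To make the definition of covered 1s concrete, the following is a minimal brute-force sketch in Python. It is not the efficient method the paper presents: it enumerates all 2^n binary strings, so it is only usable for small n. The function names `covered_count` and `covered_distribution` are hypothetical, and the sketch assumes that only the 1s of T aligned with a 1 in S count as covered (a 1 aligned with a * is not counted unless another occurrence of S covers it).

```python
from itertools import product


def covered_count(seed, text):
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a sequence of 0/1 ints.
    """
    k = len(seed)
    one_offsets = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    # Slide a length-k window over the text. `start` is the 0-based index of
    # the window's first character, so the occurrence ends at start + k - 1.
    for start in range(len(text) - k + 1):
        if all(text[start + i] == 1 for i in one_offsets):
            covered.update(start + i for i in one_offsets)
    return len(covered)


def covered_distribution(seed, n, p):
    """Return [P(X = 0), ..., P(X = n)], X = number of covered 1s.

    Brute force over all 2**n strings (assumption: illustration only,
    exponential in n).
    """
    valid = bool(seed) and set(seed) <= {"1", "*"}
    if not (valid and seed[0] == "1" and seed[-1] == "1"):
        raise ValueError("seed must be over {1,*} and start and end with 1")
    dist = [0.0] * (n + 1)
    for text in product((0, 1), repeat=n):
        ones = sum(text)
        weight = p ** ones * (1 - p) ** (n - ones)  # iid Bernoulli(p)
        dist[covered_count(seed, text)] += weight
    return dist


if __name__ == "__main__":
    # Seed 11*1 occurs twice in 1101101 (ending at positions 4 and 7),
    # covering positions 1, 2, 4, 5 and 7.
    print(covered_count("11*1", (1, 1, 0, 1, 1, 0, 1)))
    print(covered_distribution("11*1", 4, 0.5))
```

With S = 11*1 and n = 4, the seed fits in exactly one place, so the count is 3 when positions 1, 2 and 4 of T are all 1 (probability p^3) and 0 otherwise.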

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Benson, G., Mak, D.Y.F. (2008). Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_27

  • DOI: https://doi.org/10.1007/978-3-540-89097-3_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89096-6

  • Online ISBN: 978-3-540-89097-3

  • eBook Packages: Computer Science, Computer Science (R0)
