Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection

Benson, Gary; Mak, Denise Y. F.

doi:10.1007/978-3-540-89097-3_27

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Let a seed, S, be a string over the alphabet {1,*}, of arbitrary length k, that starts and ends with a 1; for example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where S contains a 1. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution of the number of 1s covered by a seed S in an iid Bernoulli string of length n in which each position is 1 with probability p. We refer to this new probability distribution as C_{n,S,p}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
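
To make the definition of covered 1s concrete, the following is a minimal brute-force sketch in Python. It is not the efficient method the abstract refers to: it simply enumerates all 2^n binary strings, so it is only practical for small n. The function names `covered_count` and `covered_distribution` are hypothetical, chosen for illustration, and occurrences are indexed here by their start position rather than their end position, which does not change the set of covered positions.

```python
from collections import defaultdict
from itertools import product


def covered_count(seed, text):
    """Number of positions in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a string over {'0', '1'}.
    """
    k = len(seed)
    ones = [i for i, c in enumerate(seed) if c == "1"]  # offsets of 1s in the seed
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every seed 1 lines up with a 1 in the text.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed, n, p):
    """Exact distribution of the covered count by exhaustive enumeration (small n only)."""
    dist = defaultdict(float)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        w = text.count("1")
        # Probability of this particular iid Bernoulli(p) string.
        dist[covered_count(seed, text)] += p ** w * (1 - p) ** (n - w)
    return dict(dist)


if __name__ == "__main__":
    print(covered_count("11*1", "1101"))
    print(covered_distribution("11*1", 4, 0.5))
```

Note that in the string 1111 the seed 11*1 occurs once and covers only three positions: the 1 aligned with the `*` is a match in the text but is not counted as covered.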

Author information

Authors and Affiliations

Departments of Computer Science, Biology, Program in Bioinformatics, Boston University, Boston, MA 02215
Gary Benson
Graduate Program in Bioinformatics, Boston University, Boston, MA 02215
Denise Y. F. Mak

Editor information

Editors and Affiliations

Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel
Amihood Amir
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Andrew Turpin
NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, Australia
Alistair Moffat

About this paper

Cite this paper

Benson, G., Mak, D.Y.F. (2008). Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_27

DOI: https://doi.org/10.1007/978-3-540-89097-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science, Computer Science (R0)
