Abstract
Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C nSp , for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
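To make the definition of covered 1s concrete, the following is a minimal Python sketch. It is not the efficient exact method the paper presents: it is a brute-force enumeration over all 2^n binary strings, usable only for small n. The function names `covered_ones` and `brute_force_distribution` are hypothetical, windows are indexed by start position rather than end position h (which selects the same substrings), and a 1 in T is assumed to count once even when several overlapping seed occurrences cover it.

```python
from itertools import product


def covered_ones(seed, text):
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a string over {'0', '1'}.
    """
    k = len(seed)
    # Offsets inside the seed that demand a 1 in the text.
    required = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every required offset lines up with a 1.
        if all(text[start + i] == "1" for i in required):
            covered.update(start + i for i in required)
    return len(covered)


def brute_force_distribution(seed, n, p):
    """Return [P(covered = 0), ..., P(covered = n)] by full enumeration.

    Assumes each position of the length-n string is 1 independently
    with probability p. Cost grows as 2**n, so keep n small.
    """
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        ones = text.count("1")
        weight = p ** ones * (1 - p) ** (n - ones)
        dist[covered_ones(seed, text)] += weight
    return dist


if __name__ == "__main__":
    print(covered_ones("11*1", "1111"))
    print(brute_force_distribution("11*1", 4, 0.5))
```

With the seed 11*1 and T = 1111, the seed occurs once and covers three 1s; the 1 under the `*` is present in T but not covered. Enumeration like this can serve as a correctness check on small inputs for any faster implementation.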
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Benson, G., Mak, D.Y.F. (2008). Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_27
DOI: https://doi.org/10.1007/978-3-540-89097-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science, Computer Science (R0)