Annals of the Institute of Statistical Mathematics

, Volume 69, Issue 1, pp 231–248

Faster exact distributions of pattern statistics through sequential elimination of states

Article

Abstract

When using an auxiliary Markov chain (AMC) to compute sampling distributions, the computational complexity is directly related to the number of Markov chain states. For certain complex pattern statistics, minimal deterministic finite automata (DFA) have been used to facilitate efficient computation by reducing the number of AMC states. For example, when statistics of overlapping pattern occurrences are counted differently than non-overlapping occurrences, a DFA consisting of prefixes of patterns extended to overlapping occurrences has been generated and then minimized to form an AMC. However, there are situations where forming such a DFA is computationally expensive, e.g., with computing the distribution of spaced seed coverage. In dealing with this problem, we develop a method to obtain a small set of states during the state generation process without forming a DFA, and show that a huge reduction in the size of the AMC can be attained.

Keywords

Active proper suffix Auxiliary Markov chain Computational efficiency Extended seed patterns Minimal deterministic finite automaton Overlapping pattern occurrences  Seeded alignments Spaced seed coverage 

References

  1. Aho, A. V., Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340.Google Scholar
  2. Aston, J. A. D., Martin, D. E. K. (2007). Distributions associated with general runs and patterns in hidden Markov models. Annals of Applied Statistics, 1(2), 585–611.Google Scholar
  3. Balakrishnan, N., Koutras, M. V. (2002). Runs and scans with applications. New York: Wiley.Google Scholar
  4. Bassino, F., Clement, J., Fayolle, J., Nicodème, P. (2008). Constructions for clumps statistics. Discrete Mathematics and Theoretical Computer Science (DMTCS), AI, 179–194.Google Scholar
  5. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573–580.CrossRefGoogle Scholar
  6. Benson, G., Mak, D. Y. F. (2009). Exact distribution of a spaced seed statistic for DNA homology detection. String processing and information retrieval, Lecture Notes in Computer Science, 5280, 282–293.Google Scholar
  7. Buhler, J., Keich, U., Sun, Y. (2005). Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, 70, 342–363.Google Scholar
  8. Ebneshahrashoob, M., Gao, T., Wu, M. (2005). An efficient algorithm for exact distribution of discrete scan statistic. Methodology and Computing in Applied Probability, 7(4), 459–471.Google Scholar
  9. Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials. Statistica Sinica, 6, 957–974.MathSciNetMATHGoogle Scholar
  10. Fu, J. C., Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach. Journal of the American Statistical Association, 89, 1050–1058.Google Scholar
  11. Fu, J. C., Lou, W. Y. (2003). Distribution theory of runs and patterns and its applications. Singapore: World Scientific Publishing Co.Google Scholar
  12. Fu, J. C., Lou, W. Y. W., Bai, Z.-D., Li, G. (2002). The exact and limiting distributions of the number of successes in success runs within a sequence of Markov-dependent two-state trials. Annals of the Institute of Statistical Mathematics, 54(4), 719–730.Google Scholar
  13. Hopcroft, J. E. (1971). An \(n\) log \(n\) algorithm for minimizing states in a finite automaton. In Z. Kohavi & A. Paz (Eds.), Theory of Machines and Computations (pp. 189–196). New York: Academic Press.CrossRefGoogle Scholar
  14. Hopcroft, J. E., Motwani, R., Ullman, J. D. (2001). Introduction to automata theory, languages, and computation. New York: Addison-Wesley.Google Scholar
  15. Keich, U., Li, M., Ma, B., Tromp, J. (2004). On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3), 253–263.Google Scholar
  16. Koutras, M. V., Alexandrou, V. A. (1995). Runs, scans and urn models: A unified Markov chain approach. Annals of the Institute of Statistical Mathematics, 47, 743–766.Google Scholar
  17. Kucherov G., Noé, L., Roytberg, M. (2007). Subset seed automaton. In Implementation and application of automata, Lecture Notes in Computer Science, Volume 4783 (pp. 180–191).Google Scholar
  18. Ledent, S., Robin, S. (2005). Checking homogeneity of motifs’ distribution in heterogenous sequences. Journal of Computational Biology, 12, 672–685.Google Scholar
  19. Lladser, M., Betterton, M. D., Knight, R. (2008). Multiple pattern matching: a Markov chain approach. Journal of Mathematical Biology, 56(1–2), 51–92.Google Scholar
  20. Lou, W. Y. W. (2003). The exact distribution of the \(k\)-tuple statistic for sequence homology. Statistics and Probability Letters, 61, 51–59.MathSciNetCrossRefMATHGoogle Scholar
  21. Ma, B., Tromp, J., Li, M. (2002). Patternhunter-faster and more sensitive homology search. Bioinformatics, 18(3), 440–445.Google Scholar
  22. Marshall, T. and Rahmann, S. (2008). Probabilistic arithmetic automata and their application to pattern matching statistics. Lecture Notes In Computer Science; Vol. 5029, Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (pp. 95–106).Google Scholar
  23. Martin, D. E. K. (2006). The exact joint distribution of the sum of heads and apparent size statistics of a “tandem repeats finder” algorithm. Bulletin of Mathematical Biology, 68, 2353–2364.MathSciNetCrossRefMATHGoogle Scholar
  24. Martin, D. E. K. (2008). Application of auxiliary Markov chains to start-up demonstration tests. European Journal of Operational Research, 184(2), 574–583.CrossRefMATHGoogle Scholar
  25. Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In American Statistical (Ed.), Association 2013 Proceedings of the Section on Computational Statistics. Alexandria, VA: American Statistical Association.Google Scholar
  26. Martin, D. E. K. (2014). P-values for the discrete scan statistic through slack variables. Communications in Statistics, Simulation and Computation. doi:10.1080/03610918.2013.777457.
  27. Martin, D. E. K., Aston, J. A. D. (2008). Waiting time distribution of generalized later patterns. Computational Statistics and Data Analysis, 52, 4879–4890.Google Scholar
  28. Martin, D. E. K., Aston, J. A. D. (2013). Distributions of statistics of hidden state sequences through the sum-product algorithm. Methodology and Computing in Applied Probability, 15(4), 897–918.Google Scholar
  29. Martin, D. E. K., Coleman, D. A. (2011). Distributions of clump statistics for a collection of words. Journal of Applied Probability, 48, 1049–1059.Google Scholar
  30. Mealy, G. H. (1955). A method for synthesizing sequential circuits. Bell System Technical Journal, 34(5), 1045–1079.MathSciNetCrossRefGoogle Scholar
  31. Moore E.F. (1956) Gedanken-experiments on sequential machines. Automata Studies: Annals of Mathematical Studies, \(34\), 129–153. Princeton, N.J.: Princeton University Press.Google Scholar
  32. Noé, L., Kucherov, G. (2004). Improved hit criteria for DNA local alignment. BMC Bioinformatics, 5, 149.Google Scholar
  33. Noé, L., Martin, D. E. K. (2014). A coverage criterion for spaced seeds and its applications to SVM string kernels and \(k\)-mer distances. Journal of Computational Biology, 21(12), 947–963.Google Scholar
  34. Nuel, G. (2008). Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45(1), 226–243.MathSciNetCrossRefMATHGoogle Scholar
  35. Parzen, E. (1962). Stochastic Processes. San Francisco: Holden-Day Inc.MATHGoogle Scholar
  36. Ribeca, P., Raineri, E. (2008). Faster exact Markovian probability functions for motif occurrences: a DFA-only approach. Bioinformatics, 24(24), 2839–2848.Google Scholar
  37. Robin, S., Rodolphe, F., Schbath, S. (2005). DNA, words and models. United Kingdom: Cambridge University Press.Google Scholar
  38. Tewari, A., Srivastava, U., Gupta, P. (2002). A parallel DFA minimization algorithm. In High performance computing Hi PC, Lecture Notes in Computer Science (Vol. 2552, pp. 34–40).Google Scholar

Copyright information

© The Institute of Statistical Mathematics, Tokyo 2015

Authors and Affiliations

  1. 1.Department of StatisticsNorth Carolina State UniversityRaleighUSA
  2. 2.CRIStAL (UMR 9189 Lille University/CNRS), INRIA Lille Nord-EuropeVilleneuve d’AscqFrance

Personalised recommendations