# Faster exact distributions of pattern statistics through sequential elimination of states

## Abstract

When using an auxiliary Markov chain (AMC) to compute sampling distributions, the computational complexity is directly related to the number of Markov chain states. For certain complex pattern statistics, minimal deterministic finite automata (DFA) have been used to facilitate efficient computation by reducing the number of AMC states. For example, when statistics of overlapping pattern occurrences are counted differently than non-overlapping occurrences, a DFA consisting of prefixes of patterns extended to overlapping occurrences has been generated and then minimized to form an AMC. However, there are situations where forming such a DFA is computationally expensive, e.g., with computing the distribution of spaced seed coverage. In dealing with this problem, we develop a method to obtain a small set of states during the state generation process without forming a DFA, and show that a huge reduction in the size of the AMC can be attained.

## Keywords

Active proper suffix Auxiliary Markov chain Computational efficiency Extended seed patterns Minimal deterministic finite automaton Overlapping pattern occurrences Seeded alignments Spaced seed coverage## Notes

### Acknowledgments

The authors greatly appreciate the comments of the associate editor and the referees, which proved to be extremely useful in improving the manuscript. D. E. K. Martin was supported in this research by the National Science Foundation under Grant DMS-1107084. Laurent Noé was supported by a CNRS Mastodons grant, and benefited from a half-time course buyout from the French Institute for Research in Computer Science and Automation (Inria).

## References

- Aho, A. V., Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search.
*Communications of the ACM*,*18*(6), 333–340.Google Scholar - Aston, J. A. D., Martin, D. E. K. (2007). Distributions associated with general runs and patterns in hidden Markov models.
*Annals of Applied Statistics*,*1*(2), 585–611.Google Scholar - Balakrishnan, N., Koutras, M. V. (2002).
*Runs and scans with applications*. New York: Wiley.Google Scholar - Bassino, F., Clement, J., Fayolle, J., Nicodème, P. (2008). Constructions for clumps statistics.
*Discrete Mathematics and Theoretical Computer Science*(DMTCS), AI, 179–194.Google Scholar - Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences.
*Nucleic Acids Research*,*27*, 573–580.CrossRefGoogle Scholar - Benson, G., Mak, D. Y. F. (2009). Exact distribution of a spaced seed statistic for DNA homology detection.
*String processing and information retrieval, Lecture Notes in Computer Science*,*5280*, 282–293.Google Scholar - Buhler, J., Keich, U., Sun, Y. (2005). Designing seeds for similarity search in genomic DNA.
*Journal of Computer and System Sciences*,*70*, 342–363.Google Scholar - Ebneshahrashoob, M., Gao, T., Wu, M. (2005). An efficient algorithm for exact distribution of discrete scan statistic.
*Methodology and Computing in Applied Probability*,*7*(4), 459–471.Google Scholar - Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials.
*Statistica Sinica*,*6*, 957–974.MathSciNetMATHGoogle Scholar - Fu, J. C., Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach.
*Journal of the American Statistical Association*,*89*, 1050–1058.Google Scholar - Fu, J. C., Lou, W. Y. (2003).
*Distribution theory of runs and patterns and its applications*. Singapore: World Scientific Publishing Co.Google Scholar - Fu, J. C., Lou, W. Y. W., Bai, Z.-D., Li, G. (2002). The exact and limiting distributions of the number of successes in success runs within a sequence of Markov-dependent two-state trials.
*Annals of the Institute of Statistical Mathematics*,*54*(4), 719–730.Google Scholar - Hopcroft, J. E. (1971). An \(n\) log \(n\) algorithm for minimizing states in a finite automaton. In Z. Kohavi & A. Paz (Eds.),
*Theory of Machines and Computations*(pp. 189–196). New York: Academic Press.CrossRefGoogle Scholar - Hopcroft, J. E., Motwani, R., Ullman, J. D. (2001).
*Introduction to automata theory, languages, and computation*. New York: Addison-Wesley.Google Scholar - Keich, U., Li, M., Ma, B., Tromp, J. (2004). On spaced seeds for similarity search.
*Discrete Applied Mathematics*,*138*(3), 253–263.Google Scholar - Koutras, M. V., Alexandrou, V. A. (1995). Runs, scans and urn models: A unified Markov chain approach.
*Annals of the Institute of Statistical Mathematics*,*47*, 743–766.Google Scholar - Kucherov G., Noé, L., Roytberg, M. (2007). Subset seed automaton. In
*Implementation and application of automata, Lecture Notes in Computer Science*,*Volume 4783*(pp. 180–191).Google Scholar - Ledent, S., Robin, S. (2005). Checking homogeneity of motifs’ distribution in heterogenous sequences.
*Journal of Computational Biology*,*12*, 672–685.Google Scholar - Lladser, M., Betterton, M. D., Knight, R. (2008). Multiple pattern matching: a Markov chain approach.
*Journal of Mathematical Biology*,*56*(1–2), 51–92.Google Scholar - Lou, W. Y. W. (2003). The exact distribution of the \(k\)-tuple statistic for sequence homology.
*Statistics and Probability Letters*,*61*, 51–59.MathSciNetCrossRefMATHGoogle Scholar - Ma, B., Tromp, J., Li, M. (2002). Patternhunter-faster and more sensitive homology search.
*Bioinformatics*,*18*(3), 440–445.Google Scholar - Marshall, T. and Rahmann, S. (2008). Probabilistic arithmetic automata and their application to pattern matching statistics.
*Lecture Notes In Computer Science*; Vol. 5029,*Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching*(pp. 95–106).Google Scholar - Martin, D. E. K. (2006). The exact joint distribution of the sum of heads and apparent size statistics of a “tandem repeats finder” algorithm.
*Bulletin of Mathematical Biology*,*68*, 2353–2364.MathSciNetCrossRefMATHGoogle Scholar - Martin, D. E. K. (2008). Application of auxiliary Markov chains to start-up demonstration tests.
*European Journal of Operational Research*,*184*(2), 574–583.CrossRefMATHGoogle Scholar - Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In American Statistical (Ed.),
*Association 2013 Proceedings of the Section on Computational Statistics*. Alexandria, VA: American Statistical Association.Google Scholar - Martin, D. E. K. (2014). P-values for the discrete scan statistic through slack variables.
*Communications in Statistics, Simulation and Computation*. doi: 10.1080/03610918.2013.777457. - Martin, D. E. K., Aston, J. A. D. (2008). Waiting time distribution of generalized later patterns.
*Computational Statistics and Data Analysis*,*52*, 4879–4890.Google Scholar - Martin, D. E. K., Aston, J. A. D. (2013). Distributions of statistics of hidden state sequences through the sum-product algorithm.
*Methodology and Computing in Applied Probability*,*15*(4), 897–918.Google Scholar - Martin, D. E. K., Coleman, D. A. (2011). Distributions of clump statistics for a collection of words.
*Journal of Applied Probability*,*48*, 1049–1059.Google Scholar - Mealy, G. H. (1955). A method for synthesizing sequential circuits.
*Bell System Technical Journal*,*34*(5), 1045–1079.MathSciNetCrossRefGoogle Scholar - Moore E.F. (1956) Gedanken-experiments on sequential machines.
*Automata Studies: Annals of Mathematical Studies*, \(34\), 129–153. Princeton, N.J.: Princeton University Press.Google Scholar - Noé, L., Kucherov, G. (2004). Improved hit criteria for DNA local alignment.
*BMC Bioinformatics*,*5*, 149.Google Scholar - Noé, L., Martin, D. E. K. (2014). A coverage criterion for spaced seeds and its applications to SVM string kernels and \(k\)-mer distances.
*Journal of Computational Biology*,*21*(12), 947–963.Google Scholar - Nuel, G. (2008). Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata.
*Journal of Applied Probability*,*45*(1), 226–243.MathSciNetCrossRefMATHGoogle Scholar - Parzen, E. (1962).
*Stochastic Processes*. San Francisco: Holden-Day Inc.MATHGoogle Scholar - Ribeca, P., Raineri, E. (2008). Faster exact Markovian probability functions for motif occurrences: a DFA-only approach.
*Bioinformatics*,*24*(24), 2839–2848.Google Scholar - Robin, S., Rodolphe, F., Schbath, S. (2005).
*DNA, words and models*. United Kingdom: Cambridge University Press.Google Scholar - Tewari, A., Srivastava, U., Gupta, P. (2002). A parallel DFA minimization algorithm. In High performance computing Hi PC,
*Lecture Notes in Computer Science*(Vol. 2552, pp. 34–40).Google Scholar