Skip to main content
Log in

Faster exact distributions of pattern statistics through sequential elimination of states

  • Published:
Annals of the Institute of Statistical Mathematics Aims and scope Submit manuscript

Abstract

When using an auxiliary Markov chain (AMC) to compute sampling distributions, the computational complexity is directly related to the number of Markov chain states. For certain complex pattern statistics, minimal deterministic finite automata (DFA) have been used to facilitate efficient computation by reducing the number of AMC states. For example, when statistics of overlapping pattern occurrences are counted differently than non-overlapping occurrences, a DFA consisting of prefixes of patterns extended to overlapping occurrences has been generated and then minimized to form an AMC. However, there are situations where forming such a DFA is computationally expensive, e.g., with computing the distribution of spaced seed coverage. In dealing with this problem, we develop a method to obtain a small set of states during the state generation process without forming a DFA, and show that a huge reduction in the size of the AMC can be attained.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  • Aho, A. V., Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340.

  • Aston, J. A. D., Martin, D. E. K. (2007). Distributions associated with general runs and patterns in hidden Markov models. Annals of Applied Statistics, 1(2), 585–611.

  • Balakrishnan, N., Koutras, M. V. (2002). Runs and scans with applications. New York: Wiley.

  • Bassino, F., Clement, J., Fayolle, J., Nicodème, P. (2008). Constructions for clumps statistics. Discrete Mathematics and Theoretical Computer Science (DMTCS), AI, 179–194.

  • Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573–580.

    Article  Google Scholar 

  • Benson, G., Mak, D. Y. F. (2009). Exact distribution of a spaced seed statistic for DNA homology detection. String processing and information retrieval, Lecture Notes in Computer Science, 5280, 282–293.

  • Buhler, J., Keich, U., Sun, Y. (2005). Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, 70, 342–363.

  • Ebneshahrashoob, M., Gao, T., Wu, M. (2005). An efficient algorithm for exact distribution of discrete scan statistic. Methodology and Computing in Applied Probability, 7(4), 459–471.

  • Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials. Statistica Sinica, 6, 957–974.

    MathSciNet  MATH  Google Scholar 

  • Fu, J. C., Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach. Journal of the American Statistical Association, 89, 1050–1058.

  • Fu, J. C., Lou, W. Y. (2003). Distribution theory of runs and patterns and its applications. Singapore: World Scientific Publishing Co.

  • Fu, J. C., Lou, W. Y. W., Bai, Z.-D., Li, G. (2002). The exact and limiting distributions of the number of successes in success runs within a sequence of Markov-dependent two-state trials. Annals of the Institute of Statistical Mathematics, 54(4), 719–730.

  • Hopcroft, J. E. (1971). An \(n\) log \(n\) algorithm for minimizing states in a finite automaton. In Z. Kohavi & A. Paz (Eds.), Theory of Machines and Computations (pp. 189–196). New York: Academic Press.

    Chapter  Google Scholar 

  • Hopcroft, J. E., Motwani, R., Ullman, J. D. (2001). Introduction to automata theory, languages, and computation. New York: Addison-Wesley.

  • Keich, U., Li, M., Ma, B., Tromp, J. (2004). On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3), 253–263.

  • Koutras, M. V., Alexandrou, V. A. (1995). Runs, scans and urn models: A unified Markov chain approach. Annals of the Institute of Statistical Mathematics, 47, 743–766.

  • Kucherov G., Noé, L., Roytberg, M. (2007). Subset seed automaton. In Implementation and application of automata, Lecture Notes in Computer Science, Volume 4783 (pp. 180–191).

  • Ledent, S., Robin, S. (2005). Checking homogeneity of motifs’ distribution in heterogenous sequences. Journal of Computational Biology, 12, 672–685.

  • Lladser, M., Betterton, M. D., Knight, R. (2008). Multiple pattern matching: a Markov chain approach. Journal of Mathematical Biology, 56(1–2), 51–92.

  • Lou, W. Y. W. (2003). The exact distribution of the \(k\)-tuple statistic for sequence homology. Statistics and Probability Letters, 61, 51–59.

    Article  MathSciNet  MATH  Google Scholar 

  • Ma, B., Tromp, J., Li, M. (2002). Patternhunter-faster and more sensitive homology search. Bioinformatics, 18(3), 440–445.

  • Marshall, T. and Rahmann, S. (2008). Probabilistic arithmetic automata and their application to pattern matching statistics. Lecture Notes In Computer Science; Vol. 5029, Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (pp. 95–106).

  • Martin, D. E. K. (2006). The exact joint distribution of the sum of heads and apparent size statistics of a “tandem repeats finder” algorithm. Bulletin of Mathematical Biology, 68, 2353–2364.

    Article  MathSciNet  MATH  Google Scholar 

  • Martin, D. E. K. (2008). Application of auxiliary Markov chains to start-up demonstration tests. European Journal of Operational Research, 184(2), 574–583.

    Article  MATH  Google Scholar 

  • Martin, D. E. K. (2013). Coverage of spaced seeds as a measure of clumping. In American Statistical (Ed.), Association 2013 Proceedings of the Section on Computational Statistics. Alexandria, VA: American Statistical Association.

    Google Scholar 

  • Martin, D. E. K. (2014). P-values for the discrete scan statistic through slack variables. Communications in Statistics, Simulation and Computation. doi:10.1080/03610918.2013.777457.

  • Martin, D. E. K., Aston, J. A. D. (2008). Waiting time distribution of generalized later patterns. Computational Statistics and Data Analysis, 52, 4879–4890.

  • Martin, D. E. K., Aston, J. A. D. (2013). Distributions of statistics of hidden state sequences through the sum-product algorithm. Methodology and Computing in Applied Probability, 15(4), 897–918.

  • Martin, D. E. K., Coleman, D. A. (2011). Distributions of clump statistics for a collection of words. Journal of Applied Probability, 48, 1049–1059.

  • Mealy, G. H. (1955). A method for synthesizing sequential circuits. Bell System Technical Journal, 34(5), 1045–1079.

    Article  MathSciNet  Google Scholar 

  • Moore E.F. (1956) Gedanken-experiments on sequential machines. Automata Studies: Annals of Mathematical Studies, \(34\), 129–153. Princeton, N.J.: Princeton University Press.

  • Noé, L., Kucherov, G. (2004). Improved hit criteria for DNA local alignment. BMC Bioinformatics, 5, 149.

  • Noé, L., Martin, D. E. K. (2014). A coverage criterion for spaced seeds and its applications to SVM string kernels and \(k\)-mer distances. Journal of Computational Biology, 21(12), 947–963.

  • Nuel, G. (2008). Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45(1), 226–243.

    Article  MathSciNet  MATH  Google Scholar 

  • Parzen, E. (1962). Stochastic Processes. San Francisco: Holden-Day Inc.

    MATH  Google Scholar 

  • Ribeca, P., Raineri, E. (2008). Faster exact Markovian probability functions for motif occurrences: a DFA-only approach. Bioinformatics, 24(24), 2839–2848.

  • Robin, S., Rodolphe, F., Schbath, S. (2005). DNA, words and models. United Kingdom: Cambridge University Press.

  • Tewari, A., Srivastava, U., Gupta, P. (2002). A parallel DFA minimization algorithm. In High performance computing Hi PC, Lecture Notes in Computer Science (Vol. 2552, pp. 34–40).

Download references

Acknowledgments

The authors greatly appreciate the comments of the associate editor and the referees, which proved to be extremely useful in improving the manuscript. D. E. K. Martin was supported in this research by the National Science Foundation under Grant DMS-1107084. Laurent Noé was supported by a CNRS Mastodons grant, and benefited from a half-time course buyout from the French Institute for Research in Computer Science and Automation (Inria).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Donald E. K. Martin.

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Martin, D.E.K., Noé, L. Faster exact distributions of pattern statistics through sequential elimination of states. Ann Inst Stat Math 69, 231–248 (2017). https://doi.org/10.1007/s10463-015-0540-y

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10463-015-0540-y

Keywords

Navigation