Stochastic Analysis of Minimal Automata Growth for Generalized Strings

  • Ian G. Char
  • Manuel E. Lladser


Generalized strings describe various biological motifs that arise in molecular and computational biology. In this manuscript, we introduce an alternative but efficient algorithm to construct the minimal deterministic finite automaton (DFA) associated with any generalized string. We exploit this construction to characterize the typical growth of the minimal DFA (i.e., with the least number of states) associated with a random generalized string of increasing length. Even though the worst-case growth may be exponential, we characterize a point in the construction of the minimal DFA when it starts to grow linearly and conclude it has at most a polynomial number of states with asymptotically certain probability. We conjecture that this number is linear.
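As a rough illustration of the objects discussed above, the following Python sketch builds a DFA that accepts every text ending in a match of a generalized string, then counts the states of its minimal DFA. It assumes the usual definition of a generalized string as a finite sequence of non-empty sets of characters, which the abstract itself does not spell out. It is not the algorithm introduced in the manuscript: it uses a plain subset construction over matched prefix lengths followed by Moore-style partition refinement, and the names `build_dfa` and `minimize` are hypothetical.

```python
def build_dfa(gstring, alphabet):
    """Subset construction for texts ending in a match of `gstring`.

    `gstring` is a list of sets of characters (assumed definition of a
    generalized string). A state is the set of prefix lengths k such that
    the last k characters read match the first k sets; 0 is always present.
    Returns (number of states, transition dict, set of accepting states).
    """
    n = len(gstring)
    start = frozenset({0})
    index = {start: 0}   # state -> integer label
    order = [start]      # states in order of discovery
    delta = {}           # (label, character) -> label
    i = 0
    while i < len(order):
        state = order[i]
        for c in alphabet:
            nxt = frozenset({0} | {k + 1 for k in state
                                   if k < n and c in gstring[k]})
            if nxt not in index:
                index[nxt] = len(order)
                order.append(nxt)
            delta[(i, c)] = index[nxt]
        i += 1
    accepting = {index[s] for s in order if n in s}
    return len(order), delta, accepting


def minimize(num_states, delta, accepting, alphabet):
    """Moore-style partition refinement; returns the minimal state count.

    All states produced by `build_dfa` are reachable, so the number of
    blocks in the stable partition is the size of the minimal DFA.
    """
    block = [1 if q in accepting else 0 for q in range(num_states)]
    while True:
        signature = {}
        new_block = []
        for q in range(num_states):
            sig = (block[q],) + tuple(block[delta[(q, c)]] for c in alphabet)
            new_block.append(signature.setdefault(sig, len(signature)))
        if len(signature) == len(set(block)):
            return len(signature)
        block = new_block


# Hypothetical DNA motif: A, then C or G, then T.
motif = [{"A"}, {"C", "G"}, {"T"}]
size, delta, accepting = build_dfa(motif, "ACGT")
print(size, minimize(size, delta, accepting, "ACGT"))
```

The subset construction can produce exponentially many states in the length of the generalized string, which is the worst-case behaviour mentioned in the abstract; the sketch makes no attempt to avoid it.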


Keywords: Aho-Corasick algorithm · Deterministic finite automaton · Generalized string · Minimization · Motif · Polynomial growth

Mathematics Subject Classification (2010)

68Q25 68Q45 68Q87 68W40 





We are thankful to two anonymous referees for their careful reading of this paper and valuable suggestions. We are also very thankful to Dr. Dougherty for partially funding this research through her NSF EXTREEMS training grant.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Applied Mathematics, University of Colorado, Boulder, USA
