Hidden Pattern Statistics

  • Philippe Flajolet
  • Yves Guivarc’h
  • Wojciech Szpankowski
  • Brigitte Vallée
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2076)

Abstract

We consider the sequence comparison problem, also known as the “hidden pattern” problem, where one searches for a given subsequence in a text (rather than a string, understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. Spacings between letters of the pattern may be either constrained or unconstrained in the definition of a valid occurrence. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions and convergence of moments. The motivation to study this problem comes from an attempt to find a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
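As an illustration of the parameter studied here, the following minimal sketch counts occurrences of a pattern as a subsequence in the unconstrained case and compares a brute-force mean with the value given by linearity of expectation, C(n, m) times the product of the letter probabilities of w. It is not taken from the paper: the function names, the two-letter alphabet, and the probabilities are assumptions chosen for the example, and the constrained case and the variance are not covered.

```python
from itertools import product
from math import comb, prod


def count_subsequence_occurrences(text, pattern):
    """Count index tuples i1 < ... < im with text[ik] == pattern[k] (no gap constraints)."""
    # ways[j] = number of ways to match the first j pattern letters in the text read so far
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # Scan j from right to left so one text symbol never fills two pattern positions.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


def expected_occurrences(n, pattern, probs):
    """Mean count under a memoryless source: C(n, m) * product of p(w_k)."""
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


def brute_force_mean(n, pattern, probs):
    """Average the count over every text of length n, weighted by its probability."""
    total = 0.0
    for letters in product(probs, repeat=n):
        weight = prod(probs[c] for c in letters)
        total += weight * count_subsequence_occurrences(letters, pattern)
    return total


# Hypothetical memoryless source on a two-letter alphabet (illustrative values only).
probs = {"a": 0.3, "b": 0.7}
print(count_subsequence_occurrences("abab", "ab"))
print(expected_occurrences(6, "ab", probs), brute_force_mean(6, "ab", probs))
```

The brute-force average enumerates all texts of length n, so it is only practical for very small n; it serves here as a sanity check on the closed-form mean.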

Keywords

Intrusion Detection, Pattern Match, String Match, Longest Common Subsequence, Random Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Philippe Flajolet (1)
  • Yves Guivarc’h (2)
  • Wojciech Szpankowski (3)
  • Brigitte Vallée (4)
  1. Algorithms Project, INRIA-Rocquencourt, Le Chesnay, France
  2. IRMAR, Université de Rennes I, Rennes Cedex, France
  3. Dept. Computer Science, Purdue University, USA
  4. GREYC, Université de Caen, Caen Cedex, France
