Hidden Pattern Statistics
We consider the sequence comparison problem, also known as the “hidden pattern” problem, in which one searches for a given subsequence in a text (rather than a string, understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. To define valid occurrences, the spacings between letters of the pattern may be either constrained or unconstrained. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions and convergence of moments. The motivation for studying this problem comes from an attempt to find a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
Keywords: Intrusion Detection, Pattern Match, String Match, Longest Common Subsequence, Random Text
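To make the central parameter concrete, the following minimal sketch counts the occurrences of a pattern w as a subsequence of a text in the unconstrained case (no restriction on the spacings) and compares the exact mean over a memoryless source with the value binom(n, m) · P(w), where P(w) is the product of the letter probabilities of w. That closed form is a direct consequence of linearity of expectation and is not quoted from the paper. The function names and the example alphabet are hypothetical, chosen only for illustration; the constrained case and the variance are not covered here.

```python
from itertools import product
from math import comb, prod


def count_occurrences(text, pattern):
    """Count the occurrences of `pattern` as a subsequence of `text`
    (unconstrained spacings), by dynamic programming."""
    # ways[j] = number of ways to embed pattern[:j] in the text read so far
    ways = [1] + [0] * len(pattern)
    for ch in text:
        # Go right to left so each text letter extends a partial match at most once
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[-1]


def expected_occurrences(n, pattern, probs):
    """Mean number of occurrences in a random text of length n.

    Assumption: memoryless source with letter probabilities `probs`,
    unconstrained case. Each of the binom(n, m) position sets spells
    the pattern with probability P(w)."""
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


def exact_mean_by_enumeration(n, pattern, probs):
    """Brute-force check: weighted average over all texts of length n.
    Only feasible for very small n and alphabets."""
    total = 0.0
    for letters in product(probs, repeat=n):
        weight = prod(probs[c] for c in letters)
        total += weight * count_occurrences(letters, pattern)
    return total


if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.5}  # hypothetical two-letter source
    print(count_occurrences("abab", "ab"))
    print(expected_occurrences(4, "ab", probs))
    print(exact_mean_by_enumeration(4, "ab", probs))
```

The dynamic program runs in O(nm) time, whereas the enumeration is exponential in n and serves only as a sanity check on the closed form for tiny inputs.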