Machine Learning

, Volume 75, Issue 1, pp 91–127 | Cite as

Mining probabilistic automata: a statistical view of sequential pattern mining

  • Stéphanie Jacquemont
  • François Jacquenet
  • Marc Sebban


During the past decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently extract knowledge of users’ behavior from a huge set of sequences collected over time. This has applications in various domains such as purchases in supermarkets, Web site visits, etc. However, sequence mining algorithms do little to control the risks of extracting false discoveries or overlooking true knowledge. In this paper, the theoretical conditions to achieve a relevant sequence mining process are examined. Then, the article offers a statistical view of sequence mining which has the following advantages: First, it uses a compact and generalized representation of the original sequences in the form of a probabilistic automaton. Second, it integrates statistical constraints to guarantee the extraction of significant patterns. Finally, it provides an interesting solution in a privacy preserving context in order to respect individuals’ information. An application in car flow modeling is presented, showing the ability of our algorithm (acsm) to discover frequent routes without any private information. Comparisons with a classical sequence mining algorithm (spam) are made, showing the effectiveness of our approach.


Sequence mining Probabilistic automaton Privacy preserving data mining 


  1. Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th international conference on data engineering (pp. 3–14). Los Alamitos: IEEE Computer Society. Google Scholar
  2. Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the ACM SIGMOD conference on management of data (pp. 439–450). New York: ACM. CrossRefGoogle Scholar
  3. Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proceedings of the 8th international conference on knowledge discovery and data mining (pp. 429–435). New York: ACM. Google Scholar
  4. Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st international conference on data engineering (pp. 217–228). Los Alamitos: IEEE Computer Society. Google Scholar
  5. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300. MATHMathSciNetGoogle Scholar
  6. Borges, J., & Levene, M. (1998). Mining association rules in hypertext databases. In Proceedings of the 4th international conference on knowledge discovery and data mining (pp. 149–153). Google Scholar
  7. Borges, J., & Levene, M. (1999). Data mining of user navigation patterns. In WEBKDD ’99: revised papers from the international workshop on web usage analysis and user profiling (pp. 92–111). Berlin: Springer. Google Scholar
  8. Borges, J., & Levene, M. (2004). A dynamic clustering-based Markov model for web usage mining. In CoRR: the computing research repository. cs.IR/0406032, June 2004.
  9. Callut, J. (2007). First passage times dynamics in Markov models with applications to HMM: induction, sequence classification and graph mining. PhD thesis, Université Catholique de Louvain. Google Scholar
  10. Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In Proceedings of the 2nd international colloquium on grammatical inference (Vol. 862, pp. 139–152). Berlin: Springer. Google Scholar
  11. de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38(9), 1332–1348. CrossRefGoogle Scholar
  12. Dupont, P., Denis, F., & Esposito, Y. (2005). Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9), 1349–1371. MATHCrossRefGoogle Scholar
  13. Dupont, P., Callut, J., Dooms, G., Monette, J.-N., & Deville, Y. (2006). Relevant subgraph extraction from random walks in a graph (Technical Report 2006-2007). UCL/FSA/INGI, November 2006. Google Scholar
  14. Evfimievski, A. V., Srikant, R., Agrawal, R., & Gehrke, J. (2004). Privacy preserving mining of association rules. Information Systems, 29(4), 343–364. CrossRefGoogle Scholar
  15. Fisher, R. A. (1922). On the interpretation of chi-square from the contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87–94. CrossRefGoogle Scholar
  16. Garofalakis, M., Rastogi, R., & Shim, K. (2002). Mining sequential patterns with regular expression constraints. IEEE Transactions on Knowledge and Data Engineering, 14(3), 530–552. CrossRefGoogle Scholar
  17. Gionis, A., Mannila, H., Mielikainen, T., & Tsaparas, P. (2006). Assessing data mining results via swap randomization. In KDD ’06: proceedings of the 12th international conference on knowledge discovery and data mining (pp. 167–176). Google Scholar
  18. Gold, E. M. (1978). Complexity of automaton identification from given data. Information and Control, 37(3), 302–320. MATHCrossRefMathSciNetGoogle Scholar
  19. Han, J., Altman, R. B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54–58. CrossRefGoogle Scholar
  20. Hingston, P. (2002). Using finite state automata for sequence mining. In Proceedings of the 25th Australasian conference on computer science (pp. 105–110). Australian Computer Society. Google Scholar
  21. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30. MATHCrossRefMathSciNetGoogle Scholar
  22. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. MATHMathSciNetGoogle Scholar
  23. Klemettinen, M., Mannila, H., & Toivonen, H. (1999). Interactive exploration of interesting findings in the telecommunication network alarm sequence analyzer. Information & Software Technology, 41(9), 557–567. CrossRefGoogle Scholar
  24. Kosala, R., & Blockeel, H. (2000). Web mining research: a survey. SIGKDD Explorations, 2(1), 1–15. CrossRefGoogle Scholar
  25. Laur, P., Nock, R., Symphor, J., & Poncelet, P. (2007a). Mining evolving data streams for frequent patterns. Pattern Recognition, 40(2), 492–503. MATHCrossRefGoogle Scholar
  26. Laur, P., Symphor, J., Nock, R., & Poncelet, P. (2007b). Statistical supports for mining sequential patterns and improving the incremental update process on data streams. Intelligent Data Analysis, 11(1), 29–47. Google Scholar
  27. Mannila, H., Toivonen, H., & Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259–289. CrossRefGoogle Scholar
  28. Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. In Knowledge discovery and data mining (pp. 274–278). Google Scholar
  29. Newton, E. M., Sweeney, L., & Malin, B. (2005). Preserving privacy by de-identifying face images. IEEE Transactions on Knowledge and Data Engineering, 17(2), 232–243. CrossRefGoogle Scholar
  30. Pearson, K. (1900). On a criterion that a given system of deviations from the probable in the case of correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophy Magazine, 50, 157–175. Google Scholar
  31. Pei, J., Han, J., & Wang, W. (2002). Mining sequential patterns with constraints in large databases. In Proceedings of the 11th international conference on information and knowledge management (pp. 18–25). New York: ACM. Google Scholar
  32. Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863. CrossRefGoogle Scholar
  33. Shaffer, J. (1995). Multiple hypothesis-testing. Annual Review of Psychology, 46, 561–584. CrossRefGoogle Scholar
  34. Spiliopoulou, M., & Pohle, C. (2001). Data mining for measuring and improving the success of web sites. Data Mining and Knowledge Discovery, 5(1–2), 85–114. MATHCrossRefGoogle Scholar
  35. Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th international conference on extending database technology (Vol. 1057, pp. 3–17). Berlin: Springer. Google Scholar
  36. Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557–570. MATHCrossRefMathSciNetGoogle Scholar
  37. Valiant, L. G. (1984). A theory of the learnable. In Proceedings of the 16th annual ACM symposium on theory of computing (pp. 436–445). New York: ACM. Google Scholar
  38. Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1), 50–57. CrossRefGoogle Scholar
  39. Webb, G. I. (2007). Discovering significant patterns. Machine Learning, 68(1), 1–33. CrossRefGoogle Scholar
  40. Zaki, M. J. (2000). Sequence mining in categorical domains: incorporating constraints. In Proceedings of the 9th international conference on information and knowledge management (pp. 422–429). New York: ACM. Google Scholar
  41. Zaki, M. J. (2001). Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1–2), 31–60. MATHCrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Stéphanie Jacquemont
    • 1
  • François Jacquenet
    • 1
  • Marc Sebban
    • 1
  1. 1.Laboratoire Hubert CurienUMR 5516 Université Jean MonnetSaint-ÉtienneFrance

Personalised recommendations