Constraints, Volume 22, Issue 2, pp 265–306

Prefix-projection global constraint and top-k approach for sequential pattern mining

  • Amina Kemmar
  • Yahia Lebbah
  • Samir Loudni
  • Patrice Boizumault
  • Thierry Charnois

Abstract

Sequential pattern mining (SPM) is an important data mining problem with broad applications. SPM is a hard problem due to the huge number of intermediate subsequences to be considered. State-of-the-art approaches for SPM (e.g., PrefixSpan; Pei et al. 2001) are largely based on the pattern-growth approach, where for each frequent prefix subsequence only its related suffix subsequences need to be considered, and the database is recursively projected into smaller ones. Many authors have promoted the use of constraints to focus on the most promising patterns according to the interests of the end user. The top-k SPM problem is also used to cope with the difficulty of thresholding and to control the number of solutions. State-of-the-art methods developed for SPM and top-k SPM, though efficient, are locked into a rather rigid search strategy and suffer from a lack of declarativity and flexibility. Indeed, adding new constraints usually amounts to changing the data structures used in the core of the algorithm, and combining these new constraints often requires new developments. Recent works (e.g., Kemmar et al. 2014; Négrevergne and Guns 2015) have investigated the use of Constraint Programming (CP) for SPM. However, despite their nice declarative aspects, all these models have scaling problems due to the huge size of their constraint networks. To address this issue, we propose the Prefix-Projection global constraint, which encapsulates both the subsequence relation and the frequency constraint. Its filtering algorithm relies on the principle of projected databases, which makes it possible to keep in the variables' domains only those values leading to a frequent pattern in the database. The Prefix-Projection filtering algorithm enforces domain consistency on the variable succeeding the current frequent prefix in polynomial time.
This global constraint also allows for a straightforward implementation of additional constraints such as size, item membership, regular expressions, and any combination of them. Experimental results show that our approach clearly outperforms existing CP approaches and competes well with state-of-the-art methods on large datasets for mining frequent sequential patterns, sequential patterns under various constraints, and top-k sequential patterns. Unlike existing CP methods, our approach achieves better scalability.
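
The projected-database principle mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' filtering algorithm or their CP model: it only shows, on a toy database of item sequences, how projecting on a prefix restricts the candidate next items (the "domain" of the next position) to those that still yield a frequent pattern. The names `frequent_items`, `extend`, and `mine`, the toy database, and the pseudo-projection encoding as `(sequence index, offset)` pairs are all assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Sequence = List[str]
# Hypothetical encoding: a projected database is a list of
# (sequence index, offset at which the remaining suffix starts).
Projection = List[Tuple[int, int]]


def frequent_items(sdb: List[Sequence], proj: Projection, minsup: int) -> Dict[str, int]:
    """Return the items that occur in at least `minsup` suffixes of `proj`.

    These are the only values worth keeping for the position that
    follows the current prefix.
    """
    support: Counter = Counter()
    for sid, start in proj:
        # Count each item at most once per sequence.
        support.update(set(sdb[sid][start:]))
    return {item: s for item, s in support.items() if s >= minsup}


def extend(sdb: List[Sequence], proj: Projection, item: str) -> Projection:
    """Project further: keep the suffix after the first occurrence of `item`."""
    new_proj: Projection = []
    for sid, start in proj:
        seq = sdb[sid]
        for pos in range(start, len(seq)):
            if seq[pos] == item:
                new_proj.append((sid, pos + 1))
                break
    return new_proj


def mine(sdb: List[Sequence], minsup: int,
         prefix: Optional[Sequence] = None,
         proj: Optional[Projection] = None) -> List[Tuple[Sequence, int]]:
    """Enumerate frequent sequential patterns by pattern growth."""
    if prefix is None or proj is None:
        prefix, proj = [], [(sid, 0) for sid in range(len(sdb))]
    patterns: List[Tuple[Sequence, int]] = []
    domain = frequent_items(sdb, proj, minsup)
    for item in sorted(domain):
        pattern = prefix + [item]
        patterns.append((pattern, domain[item]))
        patterns.extend(mine(sdb, minsup, pattern, extend(sdb, proj, item)))
    return patterns


# Toy database (an assumption for this sketch): four sequences of items.
sdb = [list("ABCBC"), list("BABC"), list("AB"), list("BCD")]
result = mine(sdb, minsup=3)
for pattern, support in result:
    print("".join(pattern), support)
```

With a minimum support of 3, only `B` survives as a candidate after the prefix `A`, because `C` appears in just two of the three projected suffixes. Each call to `frequent_items` scans the projected suffixes once, which is in line with the polynomial-time filtering claimed in the abstract, although this sketch makes no attempt to reproduce the paper's actual complexity bounds.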

Keywords

Global constraints, Data mining, Sequential pattern mining, Prefix-Projection, Top-k

References

  1. Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Yu, P.S., & Chen, A.L.P. (Eds.) Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan. pp. 3–14. IEEE Computer Society. doi: 10.1109/ICDE.1995.380415.
  2. Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada. pp. 429–435. ACM. doi: 10.1145/775047.775109.
  3. Béchet, N., Cellier, P., Charnois, T., & Crémilleux, B. (2012). Sequential pattern mining to discover relations between genes and rare diseases. In CBMS.
  4. Beldiceanu, N., & Contejean, E. (1994). Introducing global constraints in CHIP. Journal of Mathematical and Computer Modelling, 20(12), 97–123.
  5. Cheung, Y., & Fu, A. W. (2004). Mining frequent itemsets without support threshold: With and without item constraints. IEEE Transactions on Knowledge and Data Engineering, 16(9), 1052–1069.
  6. Coquery, E., Jabbour, S., Saïs, L., & Salhi, Y. (2012). A SAT-based approach for discovering frequent, closed and maximal patterns in a sequence. In Raedt, L.D., Bessière, C., Dubois, D., Doherty, P., Frasconi, P., Heintz, F., & Lucas, P.J.F. (Eds.) ECAI 2012 - 20th European Conference on Artificial Intelligence. Including Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations Track, Montpellier, France, August 27-31, 2012. Frontiers in Artificial Intelligence and Applications, vol. 242, pp. 258–263. IOS Press. doi: 10.3233/978-1-61499-098-7-258.
  7. Fournier-Viger, P., Gomariz, A., Gueniche, T., Soltani, A., Wu, C., & Tseng, V. (2014). SPMF: A Java open-source pattern mining library. Journal of Machine Learning Research, 15, 3389–3393.
  8. Fournier-Viger, P., Gomariz, A., Gueniche, T., Mwamikazi, E., & Thomas, R. (2013). TKS: efficient mining of top-k sequential patterns. In Motoda, H., Wu, Z., Cao, L., Zaïane, O.R., Yao, M., & Wang, W. (Eds.) Advanced Data Mining and Applications, 9th International Conference, ADMA 2013, Hangzhou, China, December 14-16, 2013, Proceedings, Part I. Lecture Notes in Computer Science, vol. 8346, pp. 109–120. Springer. doi: 10.1007/978-3-642-53914-5_10.
  9. Garofalakis, M. N., Rastogi, R., & Shim, K. (2002). Mining sequential patterns with regular expression constraints. IEEE Trans. Knowl. Data Eng., 14(3), 530–552. doi: 10.1109/TKDE.2002.1000341.
  10. Guns, T., Nijssen, S., & Raedt, L. D. (2011). Itemset mining: A constraint programming perspective. Artif. Intell., 175(12-13), 1951–1983. doi: 10.1016/j.artint.2011.05.002.
  11. Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining top-k frequent closed patterns without minimum support. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9-12 December 2002, Maebashi City, Japan. pp. 211–218.
  12. Kemmar, A., Loudni, S., Lebbah, Y., Boizumault, P., & Charnois, T. (2015). PREFIX-PROJECTION global constraint for sequential pattern mining. In Pesant, G. (Ed.) Principles and Practice of Constraint Programming - 21st International Conference, CP 2015, Cork, Ireland, August 31 - September 4, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9255, pp. 226–243. Springer. doi: 10.1007/978-3-319-23219-5_17.
  13. Kemmar, A., Loudni, S., Lebbah, Y., Boizumault, P., & Charnois, T. (2016). A global constraint for mining sequential patterns with GAP constraint. In Integration of AI and OR Techniques in Constraint Programming - 13th International Conference, CPAIOR 2016, Banff, AB, Canada, May 29 - June 1, 2016, Proceedings. Lecture Notes in Computer Science, vol. 9676, pp. 198–215. Springer.
  14. Kemmar, A., Ugarte, W., Loudni, S., Charnois, T., Lebbah, Y., Boizumault, P., & Crémilleux, B. (2014). Mining relevant sequence patterns with CP-based framework. In 26th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2014, Limassol, Cyprus, November 10-12, 2014. pp. 552–559. IEEE Computer Society. doi: 10.1109/ICTAI.2014.89.
  15. Li, C., Yang, Q., Wang, J., & Li, M. (2012). Efficient mining of gap-constrained subsequences and its various applications. ACM Trans. Knowl. Discov. Data, 6(1), 2:1–2:39.
  16. Métivier, J.P., Loudni, S., & Charnois, T. (2013). A constraint programming approach for mining sequential patterns in a sequence database. In ECML/PKDD Workshop on Languages for Data Mining and Machine Learning.
  17. Négrevergne, B., Dries, A., Guns, T., & Nijssen, S. (2013). Dominance programming for itemset mining. In Xiong, H., Karypis, G., Thuraisingham, B. M., Cook, D. J., & Wu, X. (Eds.) 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013. pp. 557–566. IEEE Computer Society. doi: 10.1109/ICDM.2013.92.
  18. Négrevergne, B., & Guns, T. (2015). Constraint-based sequence mining using constraint programming. In Michel, L. (Ed.) Integration of AI and OR Techniques in Constraint Programming - 12th International Conference, CPAIOR 2015, Barcelona, Spain, May 18-22, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9075, pp. 288–305. Springer. doi: 10.1007/978-3-319-18008-3_20.
  19. Novak, P. K., Lavrac, N., & Webb, G. I. (2009). Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10, 377–403.
  20. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., & Hsu, M. (2001). PrefixSpan: Mining sequential patterns by prefix-projected growth. In Georgakopoulos, D., & Buchmann, A. (Eds.) Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany. pp. 215–224. IEEE Computer Society. doi: 10.1109/ICDE.2001.914830.
  21. Pei, J., Han, J., Mortazavi-Asl, B., & Zhu, H. (2000). Mining access patterns efficiently from web logs. In Terano, T., Liu, H., & Chen, A. L. P. (Eds.) Knowledge Discovery and Data Mining, Current Issues and New Applications, 4th Pacific-Asia Conference, PAKDD 2000, Kyoto, Japan, April 18-20, 2000, Proceedings. Lecture Notes in Computer Science, vol. 1805, pp. 396–407. Springer. doi: 10.1007/3-540-45571-X_47.
  22. Pei, J., Han, J., & Wang, W. (2002). Mining sequential patterns with constraints in large databases. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 4-9, 2002. pp. 18–25. ACM. doi: 10.1145/584792.584799.
  23. Pesant, G. (2004). A regular language membership constraint for finite sequences of variables. In Wallace, M. (Ed.) CP’04. LNCS, vol. 2239, pp. 482–495. Springer.
  24. Pyun, G., & Yun, U. (2014). Mining top-k frequent patterns with combination reducing techniques. Applied Intelligence, 41(1), 76–98.
  25. Raedt, L. D., & Zimmermann, A. (2007). Constraint-based pattern set mining. In Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA. pp. 237–248. SIAM. doi: 10.1137/1.9781611972771.22.
  26. Rojas, W. U., Boizumault, P., Loudni, S., Crémilleux, B., & Lepailleur, A. (2014). Mining (soft-) skypatterns using dynamic CSP. In Simonis, H. (Ed.) Integration of AI and OR Techniques in Constraint Programming - 11th International Conference, CPAIOR 2014, Cork, Ireland, May 19-23, 2014. Proceedings. Lecture Notes in Computer Science, vol. 8451, pp. 71–87. Springer. doi: 10.1007/978-3-319-07046-9_6.
  27. Rossi, F., van Beek, P., & Walsh, T. (Eds.) (2006). Handbook of Constraint Programming. New York: Elsevier Science Inc.
  28. Soulet, A., Raïssi, C., Plantevit, M., & Crémilleux, B. (2011). Mining dominant patterns in the sky. In Cook, D. J., Pei, J., Wang, W., Zaïane, O. R., & Wu, X. (Eds.) 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011. pp. 655–664. IEEE Computer Society. doi: 10.1109/ICDM.2011.100.
  29. Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Apers, P. M. G., Bouzeghoub, M., & Gardarin, G. (Eds.) Advances in Database Technology - EDBT’96, 5th International Conference on Extending Database Technology, Avignon, France, March 25-29, 1996, Proceedings. Lecture Notes in Computer Science, vol. 1057, pp. 3–17. Springer. doi: 10.1007/BFb0014140.
  30. Trasarti, R., Bonchi, F., & Goethals, B. (2008). Sequence mining automata: A new technique for mining frequent sequences under regular expressions. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy. pp. 1061–1066. IEEE Computer Society. doi: 10.1109/ICDM.2008.111.
  31. Tzvetkov, P., Yan, X., & Han, J. (2003). TSP: mining top-k closed sequential patterns. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA. pp. 347–354. IEEE Computer Society. doi: 10.1109/ICDM.2003.1250939.
  32. Wang, J., & Han, J. (2004). BIDE: efficient mining of frequent closed sequences. In Özsoyoglu, Z. M., & Zdonik, S. B. (Eds.) Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA. pp. 79–90. IEEE Computer Society. doi: 10.1109/ICDE.2004.1319986.
  33. Wang, J., Han, J., Lu, Y., & Tzvetkov, P. (2005). TFP: an efficient algorithm for mining top-k frequent closed itemsets. IEEE Trans. Knowl. Data Eng., 17(5), 652–664. doi: 10.1109/TKDE.2005.81.
  34. Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. In Barbará, D., & Kamath, C. (Eds.) Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003. pp. 166–177. SIAM. doi: 10.1137/1.9781611972733.15.
  35. Zaki, M. J. (2000). Sequence mining in categorical domains: Incorporating constraints. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 6-11, 2000. pp. 422–429. ACM. doi: 10.1145/354756.354849.
  36. Zaki, M. J. (2001). SPADE: an efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2), 31–60. doi: 10.1023/A:100765250231.

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. LITIO, University of Oran 1 Ahmed Ben Bella, Oran, Algeria
  2. GREYC (CNRS UMR 6072), University of Caen, Caen, France
  3. LIPN (CNRS UMR 7030), University PARIS 13, Paris, France