Advertisement

Sequential pattern sampling with norm-based utility

  • Lamine DiopEmail author
  • Cheikh Talibouya Diop
  • Arnaud Giacometti
  • Dominique Li
  • Arnaud Soulet
Regular Paper
  • 47 Downloads

Abstract

Sequential pattern mining has been introduced by Agrawal and Srikant (in: Proceedings of ICDE’95, pp 3–14, 1995) 2 decades ago, and its usefulness has been widely proved for different mining tasks and application fields such as web usage mining, text mining, bioinformatics, fraud detection and so on. Since 1995, despite numerous optimization proposals, sequential pattern mining remains a costly task that often generates too many patterns. This limit, also reached by itemset mining, was circumvented by pattern sampling. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures a good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach investigated for different kinds of patterns including itemsets and subgraphs has not yet been applied to sequential patterns. In this paper, we propose the first method dedicated to sequential pattern sampling. In addition to address sequential data, the originality of our approach is to introduce a class of interestingness measures relying on the norm of the sequence, named norm-based utilities. In particular, it enables to add constraints on the norm of sampled patterns to control the length of the drawn patterns and to avoid the pitfall of the “long tail” where the rarest patterns flood the user. We propose a new two-step random procedure integrating this class of measures, named \({\textsc {NUSSampling}}\) that randomly draws sequential patterns according to frequency weighted by a norm-based utility. We demonstrate that this method performs an exact sampling according to the underlying measure. Moreover, despite the use of rejection sampling, the experimental study shows that \({\textsc {NUSSampling}}\) remains efficient. We especially focus on the interest of norm constraints and exponential decays that help to draw general patterns of the “head”. We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals showing the interest of sequential pattern sampling with norm-based utility.

Keywords

Pattern mining Pattern sampling Sequential data 

Notes

Acknowledgements

This work has been partly supported by the CEA-MITIC (Centre d’Excellence Africain en Mathématiques, Informatique et TIC).

References

  1. 1.
    Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of ICDE’95, pp 3–14Google Scholar
  2. 2.
    Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86MathSciNetCrossRefGoogle Scholar
  3. 3.
    Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of EDBT vol 96, pp 3–17Google Scholar
  4. 4.
    Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60CrossRefGoogle Scholar
  5. 5.
    Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of ICDE 2001, pp 215–224Google Scholar
  6. 6.
    Garofalakis MN, Rastogi R, Shim K (1999) Spirit: sequential pattern mining with regular expression constraints. VLDB 99:7–10Google Scholar
  7. 7.
    Pei J, Han J, Lakshmanan LV (2001) Mining frequent itemsets with convertible constraints. In: Proceedings of ICDE 2001. IEEE, pp 433–442Google Scholar
  8. 8.
    Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: Proceedings of ICDE 2004. IEEE, pp 79–90Google Scholar
  9. 9.
    Yan X, Han J, Afshar R (2003) Clospan: mining: closed sequential patterns in large datasets. In: Proceedings of SDM 2003. SIAM, pp 166–177Google Scholar
  10. 10.
    Bosc G, Boulicaut J-F, Raïssi C, Kaytoue M (2016) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32:1–47MathSciNetzbMATHGoogle Scholar
  11. 11.
    Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB 2(1):730–741CrossRefGoogle Scholar
  12. 12.
    Bhuiyan M, Mukhopadhyay S, Hasan MA (2012) Interactive pattern mining on hidden data: a sampling-based solution. In: Proceedings of CIKM 2012, pp 95–104Google Scholar
  13. 13.
    Giacometti A, Soulet A (2017) Interactive pattern sampling for characterizing unlabeled data. In: Proceedings of IDA 2017, pp 99–111CrossRefGoogle Scholar
  14. 14.
    Dzyuba V, Mv L, Nijssen S, De Raedt L (2014) Interactive learning of pattern rankings. Int J Artif Intell Tools 23(06):32CrossRefGoogle Scholar
  15. 15.
    Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3–4):119–130CrossRefGoogle Scholar
  16. 16.
    Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 534–546CrossRefGoogle Scholar
  17. 17.
    Anderson C (2004) The long tail. Wired Mag 12(10):170–177Google Scholar
  18. 18.
    Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: IEEE international conference on data mining (ICDM)Google Scholar
  19. 19.
    Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of SIGKDD 2011, pp 582–590Google Scholar
  20. 20.
    van Leeuwen M (2014) Interactive data exploration using pattern mining. In: Andreas Holzinger, Igor Jurisica (eds)Interactive knowledge discovery and data mining in biomedical informatics. Springer, Berlin, pp 169–182Google Scholar
  21. 21.
    Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73Google Scholar
  22. 22.
    Hu Q, Imielinski T (2017) Alpine: progressive itemset mining with definite guarantees. In: Proceedings of the 2017 SIAM international conference on data mining. SIAM, pp 63–71Google Scholar
  23. 23.
    He Z, Xu X, Huang ZJ, Deng S (2005) Fp-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118CrossRefGoogle Scholar
  24. 24.
    Toivonen H et al (1996) Sampling large databases for association rules. Proc VLDB 96(96):134–145Google Scholar
  25. 25.
    Luo C, Chung SM (2004) A scalable algorithm for mining maximal frequent sequences using sampling. In: Proceedings of ICTAI 2004. IEEE, pp 156–165Google Scholar
  26. 26.
    Raissi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Proceedings of ICDM 2007, pp 631–636Google Scholar
  27. 27.
    Bendimerad AA, Plantevit M, Robardet C (2016) Unsupervised exceptional attributed sub-graph mining in urban data. In: Proceedings of ICDM 2016. IEEE, pp 21–30Google Scholar
  28. 28.
    Giacometti A, Soulet A (2018) Dense neighborhood pattern sampling in numerical data. In: Proceedings of SDM 2018, pp 756–764CrossRefGoogle Scholar
  29. 29.
    Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of SDM 2010. SIAM, pp 177–188Google Scholar
  30. 30.
    Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 87–95Google Scholar
  31. 31.
    Moens S, Goethals B (2013) Randomly sampling maximal itemsets. In: Proceedings of IDEA workshop 2013, pp 79–86Google Scholar
  32. 32.
    Dzyuba V, van Leeuwen M, De Raedt L (2017) Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Discov 31(5):1266–1293MathSciNetCrossRefGoogle Scholar
  33. 33.
    Gueguen M, Sentieys O, Termier A (2019) Accelerating itemset sampling using satisfiability constraints on FPGA. In: IEEE/ACM design, automation and test in Europe (DATE)Google Scholar
  34. 34.
    Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 69–77Google Scholar
  35. 35.
    Moens S, Boley M (2014) Instant exceptional model mining using weighted controlled pattern sampling. In: International symposium on intelligent data analysis. Springer, Berlin, pp 203–214CrossRefGoogle Scholar
  36. 36.
    Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Min Knowl Discov 29(3):732–764MathSciNetCrossRefGoogle Scholar
  37. 37.
    Egho E, Gay D, Boullé M, Voisine N, Clérot F (2017) A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 52(1):53–81CrossRefGoogle Scholar
  38. 38.
    Fournier-Viger P, Gomariz A, Gueniche T, Mwamikazi E, Thomas R (2013) Tks: efficient mining of top-k sequential patterns. In: International conference on advanced data mining and applications. Springer, Berlin, pp 109–120CrossRefGoogle Scholar
  39. 39.
    Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548
  40. 40.
    Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, Berlin 1–17Google Scholar
  41. 41.
    Arimura H, Uno T (2009) Polynomial-delay and polynomial-space algorithms for mining closed sequences, graphs, pictures in accessible set systems. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 1088–1099Google Scholar

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. 1.Department of Applied Science and TechnologyUniversity Gaston BergerSaint-LouisSenegal
  2. 2.LIFAT, University of ToursBloisFrance

Personalised recommendations