Data Mining and Knowledge Discovery, Volume 30, Issue 5, pp 1086–1111

Skopus: Mining top-k sequential patterns under leverage

  • François Petitjean
  • Tao Li
  • Nikolaj Tatti
  • Geoffrey I. Webb
Article

Abstract

This paper presents a framework for exact discovery of the top-k sequential patterns under leverage. It combines (1) a novel definition of the expected support for a sequential pattern, a concept on which most interestingness measures directly rely, with (2) Skopus, a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach: a pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both sets of experiments confirm the consistency and relevance of our approach with regard to the state of the art.
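
The partition idea can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the paper's definition of expected support: it assumes that only contiguous prefix/suffix splits are considered, that support is the fraction of sequences containing the pattern as a gapped subsequence, and that the expectation for a split is the plain product of the two supports. The names `is_subsequence`, `support`, `naive_leverage` and `toy_db` are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def support(pattern: Sequence, db: Sequence[Sequence]) -> float:
    """Fraction of sequences in `db` that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in db) / len(db)


def naive_leverage(pattern: Sequence, db: Sequence[Sequence]) -> float:
    """Observed support minus the largest expectation over prefix/suffix splits.

    ASSUMPTION: the expectation for a split is support(prefix) * support(suffix).
    This is a simplified stand-in, not the expected support defined in the paper.
    """
    if len(pattern) < 2:
        raise ValueError("pattern needs at least two items to be split")
    expected = max(
        support(pattern[:i], db) * support(pattern[i:], db)
        for i in range(1, len(pattern))
    )
    return support(pattern, db) - expected


# Toy database of four sequences (illustrative data only).
toy_db = ["abc", "abd", "acb", "xyz"]
```

On `toy_db`, the pattern `"ab"` occurs in three of the four sequences, so `support("ab", toy_db)` is 0.75, while the single split into `"a"` and `"b"` gives an expectation of 0.75 × 0.75 = 0.5625. The resulting value of `naive_leverage("ab", toy_db)` is 0.1875, a positive score indicating that the pattern is more frequent than this simple independence baseline predicts. Taking the maximum expectation over splits reflects the requirement that the pattern exceed what any one pair of sub-patterns would explain.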

Keywords

Data mining, Pattern mining, Sequential data, Exact discovery, Interestingness measures

Copyright information

© The Author(s) 2016

Authors and Affiliations

  • François Petitjean (1)
  • Tao Li (2)
  • Nikolaj Tatti (3)
  • Geoffrey I. Webb (1)
  1. Faculty of Information Technology, Monash University, Clayton, Australia
  2. School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, China
  3. Department of Information and Computer Science, Aalto University, Espoo, Finland
