Efficiently mining cohesion-based patterns and rules in event sequences

Data Mining and Knowledge Discovery

Abstract

Discovering patterns in long event sequences is an important data mining task. Traditionally, research has focused on frequency-based quality measures that allow algorithms to use the anti-monotonicity property to prune the search space and efficiently discover the most frequent patterns. In this work, we step away from such measures and evaluate patterns using cohesion, a measure of how close to each other, on average, the items making up the pattern appear in the sequence. Because cohesion is not an anti-monotonic measure, we develop an upper bound on cohesion and use it to prune the search space. By doing so, we are able to efficiently unearth rare, but strongly cohesive, patterns that existing methods often fail to discover. Furthermore, having found the occurrences of cohesive itemsets in the input sequence, we use them to discover the representative sequential patterns and the dominant partially ordered episodes, without going through the computationally expensive candidate generation procedures typically associated with sequential pattern and episode mining. Experiments show that our method efficiently discovers important patterns that existing state-of-the-art methods fail to discover.
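To make the notion of cohesion concrete, the following is a minimal brute-force sketch, not the pruned algorithm the abstract refers to. It assumes the definition used in the earlier cohesion papers cited in the references: for each occurrence of an item of the itemset, take the length of the shortest window that contains that occurrence and all items of the itemset, average these lengths, and divide the itemset size by the average. The function names `min_window_length` and `cohesion` are hypothetical, and the sketch further assumes one event per consecutive time stamp, so that list indices serve as time stamps.

```python
from typing import Hashable, Iterable, Optional, Sequence


def min_window_length(seq: Sequence[Hashable], items: set, t: int) -> Optional[int]:
    """Length of the shortest window [start, end] with start <= t <= end
    that contains every item in `items`, or None if no such window exists."""
    best = None
    for start in range(t, -1, -1):
        # Any window starting here already spans t - start + 1 positions.
        if best is not None and t - start + 1 >= best:
            break
        missing = set(items)
        for end in range(start, len(seq)):
            missing.discard(seq[end])
            if not missing and end >= t:
                best = end - start + 1  # shortest window for this start
                break
    return best


def cohesion(seq: Sequence[Hashable], items: Iterable[Hashable]) -> float:
    """Assumed definition: |X| divided by the mean minimal window length,
    averaged over all positions where an item of X occurs."""
    itemset = set(items)
    positions = [t for t, event in enumerate(seq) if event in itemset]
    windows = [min_window_length(seq, itemset, t) for t in positions]
    if not windows or any(w is None for w in windows):
        return 0.0  # the itemset never occurs in full
    return len(itemset) / (sum(windows) / len(windows))


if __name__ == "__main__":
    # Every 'a' sits next to a 'b': all minimal windows have length 2.
    print(cohesion(list("abccab"), {"a", "b"}))  # 1.0
    # The trailing 'a' needs a window of length 4: 2 / (8/3), about 0.75.
    print(cohesion(list("abcca"), {"a", "b"}))
```

Under this assumed definition the score does not depend on how often the itemset occurs, only on how tightly its items cluster, which is why a rare pattern can still score highly. The quadratic scan per occurrence is for illustration only and would not scale to long sequences.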

Notes

  1. The algorithm was given no name by its authors.

  2. The implementations of Winepi, Laxman and \({\textsc {Marbles}}_{\textsc {w}}\) are available at http://users.ics.aalto.fi/ntatti/software/closedepisodeminer.zip.

  3. The implementation of CMW was kindly provided by the author, but is not publicly available.

  4. The implementation of the generator is available at https://zimmermanna.users.greyc.fr/software.html.

  5. The implementation of \({\textsc {FCI}}_{\textsc {seq}}\) is available at https://bitbucket.org/len_feremans/fci_public.

  6. http://www.gutenberg.org/.

  7. http://www.trumptwitterarchive.com/.

References

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: International conference on very large data bases, pp 487–499

  • Church KW, Mercer RL (1993) Introduction to the special issue on computational linguistics using large corpora. Comput Linguist 19(1):1–24

  • Cule B, Goethals B (2010) Mining association rules in long sequences. In: Pacific-Asia conference on knowledge discovery and data mining

  • Cule B, Goethals B, Robardet C (2009) A new constraint for mining sets in sequences. In: Proceedings of the 2009 SIAM international conference on data mining

  • Cule B, Tatti N, Goethals B (2014) Marbles: Mining association rules buried in long event sequences. Stat Anal Data Min ASA Data Sci J 7(2):93–110

  • Cule B, Feremans L, Goethals B (2016) Efficient discovery of sets of co-occurring items in event sequences. In: European conference on machine learning and principles and practice of knowledge discovery in databases, pp 361–377. Springer

  • Feremans L, Cule B, Goethals B (2018) Mining top-k quantile-based cohesive sequential patterns. In: Proceedings of the 2018 SIAM international conference on data mining, pp 90–98. SIAM

  • Fowkes J, Sutton C (2016) A subsequence interleaving model for sequential pattern mining. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 835–844. ACM

  • Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge

  • Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87

  • Justeson JS, Katz SM (1995) Technical terminology: some linguistic properties and an algorithm for identification in text. Nat Lang Eng 1(1):9–27

  • Lam HT, Mörchen F, Fradkin D, Calders T (2014) Mining compressing sequential patterns. Stat Anal Data Min ASA Data Sci J 7(1):34–52

  • Laxman S, Sastry PS, Unnikrishnan KP (2007) A fast algorithm for finding frequent episodes in event streams. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining

  • Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

  • Méger N, Rigotti C (2004) Constraint-based mining of episode rules and optimal window sizes. In: European conference on machine learning and principles and practice of knowledge discovery in databases

  • Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440

  • Pei J, Han J, Wang W (2007) Constraint-based sequential pattern mining: the pattern-growth methods. J Intell Inf Syst 28(2):133–160

  • Petitjean F, Li T, Tatti N, Webb GI (2016) Skopus: mining top-k sequential patterns under leverage. Data Min Knowl Discov 30(5):1086–1111

  • Srikant R, Agrawal R (1996) Mining sequential patterns: Generalizations and performance improvements. In: International conference on extending database technology, pp 1–17. Springer

  • Tatti N (2014) Discovering episodes with compact minimal windows. Data Min Knowl Discov 28(4):1046–1077

  • Tatti N (2015) Ranking episodes using a partition model. Data Min Knowl Discov 29(5):1312–1342

  • Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66

  • Tatti N, Vreeken J (2012) The long and the short of it: summarising event sequences with serial episodes. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 462–470. ACM

  • Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: IEEE international conference on data engineering, pp 79–90

  • Webb GI (2010) Self-sufficient itemsets: an approach to screening potentially interesting associations between items. ACM Trans Knowl Discov Data 4(1):3

  • Zaki MJ (2001) Spade: An efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60

  • Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390

  • Zimmermann A (2014) Understanding episode mining techniques: Benchmarking on diverse, realistic, artificial data. Intell Data Anal 18(5):761–791

Acknowledgements

The authors would like to thank the VLAIO SBO HYMOP project for funding this research.

Author information

Corresponding author

Correspondence to Len Feremans.

Additional information

Responsible editor: Mohammed J. Zaki.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version appeared as “Efficient Discovery of Sets of Co-occurring Items in Event Sequences” (Cule et al. 2016). Sections 2.4 and 3.6 are based on “Mining Association Rules in Long Sequences” (Cule and Goethals 2010).

Appendix

Top-25 patterns

Tables 11 and 12 show the top-25 itemsets for Species and Trump, Tables 13 and 14 show the top-25 sequential patterns, and Tables 15 and 16 show the top-25 association rules. Note that, as discussed in Sect. 4, \({\textsc {FCI}}_{\textsc {seq}}\) produced fewer than 25 sequential patterns per dataset because of the minimal occurrence ratio threshold. A lower threshold would naturally yield more patterns, but we argue that these patterns are better omitted from the output, since they are not representative of the occurrences of the underlying itemset. Patterns for \({\textsc {FCI}}_{\textsc {seq}}\) shown in bold are not reported in the top-1000 of any other state-of-the-art method; likewise, patterns in bold for other methods are not reported by \({\textsc {FCI}}_{\textsc {seq}}\) in its top-1000. Since \({\textsc {FCI}}_{\textsc {seq}}\) produces fewer than 25 sequential patterns, nearly all of the sequential patterns found by other methods are in bold.

Table 11 Top 25 itemsets for Species
Table 12 Top 25 itemsets for Trump
Table 13 Top 25 sequential patterns for Species
Table 14 Top 25 sequential patterns for Trump
Table 15 Top 25 rules for Species
Table 16 Top 25 rules for Trump

Cite this article

Cule, B., Feremans, L. & Goethals, B. Efficiently mining cohesion-based patterns and rules in event sequences. Data Min Knowl Disc 33, 1125–1182 (2019). https://doi.org/10.1007/s10618-019-00628-0
