Flexible constrained sampling with guarantees for pattern mining

Abstract

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to (1) flexibility in terms of quality measures and constraints that can be used, and/or (2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: (1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and (2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.

Notes

  1. In other words, item variables \(I\) are the independent support of a pattern mining CSP.

  2. Theorem 1 corresponds to and follows from Theorem 3 of Chakraborty et al. (2014).

  3. Available at https://bitbucket.org/wxd/flexics.

  4. https://dtai.cs.kuleuven.be/CP4IM.

  5. https://bitbucket.org/malb/m4ri/.

  6. Source: https://dtai.cs.kuleuven.be/CP4IM/datasets/.

  7. The code was provided by their respective authors. We also obtained the “unmaintained” code for the uniform LRW sampler (personal communication), but were unable to make it run on our machines. The code for the FCA sampler was not available (personal communication).

  8. http://research.nii.ac.jp/~uno/codes.htm, ver. 3.

  9. http://fimi.ua.ac.be/data/.

  10. Storing all itemsets on disk provides no benefits: it increases the mining runtime to 23 min and results in a file of 215 Gb; simply counting its lines with ‘wc -l’ takes 25 min.

References

  1. Aggarwal CC, Han J (eds) (2014) Frequent pattern mining. Springer International Publishing, New York

  2. Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, pp 307–328

  3. Albrecht M, Bard G (2012) The M4RI Library. The M4RI Team. https://bitbucket.org/malb/m4ri

  4. Berlingerio M, Pinelli F, Calabrese F (2013) ABACUS: frequent pattern mining-based community discovery in multidimensional networks. Data Min Knowl Discov 27(3):294–320

  5. Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1):65–89

  6. Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of the 10th SIAM international conference on data mining (SDM ’10), pp 177–188

  7. Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’11), pp 582–590

  8. Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’12), pp 69–77

  9. Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining—interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics (IDEA ’13), pp 28–36

  10. Bonchi F, Giannotti F, Lucchese C, Orlando S, Perego R, Trasarti R (2009) A constraint-based querying system for exploratory pattern discovery. Inf Syst 34(1):3–27

  11. Bouillaguet C, Delaplace C (2016) Sparse Gaussian elimination modulo \(p\): an update. In: Proceedings of the 18th international workshop on computer algebra in scientific computing (CASC ’16), pp 101–116

  12. Bringmann B, Nijssen S, Tatti N, Vreeken J, Zimmermann A (2010) Mining sets of patterns. In: Tutorial at the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’10)

  13. Bucilă C, Gehrke J, Kifer D, White W (2003) Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Min Knowl Discov 7(3):241–272

  14. Calders T, Rigotti C, Boulicaut JF (2006) A survey on condensed representations for frequent sets. In: Boulicaut JF, De Raedt L, Mannila H (eds) Constraint-based mining and inductive databases. Springer, Berlin, pp 64–80

  15. Carvalho DR, Freitas AA, Ebecken N (2005) Evaluating the correlation between objective rule interestingness measures and real human interest. In: Proceedings of the 9th European conference on principles of data mining and knowledge discovery (PKDD ’05), pp 453–461

  16. Chakraborty S, Meel KS, Vardi MY (2013) A scalable and nearly uniform generator of SAT witnesses. In: Proceedings of the 25th international conference on computer-aided verification (CAV ’13), pp 608–623

  17. Chakraborty S, Fremont DJ, Meel KS, Vardi MY (2014) Distribution-aware sampling and weighted model counting for SAT. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI ’14), pp 1722–1730

  18. Chakraborty S, Fremont DJ, Meel KS, Seshia SA, Vardi MY (2015) On parallel scalable uniform SAT witness generation. In: Proceedings of the 21st international conference on tools and algorithms for the construction and analysis of systems (TACAS ’15), vol 9035, pp 304–319

  19. De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 7th SIAM international conference on data mining (SDM ’07), pp 237–248

  20. Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Proceedings of the 21st Pacific-Asia conference on knowledge discovery and data mining (PAKDD ’17) (in press)

  21. Ermon S, Gomes CP, Sabharwal A, Selman B (2013a) Embed and project: discrete sampling with universal hashing. Adv Neural Inf Process Syst 26:2085–2093

  22. Ermon S, Gomes CP, Sabharwal A, Selman B (2013b) Taming the curse of dimensionality: discrete integration by hashing and optimization. In: Proceedings of the 30th international conference on machine learning (ICML ’13), pp 334–342

  23. Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on discovery science (DS ’04), pp 278–289

  24. Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3):119–130

  25. Gomes CP, van Hoeve Wj, Sabharwal A, Selman B (2007a) Counting CSP solutions using generalized XOR constraints. In: Proceedings of the 22nd AAAI conference on artificial intelligence (AAAI ’07), pp 204–209

  26. Gomes CP, Sabharwal A, Selman B (2007b) Near-uniform sampling of combinatorial spaces using XOR constraints. Adv Neural Inf Process Syst 19:481–488

  27. Guns T, Nijssen S, De Raedt L (2011) Itemset mining: a constraint programming perspective. Artif Intell 175(12–13):1951–1983

  28. Guns T, Nijssen S, De Raedt L (2013) \(k\)-Pattern set mining under constraints. IEEE Trans Knowl Data Eng 25(2):402–418

  29. Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730–741

  30. Kemmar A, Ugarte W, Loudni S, Charnois T, Lebbah Y, Boizumault P, Crémilleux B (2014) Mining relevant sequence patterns with CP-based framework. In: Proceedings of the 26th IEEE international conference on tools with artificial intelligence (ICTAI ’14), pp 552–559

  31. Khiari M, Boizumault P, Crémilleux B (2010) Constraint programming for mining n-ary patterns. In: Proceedings of the 16th international conference on principles and practice of constraint programming (CP ’10), pp 552–567

  32. Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles of data mining and knowledge discovery (PKDD ’06), pp 577–584

  33. Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’13), pp 288–303

  34. Meel K, Vardi M, Chakraborty S, Fremont D, Seshia S, Fried D, Ivrii A, Malik S (2016) Constrained sampling and counting: universal hashing meets SAT solving. In: Proceedings of the beyond NP AAAI workshop

  35. Nijssen S, Zimmermann A (2014) Constraint-based pattern mining. In: Aggarwal CC, Han J (eds) Frequent pattern mining, chap 7. Springer International Publishing, New York, pp 147–163

  36. Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings of the 15th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’09), pp 647–655

  37. Paramonov S, van Leeuwen M, Denecker M, De Raedt L (2015) An exercise in declarative modeling for relational query mining. In: Proceedings of the 25th international conference on inductive logic programming (ILP ’15)

  38. Pei J, Han J (2000) Can we push more constraints into frequent pattern mining? In: Proceedings of the 6th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’00), pp 350–354

  39. Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm R (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’04), pp 266–275

  40. Shervashidze N, Vishwanathan S, Petri T, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison. In: Proceedings of the 12th international conference on artificial intelligence and statistics (AISTATS ’09), pp 488–495

  41. Soos M (2010) Enhanced Gaussian elimination in DPLL-based SAT solvers. In: Proceedings of the pragmatics of SAT workshop (POS ’10), pp 2–14

  42. Uno T, Kiyomi M, Arimura H (2005) LCM ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations (OSDM ’05), pp 77–86

  43. Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. In: Proceedings of the 3rd ACM SIGKDD conference on knowledge discovery and data mining (KDD ’97), pp 283–296

  44. Zimmermann A, Nijssen S (2014) Supervised pattern mining and applications to classification. In: Aggarwal CC, Han J (eds) Frequent pattern mining, chap 17. Springer International Publishing, New York, pp 425–442

Acknowledgements

The authors would like to thank Guy Van den Broeck for useful discussions and Martin Albrecht for the support with the m4ri library. Vladimir Dzyuba is supported by FWO-Vlaanderen.

Author information

Correspondence to Vladimir Dzyuba.

Additional information

Responsible editors: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja and Myra Spiliopoulou.

Appendix: WeightGen

In this section, we present an extended technical description of the WeightGen algorithm. The description closely follows Sections 3 and 4 of Chakraborty et al. (2014), whereas the pseudocode in Algorithm 2 is structured similarly to that of UniGen2, a close cousin of WeightGen (Chakraborty et al. 2015). Lines 1–3 correspond to the estimation phase and Lines 4–8 correspond to the sampling phase. SolveBounded stands for the bounded enumeration oracle.

[Algorithm 2: WeightGen pseudocode (figure not reproduced)]

The parameters of the estimation phase are fixed to particular theoretically motivated values. \(\textit{pivot}_{est}\) denotes the maximal weight of a cell at the estimation phase; \(\textit{pivot}_{est}=46\) corresponds to estimation error tolerance \(\varepsilon _{est}=0.8\) (Line 10). If the total weight of solutions in a given cell exceeds \(\textit{pivot}_{est}\), a new random XOR constraint is added in order to eliminate a number of solutions. Repeating the process for a number of iterations increases the confidence of the estimate, e.g., 17 iterations result in \(1-\delta _{est}=0.8\) (Line 1). Note that Estimate essentially estimates the total weight of all solutions, from which \(N_{\textit{XOR}}\), the initial number of XOR constraints for the sampling phase, is derived (Line 4).
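To make the estimation loop concrete, the following Python sketch mimics it on a toy problem whose solutions are enumerated explicitly, so that a list comprehension plays the role of the SolveBounded oracle. The function names, the explicit solution list, and the cap on the number of XOR constraints are illustrative assumptions and not part of WeightGen as published; the defaults `pivot=46` and `iterations=17` are the values quoted above.

```python
import random
from itertools import product
from statistics import median


def random_xor(n_vars, rng):
    """Draw a random parity (XOR) constraint over n_vars binary variables."""
    coeffs = [rng.randint(0, 1) for _ in range(n_vars)]
    rhs = rng.randint(0, 1)
    return coeffs, rhs


def satisfies(assignment, xor):
    """Check whether a 0/1 assignment satisfies one XOR constraint."""
    coeffs, rhs = xor
    return sum(c * a for c, a in zip(coeffs, assignment)) % 2 == rhs


def estimate_once(solutions, weight, n_vars, pivot, rng):
    """One iteration: add XORs until the cell weighs at most `pivot`."""
    while True:
        cell, n_xor = list(solutions), 0
        # Toy-only assumption: never add more XORs than there are variables.
        while sum(weight(s) for s in cell) > pivot and n_xor < n_vars:
            xor = random_xor(n_vars, rng)
            cell = [s for s in cell if satisfies(s, xor)]
            n_xor += 1
        if cell:  # an empty cell triggers a restart with fresh constraints
            # Each XOR halves the space in expectation, so scale back up.
            return sum(weight(s) for s in cell) * 2 ** n_xor


def estimate_total_weight(solutions, weight, n_vars,
                          pivot=46, iterations=17, seed=0):
    """Median of several independent estimates of the total solution weight."""
    if not solutions:
        raise ValueError("need at least one solution")
    rng = random.Random(seed)
    return median(estimate_once(solutions, weight, n_vars, pivot, rng)
                  for _ in range(iterations))


small = list(product([0, 1], repeat=4))  # total weight 16, below the pivot
large = list(product([0, 1], repeat=8))  # total weight 256, needs XORs
print(estimate_total_weight(small, lambda s: 1, n_vars=4))
print(estimate_total_weight(large, lambda s: 1, n_vars=8))
```

For the small instance no XOR constraint is needed, so the estimate equals the exact total weight. For the large instance the result depends on the random constraints drawn; taking the median over iterations is what raises the confidence of the estimate.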

A similar procedure is employed at the sampling phase. It starts with \(N_{\textit{XOR}}\) constraints and adds at most three extra constraints. The user-chosen error tolerance parameter \(\kappa \) determines the range \(\left[ \textit{loThresh},\ \textit{hiThresh}\right] \), within which the total weight of a suitable cell should lie (Line 5). For example, \(\kappa =0.9\) corresponds to range \(\left[ 6.7,\ 49.4\right] \). If a suitable cell can be obtained, a solution is sampled exactly from all solutions in the cell; otherwise, no sample is returned. Requiring the total cell weight to exceed a particular value ensures the lower bound on the sampling accuracy.
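The numeric examples in this appendix can be reproduced with a few lines of Python. The formulae below are written down as we recall them from Chakraborty et al. (2014) and should be treated as assumptions to be checked against that paper; they do reproduce \(\textit{pivot}_{est}=46\) for \(\varepsilon _{est}=0.8\) and the range \(\left[ 6.7,\ 49.4\right] \) for \(\kappa =0.9\). The function names and the explicit list representation of a cell are hypothetical.

```python
import math
import random


def estimation_pivot(eps_est):
    """Assumed formula: 2 * ceil(e^(3/2) * (1 + 1/eps_est)^2)."""
    return 2 * math.ceil(math.exp(1.5) * (1 + 1 / eps_est) ** 2)


def sampling_thresholds(kappa):
    """Assumed formulae for pivot, loThresh and hiThresh given kappa."""
    pivot = math.ceil(4.03 * (1 + 1 / kappa) ** 2)
    scale = math.sqrt(2) * (1 + kappa)
    return pivot, pivot / scale, 1 + scale * pivot


def sample_from_cell(cell, weight, lo_thresh, hi_thresh, rng):
    """Return a weight-proportional sample if the cell is suitable, else None."""
    weights = [weight(s) for s in cell]
    if not lo_thresh <= sum(weights) <= hi_thresh:
        return None  # unsuitable cell: no sample is returned
    return rng.choices(cell, weights=weights, k=1)[0]


pivot, lo, hi = sampling_thresholds(0.9)
print(estimation_pivot(0.8), pivot, round(lo, 1), round(hi, 1))  # 46 18 6.7 49.4
```

Returning `None` for a cell outside the range mirrors the "no sample is returned" case; a caller would typically retry with fresh XOR constraints.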

The preceding presentation makes two simplifying assumptions: (1) all weights lie in \(\left[ 1/r,\ 1\right] \); (2) adding XOR constraints never results in unsatisfiable subproblems (empty cells). The former is relaxed by multiplying pivots by \(\hat{w}_{\textit{max}} = \hat{w}_{\textit{min}} \times \hat{r} < 1\), where \(\hat{w}_{\textit{min}}\) is the smallest weight observed so far. The latter is solved by simply restarting an iteration with a newly generated set of constraints. See Chakraborty et al. (2014) for the full explanation, including the precise formulae to compute all parameters.

Implementation details

Following suggestions of Chakraborty et al. (2015), we implement leapfrogging, a technique that improves the performance of the umbrella sampling procedure and thus benefits both GFlexics and EFlexics. First, after three iterations of the estimation phase, we initialize the following iterations with a number of XOR constraints that is equal to the smallest number returned in the previous iterations (rather than with zero XORs). Second, in the sampling phase, we start with one XOR constraint more than the number suggested by theory. If the cell is too small, we remove one constraint; if it is too large, we proceed by adding at most two constraints. Both modifications are based on the observation that theoretical parameter values address hypothetical corner cases that rarely occur in practice. Finally, we only run the estimation phase until the initial number of XOR constraints, which only depends on the median of total weight estimates, converges. For example, if the estimation phase is supposed to run for 17 iterations, convergence can occur after as few as 9 iterations.
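The sampling-phase part of leapfrogging can be sketched as a small search over the number of XOR constraints. This is one possible reading of the description above, not the actual Flexics code: `cell_weight` is a hypothetical callback that returns the total weight of the cell induced by the first `k` XOR constraints, and the thresholds are passed in by the caller.

```python
def leapfrog_cell_search(cell_weight, n_xor_theory, lo_thresh, hi_thresh):
    """Return the number of XORs that yields a suitable cell, or None."""
    k = n_xor_theory + 1          # start one XOR above the theoretical value
    w = cell_weight(k)
    if w < lo_thresh:             # cell too small: remove one constraint
        k -= 1
        w = cell_weight(k)
    else:
        added = 0
        while w > hi_thresh and added < 2:  # too large: add at most two
            k += 1
            added += 1
            w = cell_weight(k)
    return k if lo_thresh <= w <= hi_thresh else None


# Idealised example: every XOR constraint halves a total weight of 1024.
halving = lambda k: 1024 / 2 ** k
print(leapfrog_cell_search(halving, 4, 6.7, 49.4))  # 5
print(leapfrog_cell_search(halving, 7, 6.7, 49.4))  # 7
```

Under this reading the search never uses more than three constraints beyond the theoretical number, the same maximum as in the non-leapfrogging sampling phase, but it usually needs fewer oracle calls because the first cell tried is often already suitable.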

Cite this article

Dzyuba, V., van Leeuwen, M. & De Raedt, L. Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Disc 31, 1266–1293 (2017). https://doi.org/10.1007/s10618-017-0501-6

Keywords

  • Pattern sampling
  • Itemset mining
  • Pattern set mining
  • Tiling
  • Hashing-based sampling