Abstract
Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to (1) flexibility in terms of quality measures and constraints that can be used, and/or (2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: (1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and (2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.
Notes
In other words, item variables \(I\) are the independent support of a pattern mining CSP.
Available at https://bitbucket.org/wxd/flexics.
The code was provided by their respective authors. We also obtained the “unmaintained” code for the uniform LRW sampler (personal communication), but were unable to make it run on our machines. The code for the FCA sampler was not available (personal communication).
Storing all itemsets on disk provides no benefits: it increases the mining runtime to 23 min and results in a 215 GB file; simply counting its lines with ‘wc -l’ takes 25 min.
References
Aggarwal CC, Han J (eds) (2014) Frequent pattern mining. Springer International Publishing, New York
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, pp 307–328
Albrecht M, Bard G (2012) The M4RI Library. The M4RI Team. https://bitbucket.org/malb/m4ri
Berlingerio M, Pinelli F, Calabrese F (2013) ABACUS: frequent pattern mining-based community discovery in multidimensional networks. Data Min Knowl Discov 27(3):294–320
Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1):65–89
Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of the 10th SIAM international conference on data mining (SDM ’10), pp 177–188
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’11), pp 582–590
Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’12), pp 69–77
Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining—interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics (IDEA ’13), pp 28–36
Bonchi F, Giannotti F, Lucchese C, Orlando S, Perego R, Trasarti R (2009) A constraint-based querying system for exploratory pattern discovery. Inf Syst 34(1):3–27
Bouillaguet C, Delaplace C (2016) Sparse Gaussian elimination modulo \(p\): an update. In: Proceedings of the 18th international workshop on computer algebra in scientific computing (CASC ’16), pp 101–116
Bringmann B, Nijssen S, Tatti N, Vreeken J, Zimmermann A (2010) Mining sets of patterns. In: Tutorial at the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’10)
Bucilă C, Gehrke J, Kifer D, White W (2003) Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Min Knowl Discov 7(3):241–272
Calders T, Rigotti C, Boulicaut JF (2006) A survey on condensed representations for frequent sets. In: Boulicaut JF, De Raedt L, Mannila H (eds) Constraint-based mining and inductive databases. Springer, Berlin, pp 64–80
Carvalho DR, Freitas AA, Ebecken N (2005) Evaluating the correlation between objective rule interestingness measures and real human interest. In: Proceedings of the 9th European conference on principles of data mining and knowledge discovery (PKDD ’05), pp 453–461
Chakraborty S, Meel KS, Vardi MY (2013) A scalable and nearly uniform generator of SAT witnesses. In: Proceedings of the 25th international conference on computer-aided verification (CAV ’13), pp 608–623
Chakraborty S, Fremont DJ, Meel KS, Vardi MY (2014) Distribution-aware sampling and weighted model counting for SAT. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI ’14), pp 1722–1730
Chakraborty S, Fremont DJ, Meel KS, Seshia SA, Vardi MY (2015) On parallel scalable uniform SAT witness generation. In: Proceedings of the 21st international conference on tools and algorithms for the construction and analysis of systems (TACAS ’15), vol 9035, pp 304–319
De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 7th SIAM international conference on data mining (SDM ’07), pp 237–248
Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Proceedings of the 21st Pacific-Asia conference on knowledge discovery and data mining (PAKDD ’17) (in press)
Ermon S, Gomes CP, Sabharwal A, Selman B (2013a) Embed and project: discrete sampling with universal hashing. Adv Neural Inf Process Syst 26:2085–2093
Ermon S, Gomes CP, Sabharwal A, Selman B (2013b) Taming the curse of dimensionality: discrete integration by hashing and optimization. In: Proceedings of the 30th international conference on machine learning (ICML ’13), pp 334–342
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on discovery science (DS ’04), pp 278–289
Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3):119–130
Gomes CP, van Hoeve WJ, Sabharwal A, Selman B (2007a) Counting CSP solutions using generalized XOR constraints. In: Proceedings of the 22nd AAAI conference on artificial intelligence (AAAI ’07), pp 204–209
Gomes CP, Sabharwal A, Selman B (2007b) Near-uniform sampling of combinatorial spaces using XOR constraints. Adv Neural Inf Process Syst 19:481–488
Guns T, Nijssen S, De Raedt L (2011) Itemset mining: a constraint programming perspective. Artif Intell 175(12–13):1951–1983
Guns T, Nijssen S, De Raedt L (2013) \(k\)-Pattern set mining under constraints. IEEE Trans Knowl Data Eng 25(2):402–418
Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730–741
Kemmar A, Ugarte W, Loudni S, Charnois T, Lebbah Y, Boizumault P, Crémilleux B (2014) Mining relevant sequence patterns with CP-based framework. In: Proceedings of the 26th IEEE international conference on tools with artificial intelligence (ICTAI ’14), pp 552–559
Khiari M, Boizumault P, Crémilleux B (2010) Constraint programming for mining n-ary patterns. In: Proceedings of the 16th international conference on principles and practice of constraint programming (CP ’10), pp 552–567
Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles of data mining and knowledge discovery (PKDD ’06), pp 577–584
Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’13), pp 288–303
Meel K, Vardi M, Chakraborty S, Fremont D, Seshia S, Fried D, Ivrii A, Malik S (2016) Constrained sampling and counting: universal hashing meets SAT solving. In: Proceedings of the Beyond NP AAAI workshop
Nijssen S, Zimmermann A (2014) Constraint-based pattern mining. In: Aggarwal CC, Han J (eds) Frequent pattern mining, chap 7. Springer International Publishing, New York, pp 147–163
Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings of the 15th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’09), pp 647–655
Paramonov S, van Leeuwen M, Denecker M, De Raedt L (2015) An exercise in declarative modeling for relational query mining. In: Proceedings of the 25th international conference on inductive logic programming (ILP ’15)
Pei J, Han J (2000) Can we push more constraints into frequent pattern mining? In: Proceedings of the 6th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’00), pp 350–354
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm R (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’04), pp 266–275
Shervashidze N, Vishwanathan S, Petri T, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison. In: Proceedings of the 12th international conference on artificial intelligence and statistics (AISTATS ’09), pp 488–495
Soos M (2010) Enhanced Gaussian elimination in DPLL-based SAT solvers. In: Proceedings of the pragmatics of SAT workshop (POS ’10), pp 2–14
Uno T, Kiyomi M, Arimura H (2005) LCM ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations (OSDM ’05), pp 77–86
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. In: Proceedings of the 3rd ACM SIGKDD conference on knowledge discovery and data mining (KDD ’97), pp 283–296
Zimmermann A, Nijssen S (2014) Supervised pattern mining and applications to classification. In: Aggarwal CC, Han J (eds) Frequent pattern mining, chap 17. Springer International Publishing, New York, pp 425–442
Acknowledgements
The authors would like to thank Guy Van den Broeck for useful discussions and Martin Albrecht for his support with the M4RI library. Vladimir Dzyuba is supported by FWO-Vlaanderen.
Additional information
Responsible editor: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja and Myra Spiliopoulou.
Appendix: WeightGen
In this section, we present an extended technical description of the WeightGen algorithm. The description closely follows Sections 3 and 4 of Chakraborty et al. (2014), while the pseudocode in Algorithm 2 is structured similarly to that of UniGen2, a close cousin of WeightGen (Chakraborty et al. 2015). Lines 1–3 correspond to the estimation phase and Lines 4–8 to the sampling phase; SolveBounded denotes the bounded enumeration oracle.
The parameters of the estimation phase are fixed to theoretically motivated values. \(\textit{pivot}_{est}\) denotes the maximal weight of a cell in the estimation phase; \(\textit{pivot}_{est}=46\) corresponds to estimation error tolerance \(\varepsilon _{est}=0.8\) (Line 10). If the total weight of the solutions in a given cell exceeds \(\textit{pivot}_{est}\), a new random XOR constraint is added, which in expectation halves the number of solutions in the cell. Repeating the process for a number of iterations increases the confidence of the estimate; e.g., 17 iterations result in \(1-\delta _{est}=0.8\) (Line 1). Note that Estimate essentially estimates the total weight of all solutions, from which \(N_{\textit{XOR}}\), the initial number of XOR constraints for the sampling phase, is derived (Line 4).
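To make the estimation loop concrete, the following Python fragment mirrors the description above. It is a minimal sketch, not the actual implementation: `solve_bounded` stands in for the bounded enumeration oracle SolveBounded (assumed to return the total weight of the solutions in the current cell, enumerated only until the bound is exceeded) and `random_xor` generates a random XOR constraint; both signatures are our assumptions. Weight rescaling and restarts, discussed below, are omitted.

```python
import random
import statistics

def random_xor(n_vars):
    """A random XOR constraint over the CSP variables: each variable is
    included with probability 1/2, plus a random parity bit."""
    return [random.random() < 0.5 for _ in range(n_vars)], random.random() < 0.5

def estimate_total_weight(solve_bounded, n_vars, pivot_est=46, iters=17):
    """Estimation phase: add random XORs until the current cell weighs at
    most pivot_est, scale by the number of cells (2^|XORs|), and take the
    median over iterations (17 iterations -> confidence 1 - delta_est = 0.8)."""
    estimates = []
    for _ in range(iters):
        xors = []
        while True:
            # solve_bounded enumerates solutions of the CSP plus `xors`,
            # stopping once their total weight exceeds pivot_est.
            cell_weight = solve_bounded(xors, pivot_est)
            if cell_weight <= pivot_est:
                estimates.append(cell_weight * 2 ** len(xors))
                break
            # Cell too heavy: a fresh random XOR halves it in expectation.
            xors.append(random_xor(n_vars))
    return statistics.median(estimates)
```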
A similar procedure is employed in the sampling phase. It starts with \(N_{\textit{XOR}}\) constraints and adds at most three extra constraints. The user-chosen error tolerance parameter \(\kappa \) determines the range \(\left[ \textit{loThresh},\ \textit{hiThresh}\right] \) within which the total weight of a suitable cell must lie (Line 5). For example, \(\kappa =0.9\) corresponds to the range \(\left[ 6.7,\ 49.4\right] \). If a suitable cell is obtained, a solution is drawn from it with probability exactly proportional to its weight; otherwise, no sample is returned. Requiring the total cell weight to exceed a lower threshold ensures the lower bound on the sampling accuracy.
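The sampling phase can be sketched analogously, again under our own assumed names: `solve_cell` is an assumed oracle that enumerates the (solution, weight) pairs of a cell up to a bounded size, and the failure handling is simplified. The leapfrogging refinements described in the implementation details below are not included.

```python
import math
import random

def sample_one(solve_cell, weight_est, pivot_samp, lo, hi, n_vars):
    """One sampling attempt: hash the solution space into cells of expected
    weight ~pivot_samp, then draw a solution from a suitable cell with
    probability exactly proportional to its weight.
    Reuses random_xor from the estimation sketch above."""
    # Initial number of XORs, derived from the estimated total weight.
    n_xor = max(0, round(math.log2(weight_est / pivot_samp)))
    xors = [random_xor(n_vars) for _ in range(n_xor)]
    for _ in range(4):  # the initial cell plus at most three extra XORs
        cell = solve_cell(xors)  # bounded list of (solution, weight) pairs
        total = sum(w for _, w in cell)
        if lo <= total <= hi:
            solutions, weights = zip(*cell)
            return random.choices(solutions, weights=weights, k=1)[0]
        if total < lo:
            return None  # cell already too light; this attempt fails
        xors.append(random_xor(n_vars))  # cell too heavy: split it further
    return None  # no suitable cell found; no sample is returned
```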
The preceding presentation makes two simplifying assumptions: (1) all weights lie in \(\left[ 1/r,\ 1\right] \); (2) adding XOR constraints never results in unsatisfiable subproblems (empty cells). The former is relaxed by multiplying pivots by \(\hat{w}_{\textit{max}} = \hat{w}_{\textit{min}} \times \hat{r} < 1\), where \(\hat{w}_{\textit{min}}\) is the smallest weight observed so far. The latter is solved by simply restarting an iteration with a newly generated set of constraints. See Chakraborty et al. (2014) for the full explanation, including the precise formulae to compute all parameters.
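Both relaxations can be layered on top of the sketches above. The fragment below is illustrative only; `r_hat` (the assumed weight ratio) and the restart cap `max_restarts` are our own names, and `sample_one` is the function from the previous sketch.

```python
def scaled_pivot(pivot, w_min_observed, r_hat):
    """Relax assumption (1): scale a pivot by the running estimate of the
    maximal weight, w_max_hat = w_min_observed * r_hat (< 1)."""
    return pivot * (w_min_observed * r_hat)

def sample_with_restarts(solve_cell, weight_est, pivot_samp, lo, hi,
                         n_vars, max_restarts=10):
    """Relax assumption (2): if the XOR constraints yield no suitable cell
    (e.g., an empty one), restart with freshly generated constraints."""
    for _ in range(max_restarts):
        sample = sample_one(solve_cell, weight_est, pivot_samp, lo, hi, n_vars)
        if sample is not None:
            return sample
    return None
```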
1.1 Implementation details
Following the suggestions of Chakraborty et al. (2015), we implement leapfrogging, a technique that improves the performance of the umbrella sampling procedure and thus benefits both GFlexics and EFlexics. First, after three iterations of the estimation phase, we initialize each following iteration with the smallest number of XOR constraints returned in the previous iterations (rather than with zero XORs). Second, in the sampling phase, we start with one XOR constraint more than the number suggested by theory. If the cell is too small, we remove one constraint; if it is too large, we continue adding (at most two more) constraints. Both modifications are based on the observation that the theoretical parameter values address hypothetical corner cases that rarely occur in practice. Finally, we run the estimation phase only until the initial number of XOR constraints for sampling, which depends only on the median of the total weight estimates, converges. For example, if the estimation phase is supposed to run for 17 iterations, convergence can happen as early as after 9 iterations.
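As an illustration, leapfrogging changes the estimation loop of the earlier sketch as follows. The convergence test here is a simplified stand-in for the actual check, and `solve_bounded` and `random_xor` are the assumed helpers from the estimation sketch; the analogous tweak on the sampling side (start one XOR high, remove or add as needed) is not repeated.

```python
import math
import statistics

def estimate_leapfrog(solve_bounded, n_vars, pivot_est=46, iters=17):
    """Estimation phase with leapfrogging: after three iterations, start
    each iteration from the smallest XOR count seen so far, and stop early
    once the derived initial number of sampling XORs stops changing."""
    estimates, xor_counts = [], []
    prev_n_xor = None
    for it in range(iters):
        # Leapfrog: skip XOR counts that previous iterations never needed.
        start = min(xor_counts) if it >= 3 else 0
        xors = [random_xor(n_vars) for _ in range(start)]
        while True:
            cell_weight = solve_bounded(xors, pivot_est)
            if cell_weight <= pivot_est:
                break
            xors.append(random_xor(n_vars))
        xor_counts.append(len(xors))
        estimates.append(cell_weight * 2 ** len(xors))
        # The initial sampling XOR count depends only on the median estimate
        # (assumes a non-empty solution space, i.e., a positive median).
        n_xor = max(0, round(math.log2(statistics.median(estimates))))
        if it >= 8 and n_xor == prev_n_xor:
            break  # converged, e.g., after 9 of 17 iterations
        prev_n_xor = n_xor
    return statistics.median(estimates)
```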
About this article
Cite this article
Dzyuba, V., van Leeuwen, M. & De Raedt, L. Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Disc 31, 1266–1293 (2017). https://doi.org/10.1007/s10618-017-0501-6