# Flexible constrained sampling with guarantees for pattern mining

## Abstract

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to (1) flexibility in terms of quality measures and constraints that can be used, and/or (2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: (1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and (2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.

## Keywords

Pattern sampling, Itemset mining, Pattern set mining, Tiling, Hashing-based sampling

## Notes

### Acknowledgements

The authors would like to thank Guy Van den Broeck for useful discussions and Martin Albrecht for the support with the m4ri library. Vladimir Dzyuba is supported by FWO-Vlaanderen.
