Krimp: mining itemsets that compress

Vreeken, Jilles; van Leeuwen, Matthijs; Siebes, Arno

doi:10.1007/s10618-010-0202-x

Open access
Published: 16 October 2010

Volume 23, pages 169–214, (2011)

Data Mining and Knowledge Discovery

Jilles Vreeken¹,², Matthijs van Leeuwen¹ & Arno Siebes¹

Abstract

One of the major problems in pattern mining is the explosion of the number of results. Tight constraints reveal only common knowledge, while loose constraints lead to an explosion in the number of returned patterns. This is caused by large groups of patterns essentially describing the same set of transactions. In this paper we approach this problem using the MDL principle: the best set of patterns is that set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned; a dramatic reduction, up to seven orders of magnitude, in the number of frequent itemsets. These selections, called code tables, are of high quality. This is shown with compression ratios, swap-randomisation, and the accuracies of the code table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
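To make the idea of "the set of patterns that compresses the database best" concrete, the following is a minimal sketch of a two-part, MDL-style cost for a candidate set of itemsets. The abstract does not spell out the encoding, so everything here is an assumption made for illustration: the function names `cover` and `total_size`, the greedy longest-first cover order, the use of Shannon-optimal code lengths derived from usage counts, and the way the code table itself is charged. It should not be read as the exact Krimp encoding.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets.

    Itemsets are tried in code-table order (an assumed heuristic).
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(database, patterns):
    """Assumed two-part cost in bits: encoded data plus encoded code table."""
    # Singleton code lengths from raw item frequencies (used to price the model).
    item_counts = Counter(i for t in database for i in t)
    n_items = sum(item_counts.values())
    st_len = {i: -log2(c / n_items) for i, c in item_counts.items()}

    # Code table: candidate patterns (longest first), then all singletons,
    # so that every transaction can always be covered completely.
    singletons = [frozenset([i]) for i in sorted(item_counts)]
    ordered = sorted((frozenset(p) for p in patterns), key=len, reverse=True)
    code_table = ordered + singletons

    # Usage counts determine code lengths: frequently used itemsets get short codes.
    usage = Counter(x for t in database for x in cover(t, code_table))
    n_codes = sum(usage.values())
    ct_len = {x: -log2(u / n_codes) for x, u in usage.items()}

    data_bits = sum(u * ct_len[x] for x, u in usage.items())
    model_bits = sum(sum(st_len[i] for i in x) + ct_len[x] for x in usage)
    return data_bits + model_bits


# Toy database: the itemset {a, b, c} occurs in most transactions.
db = [{"a", "b", "c"}] * 8 + [{"a", "b"}, {"c"}]
print(total_size(db, []))                 # singletons only
print(total_size(db, [{"a", "b", "c"}]))  # smaller: the pattern pays for itself
```

On this toy database, adding the itemset {a, b, c} lowers the total cost, because one short code replaces three singleton codes in eight of the ten transactions. Under this kind of cost, a pattern that merely restates what other patterns already describe gains little usage and tends not to earn its place in the code table.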

References

Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Advances in knowledge discovery and data mining, AAAI, pp 307–328
Bathoorn R, Koopman A, Siebes A (2006) Reducing the frequent pattern set. In: Proceedings of the ICDM-workshops’06, pp 55–59
Bayardo R (1998) Efficiently mining long patterns from databases. In: Proceedings of SIGMOD’98, pp 85–93
Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the ICDM’07, pp 63–72
Calders T, Goethals B (2002) Mining all non-derivable frequent itemsets. In: Proceedings of the ECML PKDD’02, pp 74–85
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of KDD’04, pp 79–88
Chakrabarti S, Sarawagi S, Dom B (1998) Mining surprising patterns using temporal description length. In: Proceedings of VLDB’98, Morgan Kaufmann, San Francisco, pp 606–617
Chandola V, Kumar V (2007) Summarization—compressing data into an informative representation. Knowl Inf Syst 12(3): 355–378
Coenen F (2003) The LUCS–KDD discretised/normalised ARM and CARM data library. http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
Coenen F (2004) The LUCS–KDD software library. http://www.csc.liv.ac.uk/~frans/KDD/Software
Cover T, Thomas J (2006) Elements of information theory, 2nd edn. John Wiley and Sons, New York
Crémilleux B, Boulicaut JF (2002) Simplest rules characterizing classes generated by δ-free sets. In: Proceedings of KBSAAI’02, pp 33–46
Duda R, Hart P (1973) Pattern classification and scene analysis. John Wiley and Sons, New York
Faloutsos C, Megalooikonomou V (2007) On data mining, compression and Kolmogorov complexity. Data Min Knowl Discov 15(1): 3–20
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of DS’04, pp 278–289
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): 14
Goethals B, Zaki MJ (2003) Frequent itemset mining implementations repository (FIMI). http://fimi.cs.helsinki.fi
Grünwald PD (2005) Minimum description length tutorial. In: Grünwald P, Myung I (eds) Advances in minimum description length. MIT Press, Cambridge
Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
Hand D, Adams N, Bolton R (eds) (2002) Pattern detection and discovery. Springer, New York
Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Proceedings of KDD’07, pp 350–359
Heikinheimo H, Vreeken J, Siebes A, Mannila H (2009) Low-entropy set selection. In: Proceedings of SDM’09, pp 569–579
Karp RM (1972) Reducibility among combinatorial problems. In: Miller R, Thatcher J (eds) Proceedings of a symposium on the complexity of computer computations. Plenum Press, New York, USA, pp 85–103
Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of KDD’04, pp 206–215
Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1): 99–129
Knobbe AJ, Ho EKY (2006a) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of KDD’06, pp 237–244
Knobbe AJ, Ho EKY (2006b) Pattern teams. In: Proceedings of the ECML PKDD’06, pp 577–584
Kohavi R, Brodley C, Frasca B, Mason L, Zheng Z (2000) KDD-Cup 2000 organizers’ report: peeling the onion. SIGKDD Explor 2(2):86–98. http://www.ecn.purdue.edu/KDDCUP
Koopman A, Siebes A (2008) Discovering relational item sets efficiently. In: Zaki M, Wang K (eds) Proceedings of SDM’08, SIAM, pp 108–119
Koopman A, Siebes A (2009) Characteristic relational patterns. In: Proceedings of KDD’09, pp 437–446
Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, New York
Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: Proceedings of KDD’98, pp 80–86
Liu G, Lu H, Yu JX, Wei W, Xiao X (2004) AFOPT: an efficient implementation of pattern growth approach. In: Proceedings of the 2nd workshop on frequent itemset mining implementations
Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: Proceedings of KDD’96, pp 189–194
Mannila H, Toivonen H (1997) Levelwise search and borders of theories in knowledge discovery. Data Min Knowl Discov 1(3): 241–258
Mehta M, Agrawal R, Rissanen J (1996) SLIQ: a fast scalable classifier for data mining. In: Advances in database technology. Springer, NY, pp 18–32
Meretakis D, Lu H, Wüthrich B (2000) A study on the performance of large bayes classifier. In: Proceedings of the ECML’00, pp 271–279
Mielikäinen T, Mannila H (2003) The pattern ordering problem. In: Proceedings of the ECML PKDD’03, pp 327–338
Mitchell-Jones AJ, Amori G, Bogdanowicz W, Krystufek B, Reijnders PJH, Spitzenberger F, Stubbe M, Thissen JBM, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London
Morik K, Boulicaut JF, Siebes A (eds) (2005) Local pattern detection. Springer, New York
Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55)
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT’99, pp 398–416
Pfahringer B (1995) Compression-based feature subset selection. In: Proceedings of the IJCAI’95 workshop on data engineering for inductive learning, pp 109–119
Quinlan J (1993a) C4.5: programs for machine learning. Morgan-Kaufmann, Los Altos
Quinlan J (1993b) FOIL: a midterm report. In: Proceedings of the ECML’93
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5): 465–471
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of SDM’06, pp 393–404
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of KDD’07, pp 687–696
Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: Proceedings of the ICDM’08, pp 588–597
van Leeuwen M, Siebes A (2008) StreamKrimp: detecting change in data streams. In: Proceedings of ECML PKDD’08, Springer, Heidelberg, pp 672–687
van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks the item sets that matter. In: Proceedings of the ECML PKDD’06, pp 585–592
van Leeuwen M, Vreeken J, Siebes A (2009) Identifying the components. Data Min Knowl Discov 19(2): 173–292
Vreeken J, Siebes A (2008) Filling in the blanks—Krimp minimisation for missing data. In: Proceedings of the ICDM’08, pp 1067–1072
Vreeken J, van Leeuwen M, Siebes A (2007a) Characterising the difference. In: Proceedings of KDD’07, pp 765–774
Vreeken J, van Leeuwen M, Siebes A (2007b) Preserving privacy through data generation. In: Proceedings of the ICDM’07, pp 685–690
Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, New York
Wang J, Karypis G (2005) HARMONY: efficiently mining the best rules for classification. In: Proceedings of SDM’05, pp 205–216
Wang J, Karypis G (2006) On efficiently summarizing categorical databases. Knowl Inf Syst 9(1): 19–37
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of KDD’06, pp 730–735
Warner H, Toronto A, Veasey L, Stephenson R (1961) A mathematical model for medical diagnosis, application to congenital heart disease. J Am Med Assoc 177: 177–184
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Xiang Y, Jin R, Fuhry D, Dragan FF (2008) Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In: Proceedings of KDD’08, pp 758–766
Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: Proceedings of VLDB’05, pp 709–720
Yan X, Cheng H, Han J, Xin D (2005) Summarizing itemset patterns: a profile-based approach. In: Proceedings of KDD’05, pp 314–323
Yin X, Han J (2003) CPAR: Classification based on predictive association rules. In: Proceedings of SDM’03, pp 331–335
Zhang X, Guozhu D, Ramamohanarao K (2000) Information-based classification by aggregating emerging patterns. In: Proceedings of IDEAL’00, pp 48–53

Acknowledgements

Jilles Vreeken is supported by the NWO project Mining Factors of Celiac Disease, part of the Computational Life Sciences Programme. Matthijs van Leeuwen is supported by the NBIC Biorange Programme and the NWO project Exceptional Model Mining, under number 612.065.822. The authors would like to thank Sander Schuckmann for parallelising the Krimp implementation.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

Algorithmic Data Analysis, Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
Jilles Vreeken, Matthijs van Leeuwen & Arno Siebes
ADReM, Department of Mathematics and Computer Science, Faculty of Science, University of Antwerp, Antwerp, Belgium
Jilles Vreeken

Corresponding author

Correspondence to Jilles Vreeken.

Additional information

Responsible editor: M.J. Zaki.

The research described in this paper builds upon and extends the work appearing in SDM’06 (Siebes et al. 2006) and ECML PKDD’06 (van Leeuwen et al. 2006).

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Vreeken, J., van Leeuwen, M. & Siebes, A. Krimp: mining itemsets that compress. Data Min Knowl Disc 23, 169–214 (2011). https://doi.org/10.1007/s10618-010-0202-x

Received: 16 September 2009
Accepted: 21 September 2010
Published: 16 October 2010
Issue Date: July 2011
DOI: https://doi.org/10.1007/s10618-010-0202-x
