Data Mining and Knowledge Discovery, Volume 23, Issue 1, pp 169–214

Krimp: mining itemsets that compress

  • Jilles Vreeken
  • Matthijs van Leeuwen
  • Arno Siebes

Open Access article

Abstract

One of the major problems in pattern mining is the explosion of the number of results. Tight constraints reveal only common knowledge, while loose constraints lead to an explosion in the number of returned patterns. This is caused by large groups of patterns essentially describing the same set of transactions. In this paper we approach this problem using the Minimum Description Length (MDL) principle: the best set of patterns is the set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned: a dramatic reduction, of up to seven orders of magnitude, in the number of frequent itemsets. These selections, called code tables, are of high quality. This is shown with compression ratios, swap randomisation, and the accuracies of the code-table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
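
To make the idea of "the set of patterns that compresses the database best" concrete, the following is a minimal sketch of a two-part MDL score for a candidate code table. It is an illustration under stated assumptions, not the Krimp algorithm as specified in the paper: the function names `cover` and `total_encoded_size`, the largest-itemset-first cover order, and the simple singleton-based model cost are all choices made here for the example.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with disjoint itemsets from the code table."""
    remaining = set(transaction)
    used = []
    # Assumed order for this sketch: larger itemsets are tried first.
    for itemset in sorted(code_table, key=len, reverse=True):
        if itemset and itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    if remaining:
        raise ValueError(f"items not covered: {sorted(remaining)}")
    return used


def total_encoded_size(database, code_table):
    """Two-part description length in bits: model cost plus data cost."""
    usage = Counter(x for t in database for x in cover(t, code_table))
    total_usage = sum(usage.values())
    # Shannon code length: frequently used itemsets get short codes.
    code_len = {x: -log2(u / total_usage) for x, u in usage.items()}
    data_bits = sum(usage[x] * code_len[x] for x in usage)

    # Assumed model cost: spell out each used itemset item by item with a
    # singleton-only code, then add the length of the itemset's own code.
    item_count = Counter(i for t in database for i in t)
    n_items = sum(item_count.values())
    item_len = {i: -log2(c / n_items) for i, c in item_count.items()}
    model_bits = sum(sum(item_len[i] for i in x) + code_len[x] for x in usage)
    return model_bits + data_bits


# Hypothetical toy database in which "a" and "b" always occur together.
database = [{"a", "b"}] * 8 + [{"c"}] * 2
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

print(total_encoded_size(database, singletons))
print(total_encoded_size(database, with_pattern))
```

On this toy data the code table that contains the itemset {a, b} yields the smaller total, so under the MDL criterion it would be preferred over the singleton-only table. Code tables must hold `frozenset` objects so that they can be counted.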

Keywords

MDL, Pattern mining, Pattern selection, Itemsets

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Jilles Vreeken (1, 2)
  • Matthijs van Leeuwen (1)
  • Arno Siebes (1)
  1. Algorithmic Data Analysis, Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
  2. ADReM, Department of Mathematics and Computer Science, Faculty of Science, University of Antwerp, Antwerp, Belgium
