Data Mining and Knowledge Discovery, Volume 19, Issue 2, pp 176–193

Identifying the components

  • Matthijs van Leeuwen
  • Jilles Vreeken
  • Arno Siebes
Open Access
Article

Abstract

Most, if not all, databases are mixtures of samples from different distributions. Transactional data is no exception. For the prototypical example, supermarket basket analysis, one also expects a mixture of different buying patterns. Households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identify the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge on the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.

Keywords

MDL, Database components, Clusters

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Matthijs van Leeuwen (1)
  • Jilles Vreeken (1)
  • Arno Siebes (1)
  1. Department of Computer Science, Universiteit Utrecht, Utrecht, The Netherlands
