Data Mining and Knowledge Discovery, Volume 19, Issue 2, pp 176–193

Identifying the components

  • Matthijs van Leeuwen
  • Jilles Vreeken
  • Arno Siebes
Open Access
Article

DOI: 10.1007/s10618-009-0137-2

Cite this article as:
van Leeuwen, M., Vreeken, J. & Siebes, A. Identifying the components. Data Min Knowl Disc (2009) 19: 176. doi:10.1007/s10618-009-0137-2

Abstract

Most, if not all, databases are mixtures of samples from different distributions, and transactional data is no exception. In the prototypical example, supermarket basket analysis, one expects a mixture of different buying patterns: households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identifying the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge of the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.
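To make the selection criterion concrete, the following is a minimal sketch of a two-part MDL comparison between describing a transaction database as one component and as two. It is not the authors' algorithm and does not use the paper's encoding. The per-item frequency model, the charge of half of log2(n) bits per parameter, the assignment cost, the toy data, and the names `component_bits` and `partition_bits` are all assumptions made for illustration.

```python
import math
from collections import Counter


def component_bits(transactions, items):
    """Two-part code length, in bits, of one component under a toy model.

    Assumed model (not from the paper): items occur independently, each
    with its own smoothed frequency within the component.
    """
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    # Model cost: an assumed charge of 0.5 * log2(n) bits per item frequency.
    model_bits = 0.5 * math.log2(n) * len(items)
    # Data cost: negative log-likelihood of the transactions given the model.
    data_bits = 0.0
    for item in items:
        present = counts[item]
        p = (present + 0.5) / (n + 1.0)  # smoothing keeps p strictly in (0, 1)
        data_bits -= present * math.log2(p) + (n - present) * math.log2(1.0 - p)
    return model_bits + data_bits


def partition_bits(components, items):
    """Total code length of a database split into the given components."""
    n_total = sum(len(c) for c in components)
    # Assumed cost of stating which component each transaction belongs to.
    assignment_bits = n_total * math.log2(len(components))
    return assignment_bits + sum(component_bits(c, items) for c in components)


# Hypothetical toy database with two clearly different buying patterns.
items = ["a", "b", "c", "d", "e", "f"]
group_1 = [frozenset("abc")] * 20
group_2 = [frozenset("def")] * 20

single = partition_bits([group_1 + group_2], items)
split = partition_bits([group_1, group_2], items)
print(f"one component: {single:.1f} bits, two components: {split:.1f} bits")
```

On this toy data the two-component description is considerably shorter, so a search guided by total code length would prefer the split. In the same spirit, the number of components need not be fixed in advance: a further split is accepted only while it reduces the combined cost of models and data.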

Keywords

MDL, Database components, Clusters

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Matthijs van Leeuwen (1)
  • Jilles Vreeken (1)
  • Arno Siebes (1)
  1. Department of Computer Science, Universiteit Utrecht, Utrecht, The Netherlands
