The Difference and the Norm — Characterising Similarities and Differences Between Databases
Suppose we are given a set of databases, such as sales records of different branches. How can we characterise the differences and the norm between these datasets? That is, which patterns characterise the general distribution, and which are important for describing the individual datasets? We study how to discover these pattern sets simultaneously and without redundancy – automatically identifying the patterns that help describe the overall distribution, as well as those that are characteristic of specific databases. We define the problem in terms of the Minimum Description Length (MDL) principle, and propose the DiffNorm algorithm to approximate the MDL-optimal summary directly from data. Empirical evaluation on synthetic and real-world data shows that DiffNorm efficiently discovers descriptions that accurately characterise the difference and the norm in easily understandable terms.
Keywords: Minimum Description Length; Kolmogorov Complexity; Frequent Pattern Mining; Minimum Description Length Principle; Sale Record