Abstract
Suppose we are given a set of databases, such as sales records of different branches. How can we characterise the differences and the norm between these datasets? That is, what are the patterns that characterise the general distribution, and what are those that are important to describe the individual datasets? We study how to discover these pattern sets simultaneously and without redundancy, automatically identifying the patterns that help describe the overall distribution, as well as those that are characteristic of specific databases. We define the problem in terms of the Minimum Description Length principle, and propose the DiffNorm algorithm to approximate the MDL-optimal summary directly from data. Empirical evaluation on synthetic and real-world data shows that DiffNorm efficiently discovers descriptions that accurately characterise the difference and the norm in easily understandable terms.
Keywords
- Minimum Description Length
- Kolmogorov Complexity
- Frequent Pattern Mining
- Minimum Description Length Principle
- Sale Record
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Budhathoki, K., Vreeken, J. (2015). The Difference and the Norm — Characterising Similarities and Differences Between Databases. In: Appice, A., Rodrigues, P., Santos Costa, V., Gama, J., Jorge, A., Soares, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science, vol 9285. Springer, Cham. https://doi.org/10.1007/978-3-319-23525-7_13
DOI: https://doi.org/10.1007/978-3-319-23525-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23524-0
Online ISBN: 978-3-319-23525-7
eBook Packages: Computer Science, Computer Science (R0)