The Difference and the Norm — Characterising Similarities and Differences Between Databases

Budhathoki, Kailash; Vreeken, Jilles

doi:10.1007/978-3-319-23525-7_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9285)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Suppose we are given a set of databases, such as sales records from different branches. How can we characterise the differences and the norm between these datasets? That is, which patterns characterise the general distribution, and which are important for describing the individual datasets? We study how to discover these pattern sets simultaneously and without redundancy, automatically identifying the patterns that help describe the overall distribution as well as those that are characteristic of specific databases. We define the problem in terms of the Minimum Description Length (MDL) principle, and propose the DiffNorm algorithm to approximate the MDL-optimal summary directly from data. Empirical evaluation on synthetic and real-world data shows that DiffNorm efficiently discovers descriptions that accurately characterise the difference and the norm in easily understandable terms.
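
To make the MDL framing concrete, the sketch below scores a toy transaction database with a two-part code, L(M) + L(D | M), and checks whether adding one pattern shortens the total description. This is not the DiffNorm encoding, which the abstract does not specify; the greedy cover, the cost functions, and all names (`cover`, `data_cost`, `model_cost`, `total_cost`, the toy `db`) are illustrative assumptions in the general spirit of compression-based pattern mining.

```python
from collections import Counter
from math import log2


def cover(transaction, patterns):
    """Greedily cover a transaction with patterns (largest first), then singletons."""
    remaining = set(transaction)
    used = []
    for p in sorted(patterns, key=lambda p: (-len(p), sorted(p))):
        if p <= remaining:
            used.append(p)
            remaining -= p
    # Items not covered by any pattern are encoded one by one.
    used.extend(frozenset([item]) for item in sorted(remaining))
    return used


def data_cost(db, patterns):
    """L(D | M) in bits, using Shannon-optimal code lengths from usage counts."""
    usage = Counter(code for t in db for code in cover(t, patterns))
    total = sum(usage.values())
    return -sum(n * log2(n / total) for n in usage.values())


def model_cost(db, patterns):
    """L(M) in bits: spell out each pattern item by item with singleton codes.

    Assumes every item in a pattern occurs somewhere in db.
    """
    freq = Counter(item for t in db for item in t)
    total = sum(freq.values())
    return sum(-log2(freq[i] / total) for p in patterns for i in p)


def total_cost(db, patterns):
    """Two-part description length L(M) + L(D | M)."""
    return model_cost(db, patterns) + data_cost(db, patterns)


# Hypothetical toy data: bread and butter are usually bought together.
db = (
    [{"bread", "butter"}] * 4
    + [{"bread", "butter", "jam"}] * 2
    + [{"milk"}] * 2
)
pattern = frozenset({"bread", "butter"})

print(f"singletons only: {total_cost(db, []):.2f} bits")
print(f"with pattern:    {total_cost(db, [pattern]):.2f} bits")
```

On this toy data the pattern pays for itself: the bits spent describing it in the model are outweighed by the bits saved when encoding the transactions. A summary of several databases could apply the same accounting to shared and database-specific pattern sets, but how DiffNorm does so is described in the paper, not here.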

Author information

Authors and Affiliations

Max Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
Kailash Budhathoki & Jilles Vreeken

Corresponding author

Correspondence to Jilles Vreeken.

Editor information

Editors and Affiliations

University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
University of Porto, Porto, Portugal
Pedro Pereira Rodrigues
Universidade do Porto, Porto, Portugal
Vítor Santos Costa
University of Porto - INESC TEC, Porto, Portugal
João Gama
University of Porto - INESC TEC, Porto, Portugal
Alípio Jorge
University of Porto - INESC TEC, Porto, Portugal
Carlos Soares

About this paper

Cite this paper

Budhathoki, K., Vreeken, J. (2015). The Difference and the Norm — Characterising Similarities and Differences Between Databases. In: Appice, A., Rodrigues, P., Santos Costa, V., Gama, J., Jorge, A., Soares, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2015. Lecture Notes in Computer Science, vol 9285. Springer, Cham. https://doi.org/10.1007/978-3-319-23525-7_13

DOI: https://doi.org/10.1007/978-3-319-23525-7_13
Published: 29 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23524-0
Online ISBN: 978-3-319-23525-7
eBook Packages: Computer Science, Computer Science (R0)
