Joint European Conference on Machine Learning and Knowledge Discovery in Databases

ECML PKDD 2015: Machine Learning and Knowledge Discovery in Databases, pp. 206-223

The Difference and the Norm — Characterising Similarities and Differences Between Databases

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9285)

Abstract

Suppose we are given a set of databases, such as sales records from different branches. How can we characterise the differences and the norm between these datasets? That is, which patterns characterise the general distribution, and which are important for describing the individual datasets? We study how to discover these pattern sets simultaneously and without redundancy, automatically identifying the patterns that help describe the overall distribution as well as those that are characteristic of specific databases. We define the problem in terms of the Minimum Description Length (MDL) principle, and propose the DiffNorm algorithm to approximate the MDL-optimal summary directly from data. Empirical evaluation on synthetic and real-world data shows that DiffNorm efficiently discovers descriptions that accurately characterise the difference and the norm in easily understandable terms.
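
For readers unfamiliar with MDL, the generic two-part form of the principle can be written as below. This is the standard textbook statement, not necessarily the exact encoding used for DiffNorm, which the abstract does not spell out. The symbols are assumptions introduced here for illustration: $\mathcal{M}$ is a class of candidate models, $L(M)$ is the length in bits of describing a model $M$, and $L(D \mid M)$ is the length in bits of the data $D$ when encoded with $M$.

```latex
% Two-part MDL: choose the model that minimises the total description length.
% Notation is assumed for illustration; the abstract itself gives no formula.
\[
  M^{*} \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M}} \; \bigl( L(M) + L(D \mid M) \bigr)
\]
```

Under this reading, a pattern is worth adding to the summary only if the bits it saves when encoding the data outweigh the bits needed to describe the pattern itself.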



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Max Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
