RealKrimp — Finding Hyperintervals that Compress with MDL for Real-Valued Data

  • Jouke Witteveen
  • Wouter Duivesteijn
  • Arno Knobbe
  • Peter Grünwald
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8819)

Abstract

The MDL Principle (induction by compression) is applied with meticulous effort in the Krimp algorithm for the problem of itemset mining, where one seeks exceptionally frequent patterns in a binary dataset. As is the case with many algorithms in data mining, Krimp is not designed to cope with real-valued data, and it is not able to handle such data natively. Inspired by Krimp’s success at using the MDL Principle in itemset mining, we develop RealKrimp: an MDL-based Krimp-inspired mining scheme that seeks exceptionally high-density patterns in a real-valued dataset. We review how to extend the underlying Kraft inequality, which relates probabilities to codelengths, to real-valued data. Based on this extension we introduce the RealKrimp algorithm: an efficient method to find hyperintervals that compress the real-valued dataset, without the need for pre-algorithm data discretization.
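
The relation between probabilities and codelengths mentioned above can be illustrated with a minimal sketch. It uses only the textbook Kraft inequality and the standard fixed-precision argument for densities; it is not the RealKrimp implementation, and the function names, the `precision` parameter, and the example densities are assumptions made for illustration.

```python
import math

def code_lengths(probs):
    """Idealized code lengths in bits for a discrete distribution: L(x) = -log2 P(x)."""
    return [-math.log2(p) for p in probs]

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: the sum of 2^(-L(x)) over all x."""
    return sum(2.0 ** -length for length in lengths)

def density_code_length(density, precision, dims):
    """Approximate bits needed to encode one real-valued point.

    Assumption of this sketch: the point is recorded up to `precision` in each
    of `dims` dimensions, so its cell has probability about
    density * precision**dims, and the code length is -log2 of that.
    """
    return -math.log2(density) - dims * math.log2(precision)

# Discrete case: these probabilities give lengths that meet the Kraft bound exactly.
lengths = code_lengths([0.5, 0.25, 0.125, 0.125])
total = kraft_sum(lengths)

# Real-valued case (hypothetical numbers): a point in a region where the model
# density is 4 costs fewer bits than under a uniform density of 1.
baseline_bits = density_code_length(1.0, 2.0 ** -8, dims=2)
dense_bits = density_code_length(4.0, 2.0 ** -8, dims=2)
```

In this sketch the precision term is the same for every model, so only the density term changes which model compresses better. Under that assumption, regions of exceptionally high density correspond to shorter codes.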

Keywords

Minimum Description Length, Information Theory, Real-Valued Data, RealKrimp

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jouke Witteveen (1)
  • Wouter Duivesteijn (2)
  • Arno Knobbe (3)
  • Peter Grünwald (4)

  1. ILLC, University of Amsterdam, The Netherlands
  2. Fakultät für Informatik, LS VIII, TU Dortmund, Germany
  3. LIACS, Leiden University, The Netherlands
  4. CWI and Leiden University, The Netherlands
