Mining and Using Sets of Patterns through Compression

Chapter

Abstract

In this chapter we describe how to successfully apply the MDL principle to pattern mining. In particular, we discuss how pattern-based models can be designed and induced by means of compression, resulting in succinct and characteristic descriptions of the data.

As motivation, we argue that traditional pattern mining asks the wrong question: instead of asking for all patterns satisfying some interestingness measure, one should ask for a small, non-redundant, and interesting set of patterns—which allows us to avoid the pattern explosion. Firmly rooted in algorithmic information theory, the approach we discuss in this chapter states that the best set of patterns is that set that compresses the data best. We formalize this problem using the Minimum Description Length (MDL) principle, describe useful model classes, and briefly discuss algorithmic approaches to inducing good models from data. Last but not least, we describe how the obtained models—in addition to showing the key patterns of the data—can be used for a wide range of data mining tasks; hence showing that MDL selects useful patterns.

Keywords

Compression MDL Pattern set mining Data summarization 

References

  1. 1.
    P. Adriaans and P. Vitányi. Approximation of the two-part MDL code. IEEE TIT, 55(1):444–457, 2009.Google Scholar
  2. 2.
    H. Akaike. A new look at the statistical model identification. IEEE TAC, 19(6):716–723, 1974.MATHMathSciNetGoogle Scholar
  3. 3.
    L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos. CompreX (refer MSP): Compression based anomaly detection. In CIKM. ACM, 2012.Google Scholar
  4. 4.
    L. Akoglu, J. Vreeken, H. Tong, N. Tatti, and C. Faloutsos. Mining connection pathways for marked nodes in large graphs. In SDM. SIAM, 2013.Google Scholar
  5. 5.
    R. Bathoorn, A. Koopman, and A. Siebes. Reducing the frequent pattern set. In ICDM-Workshop, pages 1–5, 2006.Google Scholar
  6. 6.
    C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. Robust information-theoretic clustering. In KDD, pages 65–75, 2006.Google Scholar
  7. 7.
    F. Bonchi, M. van Leeuwen, and A. Ukkonen. Characterizing uncertain data using compression. In SDM, pages 534–545, 2011.Google Scholar
  8. 8.
    S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In VLDB, pages 606–617. Morgan Kaufmann, 1998.Google Scholar
  9. 9.
    D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, pages 79–88, 2004.Google Scholar
  10. 10.
    V. Chandola and V. Kumar. Summarization – compressing data into an informative representation. Knowl. Inf. Sys., 12(3):355–378, 2007.CrossRefGoogle Scholar
  11. 11.
    R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE TIT, 51(4):1523–1545, 2005.Google Scholar
  12. 12.
    T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience New York, 2006.Google Scholar
  13. 13.
    T. De Bie. An information theoretic framework for data mining. In KDD, pages 564–572. ACM, 2011.Google Scholar
  14. 14.
    A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39(1):1–38, 1977.MATHMathSciNetGoogle Scholar
  15. 15.
    C. Faloutsos and V. Megalooikonomou. On data mining, compression and Kolmogorov complexity. Data Min. Knowl. Disc., 15(1):3–20, 2007.CrossRefMathSciNetGoogle Scholar
  16. 16.
    U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In UAI, pages 1022–1027, 1993.Google Scholar
  17. 17.
    R. A. Fisher. On the interpretation of χ2from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1):87–94, 1922.CrossRefGoogle Scholar
  18. 18.
    F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In DS, pages 278–289, 2004.Google Scholar
  19. 19.
    A. Gionis, H. Mannila, and J. K. Seppänen. Geometric and combinatorial tiles in 0-1 data. In PKDD, pages 173–184. Springer, 2004.Google Scholar
  20. 20.
    P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.Google Scholar
  21. 21.
    T. Guns, S. Nijssen, and L. D. Raedt. Itemset mining: A constraint programming perspective. Artif. Intell., 175(12-13):1951–1983, 2011.CrossRefMATHGoogle Scholar
  22. 22.
    E. Halperin and R. M. Karp. The minimum-entropy set cover problem. TCS, 348(2-3):240–250, 2005.CrossRefMATHMathSciNetGoogle Scholar
  23. 23.
    H. Heikinheimo, J. K. Seppänen, E. Hinkkanen, H. Mannila, and T. Mielikäinen. Finding low-entropy sets and trees from binary data. In KDD, pages 350–359, 2007.Google Scholar
  24. 24.
    H. Heikinheimo, J. Vreeken, A. Siebes, and H. Mannila. Lowentropy set selection. In SDM, pages 569–580, 2009.Google Scholar
  25. 25.
    E. Jaynes. On the rationale of maximum-entropy methods. Proc. IEEE, 70(9):939–952, 1982.CrossRefGoogle Scholar
  26. 26.
    U. Kang and C. Faloutsos. Beyond caveman communities: Hubs and spokes for graph compression and mining. In ICDM, pages 300–309. IEEE, 2011.Google Scholar
  27. 27.
    R. M. Karp. Reducibility among combinatorial problems. In Proc. Compl. Comp. Comput., pages 85–103, New York, USA, 1972.Google Scholar
  28. 28.
    E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In KDD, pages 206–215, 2004.Google Scholar
  29. 29.
    E. Keogh, S. Lonardi, C. A. Ratanamahatana, L. Wei, S.-H. Lee, and J. Handley. Compression-based data mining of sequential data. Data Min. Knowl. Disc., 14(1):99–129, 2007.CrossRefMathSciNetGoogle Scholar
  30. 30.
    P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Inf. Process. Lett., 103(6):227–233, 2007.CrossRefMATHGoogle Scholar
  31. 31.
    P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri. An MDL framework for clustering. Technical report, HIIT, 2004. Technical Report 2004–6.Google Scholar
  32. 32.
    A. Koopman and A. Siebes. Discovering relational items sets efficiently. In SDM, pages 108–119, 2008.Google Scholar
  33. 33.
    A. Koopman and A. Siebes. Characteristic relational patterns. In KDD, pages 437–446, 2009.Google Scholar
  34. 34.
    H. T. Lam, F. Mörchen, D. Fradkin, and T. Calders. Mining compressing sequential patterns. In SDM, 2012.Google Scholar
  35. 35.
    H. T. Lam, T. Calders, J. Yang, F. Moerchen, and D. Fradkin.: Mining compressing sequential patterns in streams. In IDEA, pages 54–62, 2013.Google Scholar
  36. 36.
    M. van Leeuwen and A. Siebes. StreamKrimp: Detecting change in data streams. In ECML PKDD, pages 672–687, 2008.Google Scholar
  37. 37.
    M. van Leeuwen, J. Vreeken, and A. Siebes. Compression picks the item sets that matter. In PKDD, pages 585–592, 2006.Google Scholar
  38. 38.
    M. van Leeuwen, F. Bonchi, B. Sigurbjörnsson, and A. Siebes. Compressing tags to find interesting media groups. In CIKM, pages 1147–1156, 2009.Google Scholar
  39. 39.
    M. van Leeuwen, J. Vreeken, and A. Siebes. Identifying the components. Data Min. Knowl. Disc., 19(2):173–292, 2009.CrossRefMathSciNetGoogle Scholar
  40. 40.
    M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 1993.Google Scholar
  41. 41.
    M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi. The similarity metric. IEEE TIT, 50(12): 3250–3264, 2004.MathSciNetGoogle Scholar
  42. 42.
    C. Lucchese, S. Orlando, and R. Perego. Mining top-k patterns from binary datasets in presence of noise. In SDM, pages 165–176, 2010.Google Scholar
  43. 43.
    M. Mampaey and J. Vreeken. Summarising categorical data by clustering attributes. Data Min. Knowl. Disc., 26(1):130–173, 2013.CrossRefMATHMathSciNetGoogle Scholar
  44. 44.
    M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. ACM TKDD, 6:1–44, 2012.CrossRefGoogle Scholar
  45. 45.
    P. Miettinen and J. Vreeken. Model order selection for Boolean matrix factorization. In KDD, pages 51–59. ACM, 2011.Google Scholar
  46. 46.
    P. Miettinen and J. Vreeken. mdl4bmf: Minimum description length for Boolean matrix factorization. ACM TKDD. In Press.Google Scholar
  47. 47.
    S. Papadimitriou, J. Sun, C. Faloutsos, and P. S. Yu. Hierarchical, parameter-free community discovery. In ECML PKDD, pages 170–187, 2008.Google Scholar
  48. 48.
    B. Pfahringer. Compression-based feature subset selection. In Proc. IJCAI'95 Workshop on Data Engineering for Inductive Learning, pages 109–119, 1995.Google Scholar
  49. 49.
    B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and which ones? In ICDM. IEEE, 2012.Google Scholar
  50. 50.
    J. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann, Los Altos, California, 1993.Google Scholar
  51. 51.
    L. D. Raedt. Declarative modeling for machine learning and data mining. In ECML PKDD, pages 2–3, 2012.Google Scholar
  52. 52.
    J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465–471, 1978.CrossRefMATHGoogle Scholar
  53. 53.
    G. Schwarz. Estimating the dimension of a model. Annals Stat., 6(2):461–464, 1978.CrossRefMATHGoogle Scholar
  54. 54.
    H. Shao, B. Tong, and E. Suzuki. Extended MDL principle for feature-based inductive transfer learning. Knowl. Inf. Sys., 35(2):365–389, 2013.CrossRefGoogle Scholar
  55. 55.
    A. Siebes. Queries for data analysis. In IDA, pages 7–22, 2012.Google Scholar
  56. 56.
    A. Siebes and R. Kersten. A structure function for transaction data. In SDM, pages 558–569. SIAM, 2011.Google Scholar
  57. 57.
    A. Siebes, J. Vreeken, and M. van Leeuwen. Item sets that compress. In SDM, pages 393–404. SIAM, 2006.Google Scholar
  58. 58.
    K. Smets and J. Vreeken. The odd one out: Identifying and characterising anomalies. In SDM, pages 804–815. SIAM, 2011.Google Scholar
  59. 59.
    K. Smets and J. Vreeken. Slim: Directly mining descriptive patterns. In SDM, pages 236–247. SIAM, 2012.Google Scholar
  60. 60.
    J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. Graphscope: parameter-free mining of large time-evolving graphs. In KDD, pages 687–696, 2007.Google Scholar
  61. 61.
    N. Tatti. Computational complexity of queries based on itemsets. Inf. Process. Lett., 98(5): 183–187, 2006.CrossRefMATHMathSciNetGoogle Scholar
  62. 62.
    N. Tatti and J. Vreeken. Finding good itemsets by packing data. In ICDM, pages 588–597, 2008.Google Scholar
  63. 63.
    N. Tatti and J. Vreeken. Discovering descriptive tile trees by fast mining of optimal geometric subtiles. In ECML PKDD. Springer, 2012.Google Scholar
  64. 64.
    N. Tatti and J. Vreeken. The long and the short of it: Summarizing event sequences with serial episodes. In KDD. ACM, 2012.Google Scholar
  65. 65.
    J. Vreeken and A. Siebes. Filling in the blanks: Krimp minimisation for missing data. In ICDM, pages 1067–1072. IEEE, 2008.Google Scholar
  66. 66.
    J. Vreeken, M. van Leeuwen, and A. Siebes. Characterising the difference. In KDD, pages 765–774, 2007.Google Scholar
  67. 67.
    J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In ICDM, pages 685–690. IEEE, 2007.Google Scholar
  68. 68.
    J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: Mining itemsets that compress. Data Min. Knowl. Disc., 23(1):169–214, 2011.CrossRefMATHGoogle Scholar
  69. 69.
    C. Wallace. Statistical and inductive inference by minimum message length. Springer-Verlag, 2005.Google Scholar
  70. 70.
    C. Wang and S. Parthasarathy. Summarizing itemset patterns using probabilistic models. In KDD, pages 730–735, 2006.Google Scholar
  71. 71.
    H. Warner, A. Toronto, L. Veasey, and R. Stephenson. A mathematical model for medical diagnosis, application to congenital heart disease. J. Am. Med. Assoc., 177:177–184, 1961.CrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. 1.KU LeuvenLeuvenBelgium
  2. 2.Max-Planck Institute for Informatics and Saarland UniversitySaarbrückenGermany

Personalised recommendations