Abstract
When analysing binary data, the ease at which one can interpret results is very important. Many existing methods, however, discover either models that are difficult to read, or return so many results interpretation becomes impossible. Here, we study a fully automated approach for mining easily interpretable models for binary data. We model data hierarchically with noisy tiles—rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle.
We propose Stijl, a greedy any-time algorithm for mining good tile trees from binary data. Iteratively, it finds the locally optimal addition to the current tree, allowing overlap with tiles of the same parent. A major result of this paper is that we find the optimal tile in only Θ(NM min(N,M)) time. Stijl can either be employed as a top-k miner, or by MDL we can identify the tree that describes the data best.
Experiments show we find succinct models that accurately summarise the data, and, by their hierarchical property are easily interpretable.
Chapter PDF
Similar content being viewed by others
Keywords
- Minimum Description Length
- Kolmogorov Complexity
- Tile Tree
- Frequent Pattern Mining
- Minimum Description Length Principle
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: VLDB, pp. 487–499 (1994)
Bringmann, B., Zimmermann, A.: The chosen few: On identifying valuable patterns. In: ICDM, pp. 63–72 (2007)
Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: ICDM, pp. 83–92. IEEE (2007)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (2006)
De Bie, T.: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Disc. 23(3), 407–446 (2011)
Fortelius, M., Gionis, A., Jernvall, J., Mannila, H.: Spectral ordering and biochronology of european fossil mammals. Paleobiology 32(2), 206–214 (2006)
Geerts, F., Goethals, B., Mielikäinen, T.: Tiling Databases. In: Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 278–289. Springer, Heidelberg (2004)
Gionis, A., Mannila, H., Seppänen, J.K.: Geometric and Combinatorial Tiles in 0–1 Data. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 173–184. Springer, Heidelberg (2004)
Grünwald, P.: The Minimum Description Length Principle. MIT Press (2007)
Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: Current status and future directions. Data Min. Knowl. Disc. 15 (2007)
Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N., Mannila, H.: Tell me something I don’t know: randomization strategies for iterative data mining. In: KDD, pp. 379–388. ACM (2009)
Kontonasios, K.-N., De Bie, T.: An information-theoretic approach to finding noisy tiles in binary databases. In: SDM, pp. 153–164. SIAM (2010)
Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and its Applications. Springer (1993)
Mampaey, M., Tatti, N., Vreeken, J.: Tell me what I need to know: Succinctly summarizing data with itemsets. In: KDD, pp. 573–581. ACM (2011)
Miettinen, P., Mielikäinen, T., Gionis, A., Das, G., Mannila, H.: The discrete basis problem. IEEE TKDE 20(10), 1348–1362 (2008)
Mitchell-Jones, A., Amori, G., Bogdanowicz, W., Krystufek, B., Reijnders, P.H., Spitzenberger, F., Stubbe, M., Thissen, J., Vohralik, V., Zima, J.: The Atlas of European Mammals. Academic Press (1999)
Myllykangas, S., Himberg, J., Böhling, T., Nagy, B., Hollmén, J., Knuutila, S.: DNA copy number amplification profiling of human neoplasms. Oncogene 25(55), 7324–7332 (2006)
Pensa, R.G., Robardet, C., Boulicaut, J.-F.: A Bi-clustering Framework for Categorical Data. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 643–650. Springer, Heidelberg (2005)
Tatti, N.: Are your items in order? In: SDM 2011, pp. 414–425. SIAM (2011)
Tatti, N., Heikinheimo, H.: Decomposable Families of Itemsets. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 472–487. Springer, Heidelberg (2008)
Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: Mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tatti, N., Vreeken, J. (2012). Discovering Descriptive Tile Trees. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_6
Download citation
DOI: https://doi.org/10.1007/978-3-642-33460-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3
eBook Packages: Computer ScienceComputer Science (R0)