Abstract
Deciding whether the results of two different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach is most likely to provide novel insight, it is essential that we can tell how much the information provided by different results, possibly obtained by different methods, actually differs. In this paper we take a first step towards comparing exploratory data mining results on binary data. We propose to convert results into sets of noisy tiles, and to compare these sets by maximum entropy modelling and Kullback–Leibler divergence, both well-founded notions from information theory. The resulting measure is highly flexible and naturally allows us to include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and quantify differences between the results of different exploratory data mining methods. As an application, we show that our measure can also be used to identify which parts of results best redescribe other results. We also study its use for iterative data mining, where at each step one wants to find the result that provides the most novel information. Experimental evaluation shows that our measure gives meaningful results, correctly identifies methods that are similar in nature, automatically provides sound redescriptions of results, and is highly applicable for iterative data mining.
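To make the exact-tile special case concrete, the following minimal sketch computes a plain Jaccard dissimilarity between two collections of exact tiles. It is an illustration only, not the measure proposed in the paper: the representation of a tile as a pair (rows, columns), the choice to compare the sets of covered cells, and the function names `covered_cells` and `jaccard_dissimilarity` are all assumptions made here for the example. Noisy tiles, maximum entropy modelling, and background knowledge are not covered.

```python
from itertools import product


def covered_cells(tiles):
    """Return the set of (row, column) cells covered by a collection of tiles.

    Assumption: each tile is a pair (rows, columns) of iterables of indices,
    and an exact tile covers every cell in rows x columns.
    """
    cells = set()
    for rows, cols in tiles:
        cells.update(product(rows, cols))
    return cells


def jaccard_dissimilarity(tiles_a, tiles_b):
    """Jaccard dissimilarity between the cells covered by two tile collections.

    Returns 0.0 when both collections cover no cells at all
    (a convention chosen for this sketch).
    """
    a = covered_cells(tiles_a)
    b = covered_cells(tiles_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Two hypothetical results, each consisting of a single 2x2 exact tile.
    result_a = [({0, 1}, {0, 1})]
    result_b = [({1, 2}, {1, 2})]
    print(jaccard_dissimilarity(result_a, result_b))
```

In this toy example the two tiles share one cell out of seven covered in total, so the dissimilarity is 1 - 1/7. Identical collections give 0 and collections with no shared cells give 1.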
References
Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, ACM, pp 61–72
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile, pp 487–499
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Probab 3(1): 146–158
Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, ACM
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23: 1–40
Fortelius M, Gionis A, Jernvall J, Mannila H (2006) Spectral ordering and biochronology of European fossil mammals. Paleobiology 32(2): 206–214
Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM), Atlanta, GA
Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inform Syst 28(1): 197–226
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science, pp 278–289
Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0–1 data. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy, pp 173–184
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD), Paris, France, ACM, pp 379–388
Hollmén J, Seppänen JK, Mannila H (2003) Mixture models and frequent sets: combining global and local methods for 0–1 data. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), San Francisco, CA
Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952
Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD), Berlin, Germany, vol 4213. Springer, pp 577–584
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH, SIAM, pp 153–164
Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada, ICDM
Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2): 181–207
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Berkeley, Calif., 1965/66), vol I: Statistics. Univ. California Press, Berkeley, pp 281–297
Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain. Springer, pp 321–336
Mampaey M, Vreeken J (2012) Summarising categorical data by clustering attributes. Data Min Knowl Discov (in press)
Mampaey M, Tatti N, Vreeken J (2012) Succinctly summarizing data with itemsets. ACM Trans Knowl Discov Data (in press)
Mannila H, Terzi E (2007) Nestedness and segmented nestedness. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, ACM, p 489
Miettinen P (2008) On the positive-negative partial set cover problem. Inform Process Lett 108(4): 219–221
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The Atlas of European Mammals. Academic Press
Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. In: Proceedings of the 35th international conference on very large databases (VLDB), Lyon, France, pp 1270–1281
Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332
Pensa RG, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), Porto, Portugal, pp 643–650
Puolamäki K, Hanhijärvi S, Garriga GC (2008) An approximation ratio for biclustering. Inform Process Lett 108(2): 45–49
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning cartwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 266–275
Rasch G (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks paedagogiske Institut
Sammon J (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18: 401–409
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 6th SIAM international conference on data mining (SDM), Bethesda, MD, SIAM, pp 393–404
Tatti N (2006) Computational complexity of queries based on itemsets. Inform Process Lett 98(5): 183–187. doi:10.1016/j.ipl.2006.02.003
Tatti N (2007) Distances between data sets based on summary statistics. J Mach Learn Res 8: 131–154
Tatti N, Vreeken J (2011) Comparing apples and oranges: measuring differences between data mining results. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Athens, Greece. Springer, pp 398–413
Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 765–774
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735
Xiang Y, Jin R, Fuhry D, Dragan F (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2): 215–251
Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, ACM, pp 364–373. doi:10.1145/1081870.1081912
Additional information
Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.
The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 as Tatti and Vreeken (2011).
About this article
Cite this article
Tatti, N., Vreeken, J. Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Disc 25, 173–207 (2012). https://doi.org/10.1007/s10618-012-0275-9
DOI: https://doi.org/10.1007/s10618-012-0275-9