Comparing apples and oranges: measuring differences between exploratory data mining results

Tatti, Nikolaj; Vreeken, Jilles

doi:10.1007/s10618-012-0275-9

Published: 22 June 2012

Data Mining and Knowledge Discovery, Volume 25, pages 173–207 (2012)

Abstract

Deciding whether the results of two different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will most likely provide the most novel insight, it is essential that we can tell how much the information provided by different results, possibly obtained by different methods, differs. In this paper we take a first step towards comparing exploratory data mining results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and to compare these sets by maximum entropy modelling and Kullback–Leibler divergence, well-founded notions from information theory. The resulting measure is highly flexible and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different exploratory data mining methods. As an application, we show that our measure can also be used to identify which parts of results best redescribe other results. Furthermore, we study its use for iterative data mining, where one iteratively wants to find the result that will provide maximal novel information. Experimental evaluation shows that our measure gives meaningful results, correctly identifies methods that are similar in nature, automatically provides sound redescriptions of results, and is highly applicable for iterative data mining.
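
The abstract notes that the measure coincides with Jaccard dissimilarity when only exact tiles are considered. The following is a minimal sketch of that special case only, not of the maximum entropy and Kullback–Leibler machinery used for noisy tiles. It assumes that an exact tile can be represented as a pair (rows, columns) and that two results are compared through the sets of cells their tiles cover; the function names `tile_cells`, `covered_area` and `jaccard_dissimilarity` are hypothetical and do not come from the article.

```python
from itertools import product


def tile_cells(rows, cols):
    """Return the set of (row, column) cells covered by one exact tile."""
    return set(product(rows, cols))


def covered_area(tiles):
    """Union of the cells covered by all tiles in one result (assumed representation)."""
    area = set()
    for rows, cols in tiles:
        area |= tile_cells(rows, cols)
    return area


def jaccard_dissimilarity(tiles_a, tiles_b):
    """1 - |A intersect B| / |A union B| over the covered cells of two tile sets."""
    a, b = covered_area(tiles_a), covered_area(tiles_b)
    union = a | b
    if not union:
        # Two empty results cover nothing; treat them as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two illustrative results, each consisting of a single 2x2 tile.
result_a = [((0, 1), (0, 1))]  # rows 0-1, columns 0-1
result_b = [((1, 2), (0, 1))]  # rows 1-2, columns 0-1

# The tiles share 2 of the 6 covered cells, so the value is 1 - 2/6.
print(jaccard_dissimilarity(result_a, result_b))
```

Under these assumptions, identical results have dissimilarity 0 and results that cover disjoint cells have dissimilarity 1, which is the kind of interpretability the abstract refers to.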

References

Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, ACM, pp 61–72
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile, pp 487–499
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Probab 3(1): 146–158
Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, ACM
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23: 1–40
Fortelius M, Gionis A, Jernvall J, Mannila H (2006) Spectral ordering and biochronology of European fossil mammals. Paleobiology 32(2): 206–214
Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM), Atlanta, GA
Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inform Syst 28(1): 197–226
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science, pp 278–289
Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0–1 data. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy, pp 173–184
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD), Paris, France, ACM, pp 379–388
Hollmén J, Seppänen JK, Mannila H (2003) Mixture models and frequent sets: combining global and local methods for 0–1 data. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), San Francisco, CA
Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952
Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD), Berlin, Germany, vol 4213. Springer, pp 577–584
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH, SIAM, pp 153–164
Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada, ICDM
Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2): 181–207
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Berkeley, Calif., 1965/66), vol I: Statistics. Univ. California Press, Berkeley, pp 281–297
Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain. Springer, pp 321–336
Mampaey M, Vreeken J (2012) Summarising categorical data by clustering attributes. Data Min Knowl Discov (in press)
Mampaey M, Tatti N, Vreeken J (2012) Succinctly summarizing data with itemsets. ACM Trans Knowl Discov Data (in press)
Mannila H, Terzi E (2007) Nestedness and segmented nestedness. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, ACM, p 489
Miettinen P (2008) On the positive-negative partial set cover problem. Inform Process Lett 108(4): 219–221
Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362
Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The Atlas of European Mammals. Academic Press
Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. In: Proceedings of the 35th international conference on very large databases (VLDB), Lyon, France, pp 1270–1281
Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332
Pensa RG, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), Porto, Portugal, pp 643–650
Puolamäki K, Hanhijärvi S, Garriga GC (2008) An approximation ratio for biclustering. Inform Process Lett 108(2): 45–49
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning cartwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 266–275
Rasch G (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks paedagogiske Institut
Sammon J (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18: 401–409
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 6th SIAM international conference on data mining (SDM), Bethesda, MD, SIAM, pp 393–404
Tatti N (2006) Computational complexity of queries based on itemsets. Inform Process Lett 98(5): 183–187. doi:10.1016/j.ipl.2006.02.003
Tatti N (2007) Distances between data sets based on summary statistics. J Mach Learn Res 8: 131–154
Tatti N, Vreeken J (2011) Comparing apples and oranges: measuring differences between data mining results. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Athens, Greece. Springer, pp 398–413
Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 765–774
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735
Xiang Y, Jin R, Fuhry D, Dragan F (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2): 215–251
Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, ACM, pp 364–373. doi:10.1145/1081870.1081912

Author information

Authors and Affiliations

Advanced Database Research and Modelling, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
Nikolaj Tatti & Jilles Vreeken

Corresponding author

Correspondence to Nikolaj Tatti.

Additional information

Responsible editors: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 as Tatti and Vreeken (2011).

About this article

Cite this article

Tatti, N., Vreeken, J. Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Disc 25, 173–207 (2012). https://doi.org/10.1007/s10618-012-0275-9

Received: 27 October 2011
Accepted: 05 June 2012
Published: 22 June 2012
Issue Date: September 2012
DOI: https://doi.org/10.1007/s10618-012-0275-9
