
Comparing apples and oranges: measuring differences between exploratory data mining results

Data Mining and Knowledge Discovery

Abstract

Deciding whether the results of two different mining algorithms provide significantly different information is an important, yet understudied, open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will most likely provide the most novel insight, it is essential that we can tell how much the information provided by different results, possibly obtained by different methods, differs. In this paper we take a first step towards comparing exploratory data mining results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and to compare these sets using maximum entropy modelling and Kullback–Leibler divergence, well-founded notions from Information Theory. We thus construct a measure that is highly flexible and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different exploratory data mining methods. As an application, we show that our measure can also be used to identify which parts of results best redescribe other results. Furthermore, we study its use for iterative data mining, where one iteratively wants to find the result that provides the most novel information. Experimental evaluation shows our measure gives meaningful results, correctly identifies methods that are similar in nature, automatically provides sound redescriptions of results, and is highly applicable for iterative data mining.
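
To make the exact-tile special case mentioned above concrete, the following is a minimal sketch of Jaccard dissimilarity between two sets of exact tiles. It is not the maximum entropy measure proposed in the paper, only an illustration of the quantity the abstract says the measure reduces to. The tile representation (a pair of row and column index sets), the choice to compare the sets of covered cells, and the function names `covered_cells` and `jaccard_dissimilarity` are all assumptions made for this illustration.

```python
from itertools import product


def covered_cells(tiles):
    """Return the set of (row, column) cells covered by a collection of tiles.

    Each tile is assumed (hypothetically) to be a (rows, columns) pair of
    index collections; the tile covers every cell in rows x columns.
    """
    cells = set()
    for rows, cols in tiles:
        cells.update(product(rows, cols))
    return cells


def jaccard_dissimilarity(tiles_a, tiles_b):
    """Jaccard dissimilarity between the cells covered by two tile sets."""
    a = covered_cells(tiles_a)
    b = covered_cells(tiles_b)
    union = a | b
    if not union:
        # Two empty results are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two toy results, each consisting of a single 2-by-2 tile.
result_a = [({0, 1}, {0, 1})]
result_b = [({1, 2}, {1, 2})]

# The tiles share one cell out of seven covered in total.
print(jaccard_dissimilarity(result_a, result_b))
```

In this toy example the two tiles overlap in the single cell (1, 1) and together cover seven cells, so the dissimilarity is 1 - 1/7. Identical tile sets give 0 and tile sets with no shared cells give 1.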


References

  • Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM international conference on management of data (SIGMOD), Philadelphia, PA, ACM, pp 61–72

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile, pp 487–499

  • Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York

  • Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Probab 3(1): 146–158

  • Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480

  • De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, ACM

  • De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23: 1–40

  • Fortelius M, Gionis A, Jernvall J, Mannila H (2006) Spectral ordering and biochronology of European fossil mammals. Paleobiology 32(2): 206–214

  • Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM), Atlanta, GA

  • Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inform Syst 28(1): 197–226

  • Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science, pp 278–289

  • Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0–1 data. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy, pp 173–184

  • Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD), Paris, France, ACM, pp 379–388

  • Hollmén J, Seppänen JK, Mannila H (2003) Mixture models and frequent sets: combining global and local methods for 0–1 data. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), San Francisco, CA

  • Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9): 939–952

  • Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD), Berlin, Germany, vol 4213. Springer, pp 577–584

  • Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH, SIAM, pp 153–164

  • Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada, ICDM

  • Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2): 181–207

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Berkeley, Calif., 1965/66), vol I: Statistics. Univ. California Press, Berkeley, pp 281–297

  • Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain. Springer, pp 321–336

  • Mampaey M, Vreeken J (2012) Summarising categorical data by clustering attributes. Data Min Knowl Discov (in press)

  • Mampaey M, Tatti N, Vreeken J (2012) Succinctly summarizing data with itemsets. ACM Trans Knowl Discov Data (in press)

  • Mannila H, Terzi E (2007) Nestedness and segmented nestedness. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, ACM, p 489

  • Miettinen P (2008) On the positive-negative partial set cover problem. Inform Process Lett 108(4): 219–221

  • Miettinen P, Mielikäinen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10): 1348–1362

  • Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The Atlas of European Mammals. Academic Press

  • Müller E, Günnemann S, Assent I, Seidl T (2009) Evaluating clustering in subspace projections of high dimensional data. In: Proceedings of the 35th international conference on very large databases (VLDB), Lyon, France, pp 1270–1281

  • Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332

  • Pensa RG, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), Porto, Portugal, pp 643–650

  • Puolamäki K, Hanhijärvi S, Garriga GC (2008) An approximation ratio for biclustering. Inform Process Lett 108(2): 45–49

  • Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning cartwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 266–275

  • Rasch G (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks paedagogiske Institut

  • Sammon J (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18: 401–409

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the 6th SIAM international conference on data mining (SDM), Bethesda, MD, SIAM, pp 393–404

  • Tatti N (2006) Computational complexity of queries based on itemsets. Inform Process Lett 98(5): 183–187. doi:10.1016/j.ipl.2006.02.003

  • Tatti N (2007) Distances between data sets based on summary statistics. J Mach Learn Res 8: 131–154

  • Tatti N, Vreeken J (2011) Comparing apples and oranges: measuring differences between data mining results. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Athens, Greece. Springer, pp 398–413

  • Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 765–774

  • Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214

  • Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735

  • Xiang Y, Jin R, Fuhry D, Dragan F (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2): 215–251

  • Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, ACM, pp 364–373. doi:10.1145/1081870.1081912

Author information

Corresponding author

Correspondence to Nikolaj Tatti.

Additional information

Responsible editors: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 as Tatti and Vreeken (2011).

About this article

Cite this article

Tatti, N., Vreeken, J. Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Disc 25, 173–207 (2012). https://doi.org/10.1007/s10618-012-0275-9

  • DOI: https://doi.org/10.1007/s10618-012-0275-9
