From sets of good redescriptions to good sets of redescriptions

  • Janis Kalofolias
  • Esther Galbrun
  • Pauli Miettinen
Regular Paper

Abstract

Redescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user’s ability to understand the overall characteristics of the data. In this work, we present an approach to identify and remove the redundant redescriptions, that is, an approach to move from a set of good redescriptions to a good set of redescriptions. We measure the redundancy of a redescription using a framework inspired by the concept of subjective interestingness based on maximum entropy distributions, as proposed by De Bie (Data Min Knowl Discov 23(3):407–446, 2011). Redescriptions, however, impose specific requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle both disjunctions and conjunctions in the queries, whereas existing approaches are limited to conjunctive queries. Our framework can also handle Boolean, nominal, and real-valued data, possibly containing missing values, making it applicable to a wide variety of data sets. Our experiments show that our framework can efficiently reduce the redundancy even on large data sets.

Keywords

Data mining, Redescription mining, Pattern selection, Maximum entropy, Subjective interestingness

References

  1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27(2):94–105
  2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of 20th international conference on very large data bases (VLDB’94), pp 487–499
  3. Barber D (2012) Bayesian reasoning and machine learning. Cambridge University Press, Cambridge
  4. Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
  5. Bickel S, Scheffer T (2004) Multi-view clustering. In: Proceedings of the 4th IEEE international conference on data mining (ICDM’04), pp 19–26
  6. Burden RL, Faires JD (2011) Numerical analysis, 9th edn. Brooks/Cole, Boston
  7. De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
  8. Galbrun E, Kimmig A (2014) Finding relational redescriptions. Mach Learn 96(3):225–248
  9. Galbrun E, Miettinen P (2012a) From black and white to full color: extending redescription mining outside the Boolean world. Stat Anal Data Min 5(4):284–303
  10. Galbrun E, Miettinen P (2012b) Siren: an interactive tool for mining and visualizing geospatial redescriptions. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’12), pp 1544–1547
  11. Galbrun E, Miettinen P (2014) Interactive redescription mining. In: Proceedings of the 2014 international conference on management of data (SIGMOD’14), pp 1079–1082
  12. Galbrun E, Miettinen P (2018) Redescription mining. Springer, Cham
  13. Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM’08), pp 334–345
  14. Gray JP (1999) A corrected ethnographic atlas. World Cultures 10(1):24–85
  15. Grove AJ, Halpern JY, Koller D (1992) Random worlds and maximum entropy. In: Proceedings of the 7th annual IEEE symposium on logic in computer science (LICS’92), pp 22–33
  16. Hijmans RJ, Cameron SE, Parra LJ, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978
  17. Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Proceedings of the 6th Pacific–Asia conference on advances in knowledge discovery and data mining (PAKDD’02), pp 135–147
  18. Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9):939–952
  19. Jaynes ET (2003) Probability theory: the logic of science, vol 10. Cambridge University Press, Cambridge, p 33
  20. Jensen FV, Jensen F (1994) Optimal junction trees. In: Proceedings of the 10th annual conference on uncertainty in artificial intelligence (UAI’94), pp 360–366
  21. Kalofolias J, Galbrun E, Miettinen P (2016) From sets of good redescriptions to good sets of redescriptions. In: Proceedings of the 16th IEEE international conference on data mining (ICDM’16), pp 211–220
  22. Kontonasios K-N, De Bie T (2012) Formalizing complex prior information to quantify subjective interestingness of frequent pattern sets. In: Proceedings of the 11th international symposium on advances in intelligent data analysis (IDA’12), pp 161–171
  23. Kontonasios K-N, De Bie T (2015) Subjectively interesting alternative clusterings. Mach Learn 98(1–2):31–56
  24. Kontonasios K-N, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’11), pp 350–359
  25. Kontonasios K-N, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the 2013 European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD’13), pp 256–271
  26. Kröger P (2009) Subspace clustering techniques. In: Liu L, Özsu MT (eds) Encyclopedia of database systems. Springer, Berlin, pp 2873–2875
  27. Mampaey M, Tatti N, Vreeken J (2011) Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 573–581
  28. Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6(4):16:1–16:42
  29. Mannila H, Pavlov D, Smyth P (1999) Prediction with local patterns using cross-entropy. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 357–361
  30. Mihelčić M, Šmuc T (2016) InterSet: interactive redescription set exploration. In: Proceedings of the 19th international conference on discovery science (DS’16), pp 35–50
  31. Mitchell-Jones AJ et al (1999) The atlas of European mammals. Academic Press, New York
  32. Murdock GP (1967) Ethnographic atlas: a summary. Ethnology 6(2):109–236
  33. Novak PK, Lavrač N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
  34. Parida L, Ramakrishnan N (2005) Redescription mining: structure theory and algorithms. In: Proceedings of the 20th national conference on artificial intelligence and the 7th innovative applications of artificial intelligence conference (AAAI’05), pp 837–844
  35. Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6):1409–1421
  36. Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190(3):231–259
  37. Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 266–275
  38. Rasch G (1960) Probabilistic models for some intelligence and achievement tests. Danish Institute for Educational Research, Copenhagen
  39. Tatti N (2006) Computational complexity of queries based on itemsets. Inf Process Lett 98(5):183–187
  40. Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1):57–77
  41. Tatti N, Vreeken J (2011) Comparing apples and oranges. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, pp 398–413
  42. van Leeuwen M, Galbrun E (2015) Association discovery in two-view data. IEEE Trans Knowl Data Eng 27(12):3190–3202
  43. Vreeken J, van Leeuwen M (2011) KRIMP: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
  44. Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 730–735
  45. Wu H, Vreeken J, Tatti N, Ramakrishnan N (2014) Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Discov 28(5–6):1398–1428
  46. Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05), pp 364–373
  47. Zinchenko T, Galbrun E, Miettinen P (2015) Mining predictive redescriptions with trees. In: IEEE international conference on data mining workshops, pp 1672–1675

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. Inria Nancy – Grand Est, Nancy, France
