From sets of good redescriptions to good sets of redescriptions

Kalofolias, Janis; Galbrun, Esther; Miettinen, Pauli

doi:10.1007/s10115-017-1149-7

Regular Paper
Published: 19 January 2018

Volume 57, pages 21–54, (2018)

Knowledge and Information Systems

Abstract

Redescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user’s ability to understand the overall characteristics of the data. In this work, we present an approach to identify and remove the redundant redescriptions, that is, an approach to move from a set of good redescriptions to a good set of redescriptions. We measure the redundancy of a redescription using a framework inspired by the concept of subjective interestingness based on maximum entropy distributions as proposed by De Bie (Data Min Knowl Discov 23(3):407–446, 2011). Redescriptions, however, impose specific requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle both disjunctions and conjunctions in the queries, whereas the existing approaches are limited to conjunctive queries. Our framework can also handle Boolean, nominal, or real-valued data, possibly containing missing values, making it applicable to a wide variety of data sets. Our experiments show that our framework can efficiently reduce the redundancy even on large data sets.
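
As a rough illustration of what "a pair of queries describing roughly the same set of observations" means, the following minimal sketch evaluates two queries on toy Boolean data and compares the sets of entities they select. The data, the variable names (a, b, c, d), and the use of the Jaccard index as the similarity measure are assumptions made for this example; they are not taken from the article, and the sketch does not implement the maximum entropy framework itself.

```python
import numpy as np

# Toy Boolean data (hypothetical): rows are entities, columns are variables.
# View 1 holds variables a and b, view 2 holds variables c and d.
view1 = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [0, 0],
                  [1, 0]], dtype=bool)
view2 = np.array([[1, 1],
                  [1, 0],
                  [0, 1],
                  [0, 0],
                  [1, 1]], dtype=bool)

def jaccard(supp_p, supp_q):
    """Jaccard index of two Boolean support vectors (assumed similarity measure)."""
    union = np.logical_or(supp_p, supp_q).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(supp_p, supp_q).sum() / union)

# Query p over view 1 uses a disjunction: a OR b.
supp_p = view1[:, 0] | view1[:, 1]
# Query q over view 2 is a single literal: c.
supp_q = view2[:, 0]

# The pair (p, q) is a candidate redescription; its quality here is the
# overlap between the entities selected by p and those selected by q.
similarity = jaccard(supp_p, supp_q)
print(similarity)
```

In this toy data, p selects four entities and q selects three of the same four, so the two queries describe roughly, but not exactly, the same entities. A miner that lists every such pair above a similarity threshold will tend to return many near-duplicates, which is the redundancy the article sets out to reduce.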

Notes

The present work is an extended version of our earlier conference publication [21] (Kalofolias et al. 2016).
The Dirac delta, which is the continuous equivalent of the Kronecker delta, is a generalised function that assumes an infinite mass when its argument is zero, in our case effectively ensuring that only the case of \(\boldsymbol{r}=\boldsymbol{r}_\rho\) is possible.
The source code is available at http://siren.mpi-inf.mpg.de/max-ent/.
http://siren.gforge.inria.fr, accessed 13 December 2017.
http://www.informatik.uni-trier.de/~ley/db/, accessed 13 December 2017.
https://archive.ics.uci.edu/ml/datasets/Covertype, accessed 13 December 2017.
http://intersci.ss.uci.edu/wiki/index.php/Ethnographic_Atlas#Rdata_format_version_of_Ethnographic_Atlas, accessed 13 December 2017.

References

Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27(2):94–105
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of 20th international conference on very large data bases (VLDB’94), pp 487–499
Barber D (2012) Bayesian reasoning and machine learning. Cambridge University Press, Cambridge
Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
Bickel S, Scheffer T (2004) Multi-view clustering. In: Proceedings of the 4th IEEE international conference on data mining (ICDM’04), pp 19–26
Burden RL, Faires JD (2011) Numerical analysis, 9th edn. Brooks/Cole, Boston
De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
Galbrun E, Kimmig A (2014) Finding relational redescriptions. Mach Learn 96(3):225–248
Galbrun E, Miettinen P (2012a) From black and white to full color: extending redescription mining outside the Boolean world. Stat Anal Data Min 5(4):284–303
Galbrun E, Miettinen P (2012b) Siren: an interactive tool for mining and visualizing geospatial redescriptions. In: Proceedings of the 18th ACM SIGKDD International conference on knowledge discovery and data mining (KDD’12), pp 1544–1547
Galbrun E, Miettinen P (2014) Interactive redescription mining. In: Proceedings of the 2014 international conference on management of data (SIGMOD’14), pp 1079–1082
Galbrun E, Miettinen P (2018) Redescription mining. Springer, Cham
Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM’08), pp 334–345
Gray JP (1999) A corrected ethnographic atlas. World Cultures 10(1):24–85
Grove AJ, Halpern JY, Koller D (1992) Random worlds and maximum entropy. In: Proceedings of the 7th annual IEEE symposium on logic in computer science (LICS’92), pp 22–33
Hijmans RJ, Cameron SE, Parra LJ, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978
Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Proceedings of the 6th Pacific–Asia conference on advances in knowledge discovery and data mining (PAKDD’02), pp 135–147
Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9):939–952
Jaynes ET (2003) Probability theory: the logic of science, vol 10. Cambridge University Press, Cambridge, p 33
Jensen FV, Jensen F (1994) Optimal junction trees. In: Proceedings of the 10th annual conference on uncertainty in artificial intelligence (UAI’94), pp 360–366
Kalofolias J, Galbrun E, Miettinen P (2016) From sets of good redescriptions to good sets of redescriptions. In: Proceedings of the 16th IEEE international conference on data mining (ICDM’16), pp 211–220
Kontonasios K-N, De Bie T (2012) Formalizing complex prior information to quantify subjective interestingness of frequent pattern sets. In: Proceedings of the 11th international symposium on advances in intelligent data analysis (IDA’12), pp 161–171
Kontonasios K-N, De Bie T (2015) Subjectively interesting alternative clusterings. Mach Learn 98(1–2):31–56
Kontonasios K-N, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’11), pp 350–359
Kontonasios K-N, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the 2013 European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD’13), pp 256–271
Kröger P (2009) Subspace clustering techniques. In: Liu L, Özsu MT (eds) Encyclopedia of database systems. Springer, Berlin, pp 2873–2875
Mampaey M, Tatti N, Vreeken J (2011) Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 573–581
Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6(4):16:1–16:42
Mannila H, Pavlov D, Smyth P (1999) Prediction with local patterns using cross-entropy. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 357–361
Mihelčić M, Šmuc T (2016) InterSet: interactive redescription set exploration. In: Proceedings of the 19th international conference on discovery science (DS’16), pp 35–50
Mitchell-Jones AJ et al (1999) The atlas of European mammals. Academic Press, New York
Murdock GP (1967) Ethnographic atlas: a summary. Ethnology 6(2):109–236
Novak PK, Lavrač N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
Parida L, Ramakrishnan N (2005) Redescription mining: structure theory and algorithms. In: Proceedings of the 20th national conference on artificial intelligence and the 7th innovative applications of artificial intelligence conference (AAAI’05), pp 837–844
Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6):1409–1421
Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190(3):231–259
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 266–275
Rasch G (1960) Probabilistic models for some intelligence and achievement tests. Danish Institute for Educational Research, Copenhagen
Tatti N (2006) Computational complexity of queries based on itemsets. Inf Process Lett 98(5):183–187
Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1):57–77
Tatti N, Vreeken J (2011) Comparing apples and oranges. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, pp 398–413
van Leeuwen M, Galbrun E (2015) Association discovery in two-view data. IEEE Trans Knowl Data Eng 27(12):3190–3202
Vreeken J, van Leeuwen M (2011) KRIMP: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 730–735
Wu H, Vreeken J, Tatti N, Ramakrishnan N (2014) Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Discov 28(5–6):1398–1428
Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05), pp 364–373
Zinchenko T, Galbrun E, Miettinen P (2015) Mining predictive redescriptions with trees. In: IEEE International conference on data mining workshops, pp 1672–1675

Author information

Authors and Affiliations

Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Janis Kalofolias & Pauli Miettinen
Inria Nancy – Grand Est, Nancy, France
Esther Galbrun

Corresponding author

Correspondence to Janis Kalofolias.

About this article

Cite this article

Kalofolias, J., Galbrun, E. & Miettinen, P. From sets of good redescriptions to good sets of redescriptions. Knowl Inf Syst 57, 21–54 (2018). https://doi.org/10.1007/s10115-017-1149-7

Received: 16 March 2017
Revised: 20 September 2017
Accepted: 27 December 2017
Published: 19 January 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s10115-017-1149-7
