Abstract
Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since many of the discovered patterns can often be easily explained by background knowledge. The simplest approach to screening uninteresting patterns is to compare the observed frequency against the independence model. Since the parameters of the independence model are the column margins, such screening can be viewed as using the column margins as background knowledge. In this paper we study more flexible techniques for infusing background knowledge. Namely, we show that we can efficiently use additional statistics such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining. To infuse this information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that the more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
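The independence-model baseline mentioned above can be sketched briefly: an itemset's expected support under independence is the product of its items' column margins, and a large deviation of observed from expected support flags the itemset as unexplained by the margins alone. The snippet below is a minimal illustration on a toy transaction matrix (the data and function names are our own, not from the paper, and the paper's richer maximum entropy models are not shown here).

```python
import numpy as np

# Toy 0/1 transaction matrix: rows are transactions, columns are items.
# Illustrative data only, not taken from the paper.
D = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

margins = D.mean(axis=0)  # column margins = individual item frequencies

def observed_support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return D[:, list(itemset)].all(axis=1).mean()

def independence_estimate(itemset):
    """Expected support if the items were independent: product of margins."""
    return float(np.prod(margins[list(itemset)]))

itemset = (0, 1)
obs = observed_support(itemset)        # 3 of 5 transactions -> 0.6
exp = independence_estimate(itemset)   # 0.8 * 0.8 = 0.64
# A ratio far from 1 marks the itemset as surprising under this model.
print(obs, exp, obs / exp)
```

More informative background statistics, such as the row margins or lazarus counts studied in the paper, tighten this expected support and thereby screen out more of the patterns that the plain independence model would rank as surprising.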
Additional information
Responsible editors: José L Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag.
Cite this article
Tatti, N., Mampaey, M. Using background knowledge to rank itemsets. Data Min Knowl Disc 21, 293–309 (2010). https://doi.org/10.1007/s10618-010-0188-4