Using background knowledge to rank itemsets

Data Mining and Knowledge Discovery

Abstract

Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since many of the discovered patterns can often be easily explained by background knowledge. The simplest approach to screening out uninteresting patterns is to compare the observed frequency of an itemset against the frequency predicted by the independence model. Since the parameters of the independence model are the column margins, we can view such screening as a way of using the column margins as background knowledge. In this paper we study more flexible techniques for infusing background knowledge. Namely, we show that we can efficiently use additional knowledge such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining. To infuse the information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
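
The independence screening mentioned in the abstract can be illustrated with a minimal sketch. The toy dataset, the function names, and the use of the absolute difference as the deviation measure are all assumptions made for illustration; they are not taken from the paper, and the sketch covers only the column-margin baseline, not the maximum entropy models the paper studies.

```python
from math import prod

def column_margins(rows):
    """Fraction of ones in each column of a binary dataset."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def observed_frequency(rows, itemset):
    """Fraction of rows that contain every item (column index) in the itemset."""
    hits = sum(1 for row in rows if all(row[i] == 1 for i in itemset))
    return hits / len(rows)

def independence_frequency(margins, itemset):
    """Frequency predicted by the independence model: product of column margins."""
    return prod(margins[i] for i in itemset)

def rank_by_deviation(rows, itemsets):
    """Sort itemsets by |observed - predicted|, largest deviation first.

    The absolute difference is an illustrative choice of deviation measure.
    """
    margins = column_margins(rows)
    def deviation(itemset):
        return abs(observed_frequency(rows, itemset)
                   - independence_frequency(margins, itemset))
    return sorted(itemsets, key=deviation, reverse=True)

# Hypothetical toy data: 6 transactions over 3 items.
data = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
margins = column_margins(data)
ranked = rank_by_deviation(data, [(0, 1), (0, 2), (1, 2)])
```

In this toy data the itemset of items 0 and 2 occurs exactly as often as the column margins predict, so under this baseline it would be considered explained by the background knowledge, while items 0 and 1 co-occur more often than predicted and rank first.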

Author information

Corresponding author

Correspondence to Nikolaj Tatti.

Additional information

Responsible editors: José L Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag.

About this article

Cite this article

Tatti, N., Mampaey, M. Using background knowledge to rank itemsets. Data Min Knowl Disc 21, 293–309 (2010). https://doi.org/10.1007/s10618-010-0188-4

  • DOI: https://doi.org/10.1007/s10618-010-0188-4
