Abstract
Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since many of the discovered patterns can often be easily explained by background knowledge. The simplest approach to screening uninteresting patterns is to compare the observed frequency against the independence model. Since the parameters of the independence model are the column margins, such screening can be viewed as using the column margins as background knowledge. In this paper we study more flexible techniques for infusing background knowledge. Namely, we show that we can efficiently use additional statistics such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining. To infuse this information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that the more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
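The independence-model baseline mentioned above can be sketched briefly: an itemset's expected support under independence is the product of its items' column margins, and a large deviation of observed from expected support flags the itemset as unexplained by the margins alone. The snippet below is a minimal illustration on a toy transaction matrix (the data and function names are our own, not from the paper, and the paper's richer maximum entropy models are not shown here).

```python
import numpy as np

# Toy 0/1 transaction matrix: rows are transactions, columns are items.
# Illustrative data only, not taken from the paper.
D = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

margins = D.mean(axis=0)  # column margins = individual item frequencies

def observed_support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return D[:, list(itemset)].all(axis=1).mean()

def independence_estimate(itemset):
    """Expected support if the items were independent: product of margins."""
    return float(np.prod(margins[list(itemset)]))

itemset = (0, 1)
obs = observed_support(itemset)        # 3 of 5 transactions -> 0.6
exp = independence_estimate(itemset)   # 0.8 * 0.8 = 0.64
# A ratio far from 1 marks the itemset as surprising under this model.
print(obs, exp, obs / exp)
```

More informative background statistics, such as the row margins or lazarus counts studied in the paper, tighten this expected support and thereby screen out more of the patterns that the plain independence model would rank as surprising.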
Additional information
Responsible editors: José L Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag.
Cite this article
Tatti, N., Mampaey, M. Using background knowledge to rank itemsets. Data Min Knowl Disc 21, 293–309 (2010). https://doi.org/10.1007/s10618-010-0188-4