Maximum entropy based significance of itemsets

Tatti, Nikolaj

doi:10.1007/s10115-008-0128-4

Regular Paper
Published: 11 March 2008

Knowledge and Information Systems, Volume 17, pages 57–77 (2008)

Abstract

We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback–Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.
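The estimate-then-compare idea in the abstract can be illustrated with a minimal sketch. It assumes that the sub-itemset frequencies are consistent and lie strictly between 0 and 1, and it fits the maximum entropy distribution by plain iterative scaling. The function names, the example frequencies, and the two-outcome form of the Kullback–Leibler divergence are illustrative assumptions, not the article's exact procedure or definition of the measure.

```python
from itertools import product
from math import log


def fit_maxent(k, freqs, sweeps=500):
    """Fit a distribution over {0,1}^k to itemset frequencies by iterative scaling.

    freqs maps a tuple of item indices to the required frequency of that
    sub-itemset (the probability that all listed items equal 1).
    Assumes every frequency is strictly between 0 and 1 and that the
    constraints are mutually consistent.
    """
    states = list(product((0, 1), repeat=k))
    # The uniform distribution is the maximum entropy start with no constraints.
    p = {s: 1.0 / len(states) for s in states}
    for _ in range(sweeps):
        for items, target in freqs.items():
            covers = {s: all(s[i] for i in items) for s in states}
            mass = sum(p[s] for s in states if covers[s])
            # Rescale so that this one constraint holds exactly.
            for s in states:
                p[s] *= target / mass if covers[s] else (1.0 - target) / (1.0 - mass)
    return p


def kl_bernoulli(observed, estimate):
    """KL divergence between two-outcome distributions (itemset present or absent)."""
    total = 0.0
    for a, b in ((observed, estimate), (1.0 - observed, 1.0 - estimate)):
        if a > 0.0:
            total += a * log(a / b)
    return total


# Hypothetical frequencies for items 0, 1, 2 and all of their pairs.
sub_freqs = {
    (0,): 0.5, (1,): 0.5, (2,): 0.5,
    (0, 1): 0.3, (0, 2): 0.3, (1, 2): 0.3,
}
model = fit_maxent(3, sub_freqs)
estimate = model[(1, 1, 1)]               # estimated frequency of the full itemset
deviation = kl_bernoulli(0.25, estimate)  # 0.25 is a made-up observed frequency
print(f"estimate={estimate:.4f} deviation={deviation:.4f}")
```

With only single-item frequencies as constraints, the fitted model reduces to the independence model, so the same sketch covers the baseline that the abstract contrasts against. A larger deviation means the observed frequency is more surprising given the sub-itemsets.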

References

Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: PODS ’98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM Press, New York, pp 18–24
Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: Buneman P, Jajodia S (eds) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, D.C., pp 207–216
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, Cambridge, pp 307–328
Boulicaut J-F, Bykowski A, Rigotti C (2000) Approximation of frequency queries by means of free-sets. In: Principles of Data Mining and Knowledge Discovery, pp 75–85
Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge Discovery and Data Mining. ACM, New York, pp 254–260
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: Generalizing association rules to correlations. In: Peckham J (ed) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. ACM Press, New York, pp 265–276
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. pp 255–264
Calders T, Goethals B (2002) Mining all non-derivable frequent itemsets. In: Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases
Chow C, Liu C (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3): 462–467
Cooper G (1990) The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell 42(2–3): 393–405
Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Prob 3(1): 146–158
Darroch J, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480
Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Knowledge Discovery and Data Mining, pp 43–52
DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Knowledge Discovery and Data Mining, pp 67–76
Fortelius M (2005) Neogene of the Old World database of fossil mammals (NOW). University of Helsinki, http://www.helsinki.fi/science/now/
Fortelius M, Gionis A, Jernvall J, Mannila H (2006) Spectral ordering and biochronology of European fossil mammals. Paleobiology 32(2): 206–214
Gallo A, De Bie T, Cristianini N (2007) MINI: Mining informative non-redundant itemsets. In: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp 438–445
Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd Annual Transportation Research Board, Washington DC. (USA), January 12–16
Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Knowledge Discovery and Data Mining
Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD’02, pp 135–147
Jiroušek R, Přeučil S (1995) On the effective implementation of the iterative proportional fitting procedure. Comput Stat Data Anal 19: 177–189
Kohavi R, Brodley C, Frasca B, Mason L, Zheng Z (2000) KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorat 2(2): 86–98
Kullback S (1968) Information Theory and Statistics. Dover Publications, Inc., New York
Mannila H, Mielikäinen T (2003) The pattern ordering problem. In: Principles of Data Mining and Knowledge Discovery, pp 327–338
Norén GN, Bate A, Edwards IR (2007) Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events. Stat Med 25: 3740–3757
Omiecinski ER (2003) Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1): 57–69
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science 1540: 398–416
Pavlov D, Mannila H, Smyth P (2003) Beyond independence: Probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6): 1409–1421
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases. AAAI/MIT Press, New York, pp 229–248
Tatti N (2006a) Computational complexity of queries based on itemsets. Inf Process Lett pp 183–187
Tatti N (2006b) Safe projections of binary data sets. Acta Inf 42(8–9): 617–638
Tatti N (2007) Maximum entropy based significance of itemsets. In: Proceedings of Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321
van der Vaart AW (1998) Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge
Webb GI (2006) Discovering significant rules. In: Knowledge discovery and data mining, pp 434–443

Author information

Authors and Affiliations

HIIT Basic Research Unit, Department of Computer Science, Helsinki University of Technology, Helsinki, Finland
Nikolaj Tatti

Corresponding author

Correspondence to Nikolaj Tatti.

Additional information

A preliminary version appeared as “Maximum Entropy Based Significance of Itemsets”, in Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321, 2007 [32].

About this article

Cite this article

Tatti, N. Maximum entropy based significance of itemsets. Knowl Inf Syst 17, 57–77 (2008). https://doi.org/10.1007/s10115-008-0128-4

Received: 07 December 2007
Accepted: 29 January 2008
Published: 11 March 2008
Issue Date: October 2008
DOI: https://doi.org/10.1007/s10115-008-0128-4
