Skip to main content
Log in

Maximum entropy based significance of itemsets

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback–Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: PODS ’98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM Press, New York, pp 18–24

  • Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: Buneman P, Jajodia S (eds) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, D.C., pp 207–216

  • Agrawal R, Mannila H, Srikant R, Toivonen H and Verkamo AI (1996). Fast discovery of association rules. In: Fayyad, U, Piatetsky-Shapiro, G, Smyth, P, and Uthurusamy, R (eds) Advances in Knowledge Discovery and Data Mining., pp 307–328. AAAI Press/The MIT Press, Cambridge

    Google Scholar 

  • Boulicaut J-F, Bykowski A, Rigotti C (2000) Approximation of frequency queries by means of free-sets. In: Principles of Data Mining and Knowledge Discovery, pp 75–85

  • Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge Discovery and Data Mining. ACM, New York, pp 254–260

  • Brin S, Motwani R and Silverstein C (1997). Beyond market baskets: Generalizing association rules to correlations. In: Peckham, J (eds) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pp 265–276. ACM Press, New York

    Chapter  Google Scholar 

  • Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. pp 255–264

  • Calders T, Goethals B (2002) Mining all non-derivable frequent itemsets. In: Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases

  • Chow C and Liu C (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3): 462–467

    Article  MATH  MathSciNet  Google Scholar 

  • Cooper G (1990). The computational complexity of probabilistic inference using bayesian belief networks. Artif Intell 42(2–3): 393–405

    Article  MATH  Google Scholar 

  • Csiszár I (1975). I-divergence geometry of probability distributions and minimization problems. Ann Prob 3(1): 146–158

    Article  MATH  Google Scholar 

  • Darroch J and Ratchli D (1972). Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480

    Article  MATH  Google Scholar 

  • Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Knowledge Discovery and Data Mining, pp 43–52

  • DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Knowledge Discovery and Data Mining, pp 67–76

  • Fortelius M (2005) Neogene of the old world database of fossil mammals (NOW). University of Helsinki, http://www.helsinki.fi/science/now/

  • Fortelius M, Gionis A, Jernvall J and Mannila H (2006). Spectral ordering and biochronology of european fossil mammals paleobiology. Paleobiology 32(2): 206–214

    Article  Google Scholar 

  • Gallo A, Bie TD, Christianini N (2007) Mini: Mining informative non-redundant itemsets. In: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp 438–445

  • Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd Annual Transportation Research Board, Washington DC. (USA), January 12–16

  • Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Knowledge Discovery and Data Mining

  • Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD’02, pp 135–147

  • Jiroušek R and Přeušil S (1995). On the effective implementation of the iterative proportional fitting procedure. Comput Stat Data Anal 19: 177–189

    Article  MATH  Google Scholar 

  • Kohavi R, Brodley C, Frasca B, Mason L and Zheng Z (2000). KDD-Cup 2000 organizers report: Peeling the onion. SIGKDD Explorat 2(2): 86–98

    Article  Google Scholar 

  • Kullback S (1968). Information Theory and Statistics. Dover Publications, Inc., New York

    Google Scholar 

  • Mannila H, Mielikäinen T (2003) The pattern ordering problem. In: Principles of Data Mining and Knowledge Discovery, pp 327–338

  • Norén GN, Bate A and Edwards IR (2007). Extending the methods used to screen the who drug safety database towards analysis of complex associations and improved accuracy for rare events. Stat Med 25: 3740–3757

    Article  Google Scholar 

  • Omiecinski ER (2003). Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1): 57–69

    Article  MathSciNet  Google Scholar 

  • Pasquier N, Bastide Y, Taouil R and Lakhal L (1999). Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science 1540: 398–416

    Article  Google Scholar 

  • Pavlov D, Mannila H and Smyth P (2003). Beyond independence: Probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6): 1409–1421

    Article  Google Scholar 

  • Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases. AAAI/MIT Press, New York, pp 229–248

  • Tatti N (2006a) Computational complexity of queries based on itemsets. Inf Process Lett pp 183–187

  • Tatti N (2006). Safe projections of binary data sets. Acta Inf 42(8–9): 617–638

    Article  MATH  MathSciNet  Google Scholar 

  • Tatti N (2007) Maximum entropy based significance of itemsets. In: Proceedings of Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321

  • van der Vaart AW (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge

    Google Scholar 

  • Webb GI (2006) Discovering significant rules. In: Knowledge discovery and data mining, pp 434–443

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nikolaj Tatti.

Additional information

A preliminary version appeared as “Maximum Entropy Based Significance of Itemsets”, In Proceedings of Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321, 2006 [32].

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tatti, N. Maximum entropy based significance of itemsets. Knowl Inf Syst 17, 57–77 (2008). https://doi.org/10.1007/s10115-008-0128-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-008-0128-4

Keywords

Navigation