Knowledge and Information Systems, Volume 17, Issue 1, pp 57–77

Maximum entropy based significance of itemsets

  • Nikolaj Tatti
Regular Paper

Abstract

We consider the problem of defining the significance of an itemset. We say that an itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the observed value and the estimate. For the estimation we use Maximum Entropy, and for measuring the deviation we use Kullback–Leibler divergence. A major advantage over previous methods is that we are able to use richer models, whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong, but applying more flexible models leads to good results.
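To make the procedure concrete, here is a minimal sketch (not taken from the paper; the function names and the toy frequencies are invented for illustration) that fits a maximum-entropy distribution to given sub-itemset frequencies via iterative proportional fitting and then scores the observed frequency of the itemset against the fitted estimate with a Bernoulli Kullback–Leibler divergence.

```python
import itertools
import math

def maxent_itemset_estimate(k, sub_freqs, iters=200):
    """Fit a maximum-entropy distribution over k binary items subject to
    the given sub-itemset frequencies (via iterative proportional fitting)
    and return the estimated frequency of the full itemset {0, ..., k-1}."""
    states = list(itertools.product([0, 1], repeat=k))
    p = {s: 1.0 / len(states) for s in states}            # start from the uniform distribution
    for _ in range(iters):
        for items, target in sub_freqs.items():
            hit = {s for s in states if all(s[i] for i in items)}
            cur = sum(p[s] for s in hit)
            for s in states:                               # rescale so the sub-itemset frequency matches its target
                p[s] *= target / cur if s in hit else (1 - target) / (1 - cur)
    return sum(p[s] for s in states if all(s))             # P(all items of the full itemset present)

def bernoulli_kl(f, g):
    """Kullback-Leibler divergence between Bernoulli(f) and Bernoulli(g)."""
    return f * math.log(f / g) + (1 - f) * math.log((1 - f) / (1 - g))

# Toy example: itemset {a, b, c} with singleton and pairwise sub-itemset frequencies.
subs = {(0,): 0.5, (1,): 0.4, (2,): 0.45,
        (0, 1): 0.3, (0, 2): 0.25, (1, 2): 0.2}
estimate = maxent_itemset_estimate(3, subs)
observed = 0.18                                            # frequency of {a, b, c} seen in the data
print(estimate, bernoulli_kl(observed, estimate))
```

When only the singleton frequencies are supplied as constraints, the fitted maximum-entropy distribution is exactly the independence model, so the deviation measured by earlier approaches appears as a special case of this scheme.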

Keywords

Binary data mining · Itemsets · Maximum entropy

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  1. HIIT Basic Research Unit, Department of Computer Science, Helsinki University of Technology, Helsinki, Finland
