Data Mining and Knowledge Discovery, Volume 13, Issue 2, pp 137–166

A Model-Based Frequency Constraint for Mining Associations from Transaction Data

Abstract

Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single, user-specified support threshold is used to decide whether associations should be further investigated. Support has some known problems: it handles rare items poorly, favors shorter itemsets, and sometimes produces misleading associations.
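As a minimal sketch of what support and a single minimum support threshold mean in practice, the following Python fragment counts co-occurrences by brute force. The function names `support` and `frequent_itemsets`, the `max_size` parameter, and the toy transactions are illustrative assumptions and do not come from the paper; a real miner would use a dedicated algorithm rather than exhaustive enumeration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search: keep every itemset (up to `max_size` items)
    whose support reaches the single, user-specified threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

for itemset, s in frequent_itemsets(transactions, min_support=0.5).items():
    print(sorted(itemset), s)
```

Because one threshold applies to all itemsets, a pair of rare items can fall below it even when the two items almost always appear together, which is one way the rare-item problem can arise.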

In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.
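The idea of deriving a local frequency threshold from a model and a precision threshold can be illustrated with a rough sketch. It assumes that the NB model refers to a negative binomial distribution arising as a Gamma mixture of Poisson counts, and it uses one plausible reading of precision: the estimated share of observed counts at or above a threshold that the model does not explain. The names `nb_pmf` and `local_threshold`, the parameters `k` and `a`, and the toy counts are hypothetical; this is not the paper's algorithm or its parameter estimation procedure.

```python
import math


def nb_pmf(r, k, a):
    """P(R = r) when R | lam ~ Poisson(lam) and lam ~ Gamma(shape=k, scale=a).
    The resulting mixture is negative binomial with mean k * a."""
    log_p = (math.lgamma(k + r) - math.lgamma(r + 1) - math.lgamma(k)
             + r * math.log(a / (1.0 + a)) - k * math.log(1.0 + a))
    return math.exp(log_p)


def local_threshold(observed_counts, k, a, min_precision):
    """Smallest count threshold rho at which the estimated precision
    (observed minus model-expected, divided by observed) reaches
    `min_precision`. Returns None if no threshold qualifies."""
    if not observed_counts:
        return None
    n = len(observed_counts)
    for rho in range(1, max(observed_counts) + 1):
        observed = sum(1 for c in observed_counts if c >= rho)
        # Number of counts >= rho that the model expects by chance alone.
        expected = n * (1.0 - sum(nb_pmf(r, k, a) for r in range(rho)))
        precision = max(0.0, (observed - expected) / observed)
        if precision >= min_precision:
            return rho
    return None


# Hypothetical co-occurrence counts for 98 candidate items: mostly small
# values that chance could explain, plus three clear outliers.
counts = [0] * 50 + [1] * 30 + [2] * 10 + [3] * 5 + [20, 25, 30]
print(local_threshold(counts, k=1.0, a=0.7, min_precision=0.9))
```

In this sketch the threshold adapts to the data: a group of itemsets with many chance co-occurrences receives a higher threshold than a group with few, while the user supplies only the precision value.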

Keywords

Data mining, associations, interest measures, mixture models, transaction data

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Vienna University of Economics and Business Administration, Vienna, Austria