Applied Intelligence, Volume 18, Issue 1, pp 91–104

Identifying Approximate Itemsets of Interest in Large Databases

  • Chengqi Zhang
  • Shichao Zhang
  • Geoffrey I. Webb

Abstract

This paper presents a method for discovering approximate frequent itemsets of interest in large-scale databases. The method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning uninteresting frequent itemsets from the search space. In addition to improving efficiency, this pruning also reduces the number of itemsets that the user needs to consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach.
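
As a rough illustration of why a central-limit-theorem argument can cut the sample size by about half, the sketch below compares two standard ways of choosing a sample size n so that an itemset's sampled support is within eps of its true support with probability at least 1 - delta. It is not the paper's own derivation: the function names, the use of the two-sided Hoeffding (Chernoff-style) bound as the "previous approximation", and the worst-case variance p(1 - p) <= 1/4 are all assumptions made for this example.

```python
import math
from statistics import NormalDist


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Sample size from the two-sided Hoeffding (Chernoff-style) bound.

    Solves 2 * exp(-2 * n * eps**2) <= delta for n.
    Assumed here as a stand-in for earlier sampling bounds.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def clt_sample_size(eps: float, delta: float) -> int:
    """Sample size from the normal (central limit theorem) approximation.

    Uses the worst-case Bernoulli variance p * (1 - p) <= 1/4, so that
    n >= z**2 / (4 * eps**2), where z is the (1 - delta/2) normal quantile.
    """
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return math.ceil(z ** 2 / (4.0 * eps ** 2))


if __name__ == "__main__":
    eps, delta = 0.01, 0.05  # hypothetical error tolerance and failure probability
    n_bound = hoeffding_sample_size(eps, delta)
    n_clt = clt_sample_size(eps, delta)
    print(f"Hoeffding bound: n = {n_bound}")
    print(f"CLT approximation: n = {n_clt}")
    print(f"ratio: {n_clt / n_bound:.2f}")
```

With these hypothetical settings the CLT-based size comes out at roughly half of the bound-based size. The trade-off is that the normal approximation is asymptotic, whereas the Hoeffding bound holds for every n.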

Keywords: data mining, sampling, approximate frequent itemset

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Chengqi Zhang (1)
  • Shichao Zhang (1, 2)
  • Geoffrey I. Webb (3)

  1. Faculty of Information Technology, University of Technology, Sydney, Broadway, Australia
  2. School of Computing, Guangxi University, People's Republic of China
  3. School of Computing and Mathematics, Deakin University, Geelong, Australia
