Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5691)

Abstract

We investigate the problem of finding frequent patterns in a continuous stream of transactions. Two approaches are prominent in the literature: (a) approximate counting (e.g., the lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002), which uses a lower support threshold than the one given by the user, and (b) maintaining a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generating frequent itemsets from the sample on demand. Both approaches have advantages and disadvantages. For instance, LCA is known to output all frequent itemsets (recall = 1), but it also outputs many false frequent itemsets (low precision). Sampling is fast, but it reports a large number of false itemsets as frequent, particularly when the sample size is not large. Although both approaches are known to be practically useful, to the best of our knowledge there has been no comparison between the two. In addition, we propose a novel sampling algorithm (DSS), which selects transactions to be included in the sample based on a histogram of single itemsets. An empirical comparison of the three algorithms is performed using synthetic and benchmark datasets. Results show that DSS is consistently more accurate than LCA and Algo-Z, whereas LCA performs consistently better than Algo-Z. Furthermore, although DSS requires more time than Algo-Z, it is faster than LCA.
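
As a rough illustration of the two baseline approaches named in the abstract, the following Python sketch shows lossy counting for single items and a basic reservoir sampler. It is not the authors' implementation: it counts individual items rather than itemsets, it uses the simple Algorithm R replacement rule instead of Vitter's faster Algo-Z, and it does not attempt DSS, which the abstract does not specify in detail. All function names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def lossy_count(stream, epsilon):
    """Lossy counting for single items (simplified, after Manku and Motwani).

    Returns (counts, n), where counts maps item -> [count, max_undercount].
    """
    width = math.ceil(1 / epsilon)  # bucket width
    counts = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            counts[item] = [1, bucket - 1]
        if n % width == 0:  # bucket boundary: prune low-count entries
            stale = [k for k, (c, d) in counts.items() if c + d <= bucket]
            for key in stale:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Report items whose stored count reaches the lowered threshold."""
    return {k for k, (c, _) in counts.items() if c >= (support - epsilon) * n}


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k elements (Algorithm R, not Algo-Z)."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x
    return sample


def precision_recall(reported, truth):
    """Precision and recall of a reported set against the true frequent set."""
    hits = len(reported & truth)
    precision = hits / len(reported) if reported else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy stream: "a" and "b" are frequent at support 0.25.
    data = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
    random.Random(0).shuffle(data)
    counts, n = lossy_count(data, epsilon=0.05)
    print(frequent_items(counts, n, support=0.25, epsilon=0.05))
    print(reservoir_sample(data, k=20, rng=random.Random(1)))
```

In this sketch a stored count never exceeds the true count and falls short of it by at most epsilon * n, so every truly frequent item passes the lowered threshold; this is the recall = 1 property mentioned above. The cost is that items whose true frequency lies between (support - epsilon) and support can also be reported, which lowers precision.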

References

  1. Goethals, B.: Survey on frequent pattern mining (manuscript) (2003)

  2. Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: VLDB, pp. 346–357 (2002)

  3. Misra, J., Gries, D.: Finding repeated elements. Science of Computer Programming 2(2), 143–152 (1982)

  4. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1), 51–55 (2003)

  5. Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: Perner, P. (ed.) ICDM 2007. LNCS, vol. 4597, pp. 83–92. Springer, Heidelberg (2007)

  6. Cheng, J., Ke, Y., Ng, W.: A survey on algorithms for mining frequent itemsets over data streams. An International Journal of Knowledge and Information Systems (2007)

  7. Cheng, J., Ke, Y., Ng, W.: Maintaining frequent itemsets over high-speed data streams. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 462–467 (2006)

  8. Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.: Mining frequent patterns in data streams at multiple time granularities. In: Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y. (eds.) Next Generation Data Mining, pp. 191–212. AAAI/MIT (2003)

  9. Ng, W., Dash, M.: Efficient approximate mining of frequent patterns over transactional data streams. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 241–250. Springer, Heidelberg (2008)

  10. Toivonen, H.: Sampling large databases for association rules. In: VLDB 1996: Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134–145 (1996)

  11. Mannila, H., Toivonen, H., Verkamo, A.I.: Efficient algorithms for discovering association rules. In: Fayyad, U.M., Uthurusamy, R. (eds.) AAAI Workshop on Knowledge Discovery in Databases (KDD 1994), pp. 181–192 (1994)

  12. Zaki, M., Parthasarathy, S., Li, W., Ogihara, M.: Evaluation of sampling for data mining of association rules. In: Seventh International Workshop on Research Issues in Data Engineering, RIDE 1997 (1996)

  13. Yu, X., Chong, Z., Lu, H., Zhou, A.: False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In: Int. Conf. on VLDB (2004)

  14. Zheng, Z., Kohavi, R., Mason, L.: Real world performance of association rule algorithms. In: ACM SIGKDD (2001)

  15. Vitter, J.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 37–57 (1985)

  16. Chen, B., Haas, P.J., Scheuermann, P.: A new two-phase sampling based algorithm for discovering association rules. In: KDD, pp. 462–468 (2002)

  17. Bronnimann, H., Chen, B., Dash, M., Haas, P., Scheuermann, P.: Efficient data reduction with ease. In: Proceedings of ACM SIGKDD International Conference in Knowledge Discovery and Data Mining, pp. 59–68 (2003)

  18. Chuang, K.-T., Chen, M.-S., Yang, W.-C.: Progressive sampling for association rules based on sampling error estimation. In: Ho, T.-B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS, vol. 3518, pp. 505–515. Springer, Heidelberg (2005)

  19. Kubica, J., Moore, A.: Probabilistic noise identification and data cleaning. In: Proceedings of International Conference on Data Mining, ICDM (2003)

  20. Zhu, X., Wu, X., Khoshgoftaar, T.M., Shi, Y.: Empirical study of the noise impact on cost-sensitive learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2007)

  21. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. of the 20th VLDB Conf. (1994)

  22. Bodon, F.: A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, FIMI 2003 (2003)

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ng, W., Dash, M. (2009). Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2009. Lecture Notes in Computer Science, vol 5691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03730-6_13

  • DOI: https://doi.org/10.1007/978-3-642-03730-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03729-0

  • Online ISBN: 978-3-642-03730-6

  • eBook Packages: Computer Science, Computer Science (R0)
