Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?

Ng, Willie; Dash, Manoranjan

doi:10.1007/978-3-642-03730-6_13


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5691)

Included in the following conference series:

International Conference on Data Warehousing and Knowledge Discovery


Abstract

We investigate the problem of finding frequent patterns in a continuous stream of transactions. In the literature, two prominent approaches are often used: (a) perform approximate counting (e.g., the lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002) using a lower support threshold than the one given by the user, or (b) maintain a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generate frequent itemsets from the sample on demand. Both approaches have their advantages and disadvantages. For instance, LCA is known to output all frequent itemsets (recall = 1), but it also outputs many false frequent itemsets (low precision). Sampling is fast, but it outputs a large number of false itemsets as frequent itemsets, particularly when the sample size is not large. Although both approaches are known to be practically useful, to the best of our knowledge there has been no comparison between the two. In addition, we propose a novel sampling algorithm (DSS). DSS selects transactions to be included in the sample based on a histogram of single itemsets. An empirical comparison of the three algorithms is performed using synthetic and benchmark datasets. Results show that DSS is consistently more accurate than LCA and Algo-Z, whereas LCA performs consistently better than Algo-Z. Furthermore, although DSS requires more time than Algo-Z, it is faster than LCA.
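
To make the two baseline approaches concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It makes several simplifying assumptions: lossy counting is shown for single items rather than itemsets, the reservoir uses the basic replacement rule (often called Algorithm R) instead of Vitter's faster Algorithm Z, and the function names `lossy_count`, `frequent_items` and `reservoir_sample` are hypothetical. DSS is not sketched because the abstract does not give enough detail about it.

```python
import math
import random


def lossy_count(stream, epsilon):
    """Lossy counting over single items (simplified from itemsets).

    Returns (entries, n), where entries maps item -> [count, max_error]
    and n is the number of items seen.
    """
    width = math.ceil(1 / epsilon)       # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)    # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # A new entry may have been missed in up to bucket - 1 earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            entries = {k: v for k, v in entries.items() if v[0] + v[1] > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Report items using the lowered threshold (support - epsilon) * n."""
    return {k for k, (count, _) in entries.items()
            if count >= (support - epsilon) * n}


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k elements (basic Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)             # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = x            # replace a random slot
    return sample
```

The lowered threshold in `frequent_items` is what gives lossy counting its recall of 1 at the cost of false positives: any item whose true frequency is at least `support * n` is reported, but so may items whose frequency lies between `(support - epsilon) * n` and `support * n`. The reservoir, by contrast, keeps memory fixed at `k` elements, and its accuracy depends on how large `k` is.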


References

Goethals, B.: Survey on frequent pattern mining (manuscript) (2003)
Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: VLDB, pp. 346–357 (2002)
Misra, J., Gries, D.: Finding repeated elements. Science of Computer Programming 2(2), 143–152 (1982)
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1), 51–55 (2003)
Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: Perner, P. (ed.) ICDM 2007. LNCS, vol. 4597, pp. 83–92. Springer, Heidelberg (2007)
Cheng, J., Ke, Y., Ng, W.: A survey on algorithms for mining frequent itemsets over data streams. An International Journal of Knowledge and Information Systems (2007)
Cheng, J., Ke, Y., Ng, W.: Maintaining frequent itemsets over high-speed data streams. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 462–467 (2006)
Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.: Mining frequent patterns in data streams at multiple time granularities. In: Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y. (eds.) Next Generation Data Mining, pp. 191–212. AAAI/MIT (2003)
Ng, W., Dash, M.: Efficient approximate mining of frequent patterns over transactional data streams. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 241–250. Springer, Heidelberg (2008)
Toivonen, H.: Sampling large databases for association rules. In: VLDB 1996: Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134–145 (1996)
Mannila, H., Toivonen, H., Verkamo, A.I.: Efficient algorithms for discovering association rules. In: Fayyad, U.M., Uthurusamy, R. (eds.) AAAI Workshop on Knowledge Discovery in Databases (KDD 1994), pp. 181–192 (1994)
Zaki, M., Parthasarathy, S., Li, W., Ogihara, M.: Evaluation of sampling for data mining of association rules. In: Seventh International Workshop on Research Issues in Data Engineering, RIDE 1997 (1996)
Yu, X., Chong, Z., Lu, H., Zhou, A.: False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In: Int. Conf. on VLDB (2004)
Zheng, Z., Kohavi, R., Mason, L.: Real world performance of association rule algorithms. In: ACM SIGKDD (2001)
Vitter, J.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 37–57 (1985)
Chen, B., Haas, P.J., Scheuermann, P.: A new two-phase sampling based algorithm for discovering association rules. In: KDD, pp. 462–468 (2002)
Bronnimann, H., Chen, B., Dash, M., Haas, P., Scheuermann, P.: Efficient data reduction with ease. In: Proceedings of ACM SIGKDD International Conference in Knowledge Discovery and Data Mining, pp. 59–68 (2003)
Chuang, K.-T., Chen, M.-S., Yang, W.-C.: Progressive sampling for association rules based on sampling error estimation. In: Ho, T.-B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS, vol. 3518, pp. 505–515. Springer, Heidelberg (2005)
Kubica, J., Moore, A.: Probabilistic noise identification and data cleaning. In: Proceedings of International Conference on Data Mining, ICDM (2003)
Zhu, X., Wu, X., Khoshgoftaar, T.M., Shi, Y.: Empirical study of the noise impact on cost-sensitive learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2007)
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. of the 20th VLDB Conf. (1994)
Bodon, F.: A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, FIMI 2003 (2003)


Author information

Authors and Affiliations

Centre for Advanced Information Systems, Nanyang Technological University, Singapore, 639798
Willie Ng & Manoranjan Dash


Editor information

Editors and Affiliations

Department of Computer Science, Aalborg University, Selma Lagerlöfsvej 300, 9220, Aalborg Ø, Denmark
Torben Bach Pedersen
IBM India Research Lab, Plot No. 4, Block C, Institutional Area, Vasant Kunj, 110 070, New Delhi, India
Mukesh K. Mohania
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, 1040, Wien, Austria
A Min Tjoa


About this paper

Cite this paper

Ng, W., Dash, M. (2009). Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2009. Lecture Notes in Computer Science, vol 5691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03730-6_13


DOI: https://doi.org/10.1007/978-3-642-03730-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03729-0
Online ISBN: 978-3-642-03730-6
eBook Packages: Computer Science, Computer Science (R0)
