Discovery of Frequent Patterns in Transactional Data Streams

  • Willie Ng
  • Manoranjan Dash
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6380)

Abstract

A data stream is generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. There is increasing demand for novel techniques that can discover interesting patterns in data streams while working within system resource constraints. In this paper, we give an overview of state-of-the-art techniques for mining frequent patterns in a continuous stream of transactions. Two approaches are prominent in the literature: (a) perform approximate counting (e.g., the lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002) using a lower support threshold than the one given by the user, or (b) maintain a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generate frequent patterns from the sample on demand. Although both approaches are practically useful, to the best of our knowledge there has been no comparison between them. We also introduce a novel sampling algorithm (DSS), which selects the transactions to include in the sample based on a histogram of single itemsets. An empirical comparison of the three algorithms is performed using synthetic and benchmark datasets. Results show that DSS is consistently more accurate than LCA and Algo-Z, and that LCA consistently outperforms Algo-Z. Furthermore, although DSS requires more time than Algo-Z, it is faster than LCA.
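
The two baseline approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation and rests on simplifying assumptions: lossy counting is shown for single items rather than itemsets, and the sampler is the basic reservoir scheme (often called Algorithm R), not Vitter's faster Algo-Z, which draws from the same uniform distribution but skips over records. The function names `lossy_count`, `frequent_items`, and `reservoir_sample` are hypothetical.

```python
import random
from math import ceil


def lossy_count(stream, epsilon):
    """Lossy counting over single items (simplified; not itemsets).

    Returns a dict mapping item -> [count, delta], where delta bounds
    how many occurrences may have been missed before tracking began.
    """
    width = ceil(1 / epsilon)              # bucket width
    entries = {}
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)           # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if n % width == 0:                 # bucket boundary: prune rare items
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
    return entries


def frequent_items(entries, n, support, epsilon):
    """Report items whose count reaches the lowered threshold (s - eps) * n."""
    return {k for k, (count, _) in entries.items()
            if count >= (support - epsilon) * n}


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k records from a stream (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(n)           # uniform integer in [0, n)
            if j < k:                      # happens with probability k / n
                sample[j] = item
    return sample


if __name__ == "__main__":
    data = ["a"] * 60 + ["b"] * 30 + ["x%d" % i for i in range(10)]
    counts = lossy_count(data, epsilon=0.1)
    print(frequent_items(counts, len(data), support=0.5, epsilon=0.1))
    print(reservoir_sample(data, k=5, rng=random.Random(0)))
```

The sketch shows the trade-off the abstract describes: lossy counting answers with a lowered threshold `(support - epsilon) * n` to avoid missing truly frequent items, whereas the reservoir defers all counting until a query arrives and then mines only the sample.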

Keywords

  • Data Stream
  • Association Rule
  • Test Point
  • Frequent Pattern
  • Frequent Itemsets
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. [Agg06]
    Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 607–618 (2006)
  2. [Agg07]
    Aggarwal, C.C.: Data Streams: Models and Algorithms. Springer, Heidelberg (2007)
  3. [AGP00]
    Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 487–498 (2000)
  4. [AS94]
    Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, pp. 487–499 (1994)
  5. [AY06]
    Aggarwal, C.C., Yu, P.: A survey of synopsis construction in data streams. Data Streams: Models and Algorithms, 169–208 (2006)
  6. [BCD+03]
    Bronnimann, H., Chen, B., Dash, M., Haas, P., Scheuermann, P.: Efficient data reduction with ease. In: Proceedings of the Ninth ACM SIGKDD International Conference in Knowledge Discovery and Data Mining, pp. 59–68 (2003)
  7. [BDF+97]
    Barbará, D., Dumouchel, W., Faloutsos, C., Haas, P.J., Hellerstein, J.M., Ioannidis, Y., Jagadish, H.V., Johnson, T., Ng, R., Poosala, V., Ross, K.A., Sevcik, K.C.: The New Jersey data reduction report. IEEE Data Engineering Bulletin 20, 3–45 (1997)
  8. [BDM02]
    Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of 13th Annual ACM-SIAM Symposium on Discrete Algorithms (2002)
  9. [Bod03]
    Bodon, F.: A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, FIMI’03 (2003)
  10. [CDG07]
    Calders, T., Dexters, N., Goethals, B.: Mining frequent itemsets in a stream. In: IEEE International Conference on Data Mining (ICDM’07), pp. 83–92 (2007)
  11. [CDH+02]
    Chen, Y., Dong, G., Han, J., Wah, B.W., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of 28th International Conference on Very Large Data Bases, pp. 323–334 (2002)
  12. [CHS02]
    Chen, B., Haas, P.J., Scheuermann, P.: A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 462–468 (2002)
  13. [CKN06]
    Cheng, J., Ke, Y., Ng, W.: Maintaining frequent itemsets over high-speed data streams. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 462–467 (2006)
  14. [CKN08]
    Cheng, J., Ke, Y., Ng, W.: A survey on algorithms for mining frequent itemsets over data streams. An International Journal of Knowledge and Information Systems (2008)
  15. [CL03a]
    Chang, J.H., Lee, W.S.: estWin: adaptively monitoring the recent change of frequent itemsets over online data streams. In: CIKM, pp. 536–539 (2003)
  16. [CL03b]
    Chang, J.H., Lee, W.S.: Finding recent frequent itemsets adaptively over online data streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 487–492 (2003)
  17. [CL04]
    Chang, J.H., Lee, W.S.: A sliding window method for finding recently frequent itemsets over online data streams. Journal of Information Science and Engineering 20(4), 753–762 (2004)
  18. [CS03]
    Cohen, E., Strauss, M.: Maintaining time-decaying stream aggregates. In: Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 223–233 (2003)
  19. [CWYM04]
    Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Moment: Maintaining closed frequent itemsets over a stream sliding window. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), pp. 59–66 (2004)
  20. [DH00]
    Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of ACM SIGKDD International Conference in Knowledge Discovery and Data Mining, pp. 71–80 (2000)
  21. [DN06]
    Dash, M., Ng, W.: Efficient reservoir sampling for transactional data streams. In: IEEE ICDM workshop on Mining Evolving and Streaming Data, pp. 662–666 (2006)
  22. [FMR62]
    Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association (1962)
  23. [GHP+03]
    Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.S.: Mining frequent patterns in data streams at multiple time granularities. In: Next Generation Data Mining. AAAI/MIT (2003)
  24. [HCXY07]
    Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55–86 (2007)
  25. [HK06a]
    Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann Publishers, San Francisco (2006)
  26. [HK06b]
    Hwang, W., Kim, D.: Improved association rule mining by modified trimming. In: Proceedings of Sixth IEEE International Conference on Computer and Information Technology, CIT (2006)
  27. [HPY00]
    Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: 2000 ACM SIGMOD Intl. Conference on Management of Data, pp. 1–12 (2000)
  28. [HR90]
    Hagerup, T., Rüb, C.: A guided tour of Chernoff bounds. Information Processing Letters, 305–308 (1990)
  29. [JG06]
    Jiang, N., Gruenwald, L.: Research issues in data stream association rule mining. SIGMOD Record 35, 14–19 (2006)
  30. [JMR05]
    Johnson, T., Muthukrishnan, S., Rozenbaum, I.: Sampling algorithms in a stream operator. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2005)
  31. [KK06]
    Kotsiantis, S., Kanellopoulos, D.: Association rules mining: A recent overview. GESTS International Transactions on Computer Science and Engineering 32(1), 71–82 (2006)
  32. [KM03]
    Kubica, J.M., Moore, A.: Probabilistic noise identification and data cleaning. In: Proceedings of International Conference on Data Mining (ICDM), pp. 131–138 (2003)
  33. [KSP03]
    Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1), 51–55 (2003)
  34. [LCK98]
    Lee, S.D., Cheung, D.W.-L., Kao, B.: Is sampling useful in data mining? A case in the maintenance of discovered association rules. Data Mining and Knowledge Discovery 2(3), 233–262 (1998)
  35. [Li94]
    Li, K.-H.: Reservoir sampling algorithms of time complexity O(n(1 + log(N/n))). ACM Transactions on Mathematical Software 20(4), 481–493 (1994)
  36. [LLS04]
    Li, H.F., Lee, S.Y., Shan, M.K.: An efficient algorithm for mining frequent itemsets over the entire history of data streams. In: Proc. of First International Workshop on Knowledge Discovery in Data Streams (2004)
  37. [LLS05]
    Li, H.F., Lee, S.Y., Shan, M.K.: Online mining (recently) maximal frequent itemsets over data streams. In: RIDE, pp. 11–18 (2005)
  38. [MG82]
    Misra, J., Gries, D.: Finding repeated elements. Science of Computer Programming 2(2), 143–152 (1982)
  39. [MM02]
    Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of 28th International Conference on Very Large Data Bases, pp. 346–357 (2002)
  40. [MTV94]
    Mannila, H., Toivonen, H., Inkeri Verkamo, A.: Efficient algorithms for discovering association rules. In: Fayyad, U.M., Uthurusamy, R. (eds.) AAAI Workshop on Knowledge Discovery in Databases (KDD-94), pp. 181–192 (1994)
  41. [ND06]
    Ng, W., Dash, M.: An evaluation of progressive sampling for imbalanced data sets. In: IEEE ICDM Workshop on Mining Evolving and Streaming Data, pp. 657–661 (2006)
  42. [ND08]
    Ng, W., Dash, M.: Efficient approximate mining of frequent patterns over transactional data streams. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 241–250. Springer, Heidelberg (2008)
  43. [OR95]
    Olken, F., Rotem, D.: Random sampling from databases - a survey. Statistics and Computing 5, 25–42 (1995)
  44. [Par02]
    Parthasarathy, S.: Efficient progressive sampling for association rules. In: IEEE International Conference on Data Mining (ICDM’02), pp. 354–361 (2002)
  45. [PJO99]
    Provost, F.J., Jensen, D., Oates, T.: Efficient progressive sampling. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23–32 (1999)
  46. [POSG04]
    Park, B.-H., Ostrouchov, G., Samatova, N.F., Geist, A.: Reservoir-based random sampling with replacement from data stream. In: Proceedings of the SIAM International Conference on Data Mining (SDM’04), pp. 492–501 (2004)
  47. [Sha02]
    Zhu, Y., Shasha, D.: StatStream: Statistical monitoring of thousands of data streams in real time. In: Proceedings of 28th International Conference on Very Large Data Bases, pp. 358–369 (2002)
  48. [Toi96]
    Toivonen, H.: Sampling large databases for association rules. In: VLDB ’96: Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134–145 (1996)
  49. [Vit85]
    Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 37–57 (1985)
  50. [YCLZ04]
    Yu, J.X., Chong, Z., Lu, H., Zhou, A.: False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases (2004)
  51. [YSJ+00]
    Yi, B.-K., Sidiropoulos, N., Johnson, T., Jagadish, H.V., Faloutsos, C., Biliris, A.: Online mining for co-evolving time sequences. In: Proceedings of the 16th International Conference on Data Engineering, pp. 13–22 (2000)
  52. [ZPLO96]
    Zaki, M.J., Parthasarathy, S., Li, W., Ogihara, M.: Evaluation of sampling for data mining of association rules. In: Seventh International Workshop on Research Issues in Data Engineering, RIDE’97 (1996)
  53. [ZWKS07]
    Zhu, X., Wu, X., Khoshgoftaar, T., Shi, Y.: Empirical study of the noise impact on cost-sensitive learning. In: Proceedings of International Joint Conference on Artificial Intelligence, IJCAI (2007)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Willie Ng (1)
  • Manoranjan Dash (1)
  1. Centre for Advanced Information Systems, Nanyang Technological University, Singapore