Catch the moment: maintaining closed frequent itemsets over a data stream sliding window

Abstract

This paper considers the problem of mining closed frequent itemsets over a data stream sliding window using limited memory space. We design a synopsis data structure that monitors the transactions in the sliding window so that the current closed frequent itemsets can be output at any time. Because of time and memory constraints, the synopsis data structure cannot monitor all possible itemsets. Monitoring only the frequent itemsets, however, would make it impossible to detect new itemsets when they become frequent. We therefore introduce a compact data structure, the closed enumeration tree (CET), which maintains a dynamically selected set of itemsets over the sliding window. The selected itemsets contain a boundary between the closed frequent itemsets and the rest of the itemsets. Concept drifts in a data stream are reflected by boundary movements in the CET; in other words, a status change of any itemset (e.g., from non-frequent to frequent) must occur through the boundary. Because the boundary is relatively stable, the cost of mining closed frequent itemsets over a sliding window is dramatically reduced to that of mining only the transactions that can possibly cause boundary movements in the CET. Our experiments show that our algorithm performs much better than representative state-of-the-art algorithms.
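To make the problem statement concrete, the following Python sketch recomputes the closed frequent itemsets from scratch each time the window slides. It is a naive baseline for illustration only, not the CET algorithm described in the abstract. The function names, the toy stream, the window size of 4, and the minimum support of 2 are all assumptions chosen for the example.

```python
from collections import deque
from itertools import combinations


def closed_frequent_itemsets(window, min_support):
    """Return {itemset: support} for the closed frequent itemsets in `window`.

    Brute force (hypothetical helper, not from the paper): every subset of
    every transaction is counted, so this only works for tiny examples.
    """
    support = {}
    for transaction in window:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                support[key] = support.get(key, 0) + 1

    frequent = {s: c for s, c in support.items() if c >= min_support}
    # An itemset is closed if no proper superset has the same support.
    # Any such superset would itself be frequent, so checking `frequent` suffices.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < t and c == frequent[t] for t in frequent)
    }


def slide(stream, window_size, min_support):
    """Yield the closed frequent itemsets for each full sliding window."""
    window = deque(maxlen=window_size)  # the oldest transaction drops out automatically
    for transaction in stream:
        window.append(frozenset(transaction))
        if len(window) == window_size:
            yield closed_frequent_itemsets(window, min_support)


if __name__ == "__main__":
    # Toy stream: each string is one transaction, each character one item.
    toy_stream = ["CD", "AB", "ABC", "ABC", "ACD"]
    for result in slide(toy_stream, window_size=4, min_support=2):
        print({"".join(sorted(s)): c for s, c in result.items()})
```

Each slide here costs a full recount of the window. The approach in the abstract avoids this by restricting work to the transactions that can move the boundary in the CET.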

Author information

Corresponding author

Correspondence to Yun Chi.

Additional information

Yun Chi is currently a Ph.D. student at the Department of Computer Science, UCLA. His main areas of research include database systems, data mining, and bioinformatics. For data mining, he is interested in mining labeled trees and graphs, mining data streams, and mining data with uncertainty.

Haixun Wang is currently a research staff member at IBM T. J. Watson Research Center. He received the B.S. and M.S. degrees, both in computer science, from Shanghai Jiao Tong University in 1994 and 1996. He received the Ph.D. degree in computer science from the University of California, Los Angeles in 2000. He has published more than 60 research papers in refereed international journals and conference proceedings. He is a member of the ACM, the ACM SIGMOD, the ACM SIGKDD, and the IEEE Computer Society. He has served on program committees of international conferences and workshops, and has been a reviewer for some leading academic journals in the database field.

Philip S. Yu received the B.S. degree in electrical engineering from National Taiwan University, the M.S. and Ph.D. degrees in electrical engineering from Stanford University, and the M.B.A. degree from New York University. He is with the IBM Thomas J. Watson Research Center and is currently manager of the Software Tools and Techniques group. His research interests include data mining, Internet applications and technologies, database systems, multimedia systems, parallel and distributed processing, and performance modeling. Dr. Yu has published more than 430 papers in refereed journals and conferences. He holds or has applied for more than 250 US patents.

Dr. Yu is a Fellow of the ACM and a Fellow of the IEEE. He is an associate editor of ACM Transactions on the Internet Technology and ACM Transactions on Knowledge Discovery in Data. He is a member of the IEEE Data Engineering steering committee and is also on the steering committee of IEEE Conference on Data Mining. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001–2004), an editor, advisory board member and also a guest co-editor of the special issue on mining of databases. He had also served as an associate editor of Knowledge and Information Systems. In addition to serving as program committee member for various conferences, he will be serving as the general chairman of 2006 ACM Conference on Information and Knowledge Management and the program chairman of the 2006 joint conferences of the 8th IEEE Conference on E-Commerce Technology (CEC'06) and the 3rd IEEE Conference on Enterprise Computing, E-Commerce and E-Services (EEE'06). He was the program chairman or co-chairman of the 11th IEEE International Conference on Data Engineering, the 6th Pacific Area Conference on Knowledge Discovery and Data Mining, the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, the 2nd IEEE International Workshop on Research Issues on Data Engineering: Transaction and Query Processing, the PAKDD Workshop on Knowledge Discovery from Advanced Databases, and the 2nd IEEE International Workshop on Advanced Issues of E-Commerce and Web-based Information Systems. He served as the general chairman of the 14th IEEE International Conference on Data Engineering and the general co-chairman of the 2nd IEEE International Conference on Data Mining. He has received several IBM honors including 2 IBM Outstanding Innovation Awards, an Outstanding Technical Achievement Award, 2 Research Division Awards and the 84th plateau of Invention Achievement Awards. He received an Outstanding Contributions Award from IEEE International Conference on Data Mining in 2003 and also an IEEE Region 1 Award for “promoting and perpetuating numerous new electrical engineering concepts” in 1999. Dr. Yu is an IBM Master Inventor.

Richard R. Muntz is a Professor and past chairman of the Computer Science Department, School of Engineering and Applied Science, UCLA. His current research interests are sensor-rich environments, multimedia storage servers and database systems, distributed and parallel database systems, spatial and scientific database systems, data mining, and computer performance evaluation. He is the author of over one hundred and fifty research papers.

Dr. Muntz received the BEE from Pratt Institute in 1963, the MEE from New York University in 1966, and the Ph.D. in Electrical Engineering from Princeton University in 1969. He is a member of the Board of Directors for SIGMETRICS and past chairman of IFIP WG7.3 on performance evaluation. He was a member of the Corporate Technology Advisory Board at NCR/Teradata, a member of the Science Advisory Board of NASA's Center of Excellence in Space Data Information Systems, and a member of the Goddard Space Flight Center Visiting Committee on Information Technology. He recently chaired a National Research Council study on “The Intersection of Geospatial Information and IT” which was published in 2003. He was an associate editor for the Journal of the ACM from 1975 to 1980 and the Editor-in-Chief of ACM Computing Surveys from 1992 to 1995. He is a Fellow of the ACM and a Fellow of the IEEE.

Rights and permissions

Open Access: This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Chi, Y., Wang, H., Yu, P.S. et al. Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl Inf Syst 10, 265–294 (2006). https://doi.org/10.1007/s10115-006-0003-0

Keywords

  • Data streams
  • Sliding window
  • Closed frequent itemset
  • Incremental learning