Mining sequential patterns: Generalizations and performance improvements
The problem of mining sequential patterns was recently introduced in . We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is “5% of customers bought ‘Foundation’ and ‘Ringworld’ in one transaction, followed by ‘Second Foundation’ in a later transaction”. We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy.
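The definitions above can be made concrete with a small sketch of support counting. This is only an illustration of the problem statement, not the GSP algorithm: it checks each data-sequence by brute-force search. The names `contains`, `support`, `min_gap`, and `max_gap` are hypothetical, as is the choice to treat the minimum gap as strict and the maximum gap as inclusive, and the assumption that transaction-times within a data-sequence are strictly increasing integers. The sliding time window and the taxonomy are not handled here.

```python
from typing import List, Optional, Set, Tuple

Transaction = Tuple[int, Set[str]]   # (transaction-time, set of items)
DataSequence = List[Transaction]     # assumed sorted by strictly increasing time
Pattern = List[Set[str]]             # ordered list of elements (itemsets)


def contains(seq: DataSequence, pattern: Pattern,
             min_gap: int = 0, max_gap: Optional[int] = None) -> bool:
    """Return True if `seq` contains `pattern`.

    Each element of the pattern must be a subset of a single transaction,
    and consecutive elements must map to transactions whose time difference
    is > min_gap and (if max_gap is given) <= max_gap. These boundary
    conventions are an assumption made for this sketch.
    """
    def match(elem_idx: int, start: int, prev_time: Optional[int]) -> bool:
        if elem_idx == len(pattern):
            return True  # every element has been placed
        for pos in range(start, len(seq)):
            time, items = seq[pos]
            if prev_time is not None:
                gap = time - prev_time
                if gap <= min_gap:
                    continue  # too close to the previous element
                if max_gap is not None and gap > max_gap:
                    break     # later transactions are even further away
            # Try this transaction; backtrack if the rest cannot be placed.
            if pattern[elem_idx] <= items and match(elem_idx + 1, pos + 1, time):
                return True
        return False

    return match(0, 0, None)


def support(database: List[DataSequence], pattern: Pattern,
            min_gap: int = 0, max_gap: Optional[int] = None) -> int:
    """Number of data-sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern, min_gap, max_gap) for seq in database)
```

A pattern such as `[{"Foundation", "Ringworld"}, {"Second Foundation"}]` would then be reported as frequent when `support(database, pattern)` divided by `len(database)` reaches the user-specified minimum support. The backtracking in `match` matters once a maximum gap is imposed, because the earliest transaction matching an element is not always the one that allows the next element to fit.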
We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in . GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
Keywords: Hash Function, Association Rule, Sequential Pattern, Minimum Support, Frequent Itemsets
- 1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207–216, Washington, D.C., May 1993.
- 2. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
- 3. R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan, March 1995.
- 4. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.
- 5. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.
- 6. R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.
- 7. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. Research Report RJ 9994, IBM Almaden Research Center, San Jose, California, December 1995.
- 8. J. T.-L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial pattern discovery for scientific data: Some preliminary results. In Proc. of the ACM SIGMOD Conference on Management of Data, Minneapolis, May 1994.