ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining
Due to the explosive growth of data in every aspect of our lives, data mining algorithms often suffer from scalability issues. One effective way to tackle this problem is to employ sampling techniques. This paper introduces ML-DS, a novel deterministic sampling algorithm for mining association rules in large datasets. Unlike most algorithms in the literature, which rely on randomness in sampling, our algorithm is fully deterministic. Sampling proceeds in stages, and the size of the sample in any stage is half that of the previous stage. In any given stage, the data is partitioned into disjoint groups of equal size. A distance measure is used to determine the importance of each group in identifying accurate association rules, and the groups are then sorted based on this measure. Only the best 50% of the groups move to the next stage. We perform as many stages of sampling as needed to produce a sample of the desired target size. The resulting sample is then used to identify association rules. Empirical results show that our approach outperforms simple randomized sampling in accuracy and is competitive with state-of-the-art sampling algorithms in terms of both time and accuracy.
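The staged halving procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the distance measure, so the sketch assumes the Euclidean distance between a group's item-frequency vector and that of the full dataset, with smaller distances treated as better. The names `staged_sample`, `group_distance`, `item_frequencies`, and the `group_size` parameter are hypothetical, and dropping transactions that do not fill a complete group is likewise an assumption.

```python
from collections import Counter
from math import sqrt


def item_frequencies(transactions, items):
    """Relative frequency of each item across the given transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return [counts[i] / n for i in items]


def group_distance(group, reference, items):
    """Assumed measure: Euclidean distance between the group's
    item-frequency vector and the reference vector."""
    freqs = item_frequencies(group, items)
    return sqrt(sum((a - b) ** 2 for a, b in zip(freqs, reference)))


def staged_sample(transactions, target_size, group_size=2):
    """Halve the sample stage by stage until it is no larger than target_size."""
    items = sorted({i for t in transactions for i in t})
    # Assumption: groups are scored against the full dataset's item frequencies.
    reference = item_frequencies(transactions, items)
    sample = list(transactions)
    while len(sample) > target_size:
        # Partition into disjoint groups of equal size; leftover
        # transactions that do not fill a group are dropped (assumption).
        usable = len(sample) - len(sample) % group_size
        groups = [sample[k:k + group_size] for k in range(0, usable, group_size)]
        if len(groups) < 2:
            break  # cannot halve any further
        # list.sort is stable, so ties keep their original order and the
        # whole procedure stays deterministic.
        groups.sort(key=lambda g: group_distance(g, reference, items))
        kept = groups[:len(groups) // 2]  # best 50% of the groups survive
        sample = [t for g in kept for t in g]
    return sample
```

For example, calling `staged_sample(data, target_size=4)` on eight transactions with `group_size=2` forms four groups, keeps two of them, and returns four transactions. Repeated calls on the same input return the same sample because no random choices are involved.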
Keywords: Association Rule, Minimum Support, Sampling Algorithm, Association Rule Mining, Disjoint Group