An Efficient Parallel and Distributed Algorithm for Counting Frequent Sets
Due to the huge increase in the number and dimension of available databases, efficient solutions for counting frequent sets are nowadays very important within the Data Mining community. Several sequential and parallel algorithms were proposed, which in many cases exhibit excellent scalability. In this paper we present ParDCI, a distributed and multithreaded algorithm for counting the occurrences of frequent sets within transactional databases. ParDCI is a parallel version of DCI (Direct Count & Intersect), a multi-strategy algorithm which is able to adapt its behavior not only to the features of the specific computing platform (e.g. available memory), but also to the features of the dataset being processed (e.g. sparse or dense datasets). ParDCI enhances previous proposals by exploiting the highly optimized counting and intersection techniques of DCI, and by relying on a multi-level parallelization approachwh ichex plicitly targets clusters of SMPs, an emerging computing platform. We focused our work on the efficient exploitation of the underlying architecture. Intra-Node multithreading effectively exploits the memory hierarchies of each SMP node, while Inter-Node parallelism exploits smart partitioning techniques aimed at reducing communication overheads. In depth experimental evaluations demonstrate that ParDCI reaches nearly optimal performances under a variety of conditions.
Unable to display preview. Download preview PDF.
- R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules in Large Databases. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996. 422Google Scholar
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th VLDB Conf., pages 487–499, 1994. 422Google Scholar
- R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations. In Proc. of the 3rd Work. on High Performance Data Mining, (IPDPS-2000), Cancun, Mexico, pages 350–357. LNCS 1800 Spinger-Verlag, 2000. 426Google Scholar
- R. J. Bayardo. Efficiently Mining Long Patterns from Databases. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 85–93, 1998. 422Google Scholar
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smith, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press, 1998. 421Google Scholar
- V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining Very Large Databases. IEEE Computer, 32(8):38–45, 1999. 421Google Scholar
- J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000. 426, 433Google Scholar
- S. Orlando, P. Palmerini, and R. Perego. Enhancing the Apriori Algorithm for Frequent Set Counting. In Proc. of the 3rd Int. Conf. on Data Warehousing and Knowledge Discovery, LNCS 2114, pages 71–82, Germany, 2001. 423, 425, 426Google Scholar
- S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and Resource-Aware Mining of Frequent Sets. In Proc. of the 2002 IEEE Int. Conference on Data Mining (ICDM’02), Maebashi City, Japan, Dec. 2002. 423, 424, 428, 430, 433Google Scholar
- J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data, pages 175–186, 1995. 422Google Scholar
- A. Savasere, E. Omiecinski, and S. B. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proc. of the 21th VLDB Conf., pages 432–444, Zurich, Switzerland, 1995. 422, 423Google Scholar