Massively Distributed Environments and Closed Itemset Mining: The DCIM Approach
Data analytics in general, and data mining primitives in particular, are a major source of bottlenecks in the operation of information systems. This is mainly due to their high complexity and intensive I/O, particularly in massively distributed environments. Moreover, an important application of data analytics is to discover key insights from the execution traces of information systems in order to improve their engineering. Mining closed frequent itemsets (CFI) is one such data mining technique, and it raises significant challenges. Compared with mining all frequent itemsets, it allows itemsets to be discovered more efficiently and yields a more compact result. However, discovering such itemsets in massively distributed data poses a number of issues that traditional methods do not address. One solution is to take advantage of parallel frameworks such as MapReduce. We address the problem of distributed CFI mining by introducing a new parallel algorithm, called DCIM, which uses a prime-number-based approach. A key feature of DCIM is the deep combination of data mining properties with the principles of massive data distribution. We carried out exhaustive experiments on real-world datasets of up to 53 million documents to illustrate the efficiency of DCIM.
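To make the notions of a closed frequent itemset and a prime-number encoding concrete, the following is a minimal single-machine sketch. It is not the DCIM algorithm, whose details are not given here. It assumes one common encoding: each item is mapped to a distinct prime, a transaction becomes the product of its items' primes, containment is a divisibility test, and the intersection of transactions is their greatest common divisor. All function names and the toy data are hypothetical, and the brute-force enumeration is only suitable for tiny inputs.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def assign_primes(items):
    """Map each item to a distinct prime (trial division, fine for toy sizes)."""
    primes, candidate = [], 2
    while len(primes) < len(items):
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return dict(zip(sorted(items), primes))


def encode(itemset, prime_of):
    """Encode an itemset as the product of its items' primes."""
    code = 1
    for item in itemset:
        code *= prime_of[item]
    return code


def decode(code, prime_of):
    """Recover the itemset: the items whose prime divides the code."""
    return frozenset(item for item, p in prime_of.items() if code % p == 0)


def closed_frequent_itemsets(transactions, min_support):
    """Return {closed itemset: support} by brute force (illustration only).

    The closure of an itemset X is the intersection of all transactions
    containing X. With square-free prime products, that intersection is
    the gcd of the supporting transaction codes.
    """
    items = set().union(*transactions)
    prime_of = assign_primes(items)
    encoded = [encode(t, prime_of) for t in transactions]

    closed = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            code = encode(combo, prime_of)
            # A transaction contains the itemset iff its code is divisible.
            supporting = [t for t in encoded if t % code == 0]
            if len(supporting) >= min_support:
                closure_code = reduce(gcd, supporting)
                closed[decode(closure_code, prime_of)] = len(supporting)
    return closed


# Hypothetical toy database of four transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
result = closed_frequent_itemsets(toy, min_support=2)
```

In this toy database, `{b}` is frequent but not closed, because every transaction containing `b` also contains `a`; its closure `{a, b}` has the same support. This is why the closed result is more compact than the full set of frequent itemsets.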
Keywords: Distributed information systems, Data analytics, Closed frequent itemsets
This work has been partially funded by the European Commission under the CloudDBAppliance project (grant 732051) and performed in the context of the Computational Biology Institute in Montpellier.