Massively Distributed Environments and Closed Itemset Mining: The DCIM Approach

  • Mehdi Zitouni
  • Reza Akbarinia
  • Sadok Ben Yahia
  • Florent Masseglia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10253)

Abstract

Data analytics in general, and data mining primitives in particular, are a major source of bottlenecks in the operation of information systems. This is mainly due to their high complexity and intensive I/O operations, particularly in massively distributed environments. Moreover, an important application of data analytics is to discover key insights from the running traces of information systems in order to improve their engineering. Mining closed frequent itemsets (CFI) is one of these data mining techniques, and it comes with significant challenges. It discovers itemsets more efficiently and yields a more compact result. However, discovering such itemsets in massively distributed data poses a number of issues that traditional methods do not address. One way to deal with these issues is to take advantage of parallel frameworks such as MapReduce. We address the problem of distributed CFI mining by introducing a new parallel algorithm, called DCIM, which uses a prime-number-based approach. A key feature of DCIM is the deep combination of data mining properties with the principles of massive data distribution. We carried out exhaustive experiments to illustrate the efficiency of DCIM on large real-world datasets with up to 53 million documents.
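
The abstract states only that DCIM relies on a prime-number-based approach and does not give its details. As a rough illustration of the general idea, the sketch below shows one common way primes can support closed itemset mining: each item is mapped to a distinct prime, a transaction becomes the product of its items' primes, containment becomes a divisibility test, and the closure of an itemset is the greatest common divisor of the transactions that contain it. Everything here is an assumption made for illustration: the toy data, the function names, and the brute-force enumeration are hypothetical, the code runs on a single machine without MapReduce, and it should not be read as the DCIM algorithm itself.

```python
from functools import reduce
from itertools import combinations
from math import gcd, prod

# Hypothetical toy transactions; not taken from the paper.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
    {"a", "b", "c"},
]


def first_primes(n):
    """Return the first n primes by trial division (adequate for toy sizes)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def encode(transactions):
    """Map each item to a distinct prime and each transaction to a product."""
    items = sorted(set().union(*transactions))
    prime_of = dict(zip(items, first_primes(len(items))))
    codes = [prod(prime_of[i] for i in t) for t in transactions]
    return prime_of, codes


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of closed frequent itemsets (illustrative only).

    An itemset is kept when its support reaches min_support and its closure,
    computed as the GCD of all transactions containing it, equals the itemset.
    """
    prime_of, codes = encode(transactions)
    items = sorted(prime_of)
    closed = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            code = prod(prime_of[i] for i in combo)
            # A transaction contains the itemset iff its code is divisible.
            covering = [c for c in codes if c % code == 0]
            if len(covering) < min_support:
                continue
            # GCD of square-free products is the product of the shared primes,
            # i.e. the intersection of the covering transactions.
            closure = reduce(gcd, covering, 0)
            if closure == code:
                closed[frozenset(combo)] = len(covering)
    return closed
```

With the toy data above and a minimum support of 2, `{a}` is frequent but not closed, because every transaction containing `a` also contains `b`; the sketch therefore reports `{a, b}` instead. Enumerating all candidate itemsets is exponential and is used here only to keep the example short.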

Keywords

  • Distributed information systems
  • Data analytics
  • Closed frequent itemsets

Notes

Acknowledgments

This work has been partially funded by the European Commission under the CloudDBAppliance project (grant 732051) and performed in the context of the Computational Biology Institute in Montpellier.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Mehdi Zitouni (1, 2)
  • Reza Akbarinia (1)
  • Sadok Ben Yahia (2)
  • Florent Masseglia (1)
  1. INRIA, Montpellier, France
  2. Faculté des Sciences de Tunis, LIPAH-LR 11ES14, Université de Tunis El Manar, Tunis, Tunisia