Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce

  • Saber Salah
  • Reza Akbarinia
  • Florent Masseglia (email author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9166)

Abstract

Despite crucial recent advances, the problem of frequent itemset mining still faces major challenges. This is particularly the case when (i) the mining process must be massively distributed; and (ii) the minimum support (MinSup) is very low. In this paper, we study the effectiveness and leverage of specific data placement strategies for improving parallel frequent itemset mining (PFIM) performance in MapReduce, a highly distributed computation framework. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the effectiveness of itemset discovery does not depend only on the deployed algorithms. We propose ODPR (Optimal Data-Process Relationship), a solution for fast mining of frequent itemsets in MapReduce. Our method allows discovering itemsets from massive datasets, where standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated using real-world datasets, and the results illustrate a significant scale-up obtained with very low MinSup, which confirms the effectiveness of our approach.

Keywords

Frequent Itemsets, Data Placement, Hadoop Distributed File System, Frequent Itemset Mining, Input Split
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Saber Salah (1)
  • Reza Akbarinia (1)
  • Florent Masseglia (1), email author
  1. Inria and LIRMM, Zenith Team, University of Montpellier, Montpellier, France
