Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

Despite crucial recent advances, frequent itemset mining still faces major challenges, particularly when (i) the mining process must be massively distributed and (ii) the minimum support (MinSup) is very low. In this paper, we study the effectiveness and leverage of specific data placement strategies for improving parallel frequent itemset mining (PFIM) performance in MapReduce, a highly distributed computation framework. By combining a clever data placement with an optimal organization of the extraction algorithms, we show that the effectiveness of itemset discovery depends not only on the deployed algorithms but also on how the data and the processes are arranged. We propose ODPR (Optimal Data-Process Relationship), a solution for fast mining of frequent itemsets in MapReduce. Our method discovers itemsets from massive datasets on which standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated on real-world datasets, and the results show a significant scale-up at very low MinSup, which confirms the effectiveness of our approach.
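To make the terms in the abstract concrete, the sketch below shows what counting itemset support across data splits can look like in a map/reduce style. It is a naive illustration only and is not the ODPR method: the function names `map_split` and `reduce_counts`, the `max_size` cap, and the toy data are all assumptions introduced here, and MinSup is assumed to be a relative threshold (a fraction of all transactions).

```python
from collections import Counter
from itertools import combinations


def map_split(split, max_size):
    """Map phase (hypothetical): count every itemset of up to max_size items in one split."""
    counts = Counter()
    for transaction in split:
        items = sorted(set(transaction))  # canonical order so equal itemsets share one key
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def reduce_counts(partial_counts, total_transactions, min_sup):
    """Reduce phase (hypothetical): sum the local counts and keep itemsets meeting MinSup."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    threshold = min_sup * total_transactions
    return {itemset: c for itemset, c in total.items() if c >= threshold}


# Toy dataset already divided into two splits, as a distributed file system might do.
splits = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"], ["a", "b", "c"]],
]
n_transactions = sum(len(s) for s in splits)
partials = [map_split(s, max_size=2) for s in splits]
frequent = reduce_counts(partials, n_transactions, min_sup=0.5)
```

Lowering `min_sup` enlarges the set of itemsets that survive the reduce step, and the number of candidate itemsets each mapper enumerates grows quickly with `max_size`; this is one way to see why a very low MinSup is hard in a distributed setting.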

Saber Salah: This work has been partially supported by the Inria Project Lab Hemera.

Author information

Corresponding author

Correspondence to Florent Masseglia.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Salah, S., Akbarinia, R., Masseglia, F. (2015). Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_15

  • DOI: https://doi.org/10.1007/978-3-319-21024-7_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21023-0

  • Online ISBN: 978-3-319-21024-7

  • eBook Packages: Computer Science, Computer Science (R0)
