Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce

Salah, Saber; Akbarinia, Reza; Masseglia, Florent

doi:10.1007/978-3-319-21024-7_15

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

Despite crucial recent advances, frequent itemset mining still faces major challenges, particularly when (i) the mining process must be massively distributed and (ii) the minimum support (MinSup) is very low. In this paper, we study how effective specific data placement strategies are at improving parallel frequent itemset mining (PFIM) performance in MapReduce, a highly distributed computation framework. By combining a careful data placement with an optimal organization of the extraction algorithms, we show that the effectiveness of itemset discovery does not depend only on the deployed algorithms. We propose ODPR (Optimal Data-Process Relationship), a solution for fast mining of frequent itemsets in MapReduce. Our method allows itemsets to be discovered from massive datasets on which standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated on real-world datasets, and the results show a significant scale-up at very low MinSup, which confirms the effectiveness of our approach.
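To make the problem setting concrete, the following is a minimal sketch of frequent itemset counting in a map/reduce style, written in plain Python rather than on Hadoop. It is not the ODPR method described in the abstract; the function names (`map_split`, `reduce_counts`, `frequent_itemsets`), the toy transactions, and the way the data is split are all assumptions made for illustration. Each "split" stands for the portion of the dataset handled by one mapper, and MinSup is taken as a fraction of the total number of transactions.

```python
from collections import Counter
from itertools import combinations


def map_split(transactions, k):
    """Map phase (illustrative): count every k-itemset within one data split."""
    counts = Counter()
    for transaction in transactions:
        # Sort and deduplicate so each itemset has one canonical tuple form.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def reduce_counts(partial_counts):
    """Reduce phase (illustrative): sum the per-split counts of each itemset."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def frequent_itemsets(splits, k, min_sup):
    """Return {itemset: support} for k-itemsets whose support is >= min_sup."""
    n_transactions = sum(len(split) for split in splits)
    total = reduce_counts(map_split(split, k) for split in splits)
    return {
        itemset: count / n_transactions
        for itemset, count in total.items()
        if count / n_transactions >= min_sup
    }


# Hypothetical toy dataset: five transactions placed on two splits.
splits = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "d"], ["a", "b", "d"]],
]
result = frequent_itemsets(splits, k=2, min_sup=0.5)
print(result)
```

In this sketch, lowering `min_sup` increases the number of itemsets that survive the filter, and every candidate is counted in every split. That growth in intermediate data is one reason a very low MinSup is hard to handle in a distributed setting.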

Saber Salah: This work has been partially supported by the Inria Project Lab Hemera.

Author information

Authors and Affiliations

Inria and LIRMM, Zenith Team, University of Montpellier, Montpellier, France
Saber Salah, Reza Akbarinia & Florent Masseglia

Corresponding author

Correspondence to Florent Masseglia.

Editor information

Editors and Affiliations

IBaI, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Salah, S., Akbarinia, R., Masseglia, F. (2015). Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_15

DOI: https://doi.org/10.1007/978-3-319-21024-7_15
Published: 01 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)