MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets

Duong, Khanh-Chuong; Bamha, Mostafa; Giacometti, Arnaud; Li, Dominique; Soulet, Arnaud; Vrain, Christel

doi:10.1007/978-3-662-58415-6_7

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 11310)

Abstract

Mining frequent itemsets in large datasets with the MapReduce programming model has received much attention in recent years. For instance, many efficient Frequent Itemset Mining (FIM) algorithms have been parallelized on MapReduce, such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most approaches focus on job partitioning and/or load balancing without considering how their extensibility depends on the memory they assume is available. A challenge in designing parallel FIM algorithms is therefore to guarantee that the data structures used during the mining process always fit in the local memory of the processing nodes at every computation step. In this paper, we propose MapFIM+, a two-phase approach to frequent itemset mining in very large datasets that benefits from both a MapReduce-based distributed Apriori method and local in-memory FIM methods. In our approach, MapReduce is first used to generate frequent itemsets until the prefix-projected databases fit in local memory; an optimized local in-memory mining process is then launched to generate all remaining frequent itemsets from each prefix-projected database on individual processing nodes. MapFIM+ improves our previous algorithm MapFIM by using an exact evaluation of prefix-projected database sizes during the MapReduce phase. This improvement makes MapFIM+ more efficient, especially for databases leading to huge candidate sets, by significantly reducing communication and disk I/O costs. Performance evaluation shows that MapFIM+ is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
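The hand-off between the two phases can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names, the convention that a projection keeps only items greater than the last prefix item, and the use of a total item count as the "size" compared against a budget `max_items` are all assumptions made for illustration.

```python
def prefix_projected_db(transactions, prefix):
    """Project the transactions on `prefix` (a sorted tuple of items).

    Assumed convention: keep each transaction that contains the whole
    prefix, reduced to the items strictly greater than the last prefix item.
    """
    prefix_set = set(prefix)
    last = max(prefix)
    projected = []
    for transaction in transactions:
        items = set(transaction)
        if prefix_set <= items:
            suffix = sorted(i for i in items if i > last)
            if suffix:  # empty suffixes cannot extend the prefix
                projected.append(suffix)
    return projected


def projected_size(projected):
    """Size measure assumed here: total number of items in the projection."""
    return sum(len(t) for t in projected)


def fits_in_memory(transactions, prefix, max_items):
    """True if the projection on `prefix` is within the (hypothetical) budget."""
    return projected_size(prefix_projected_db(transactions, prefix)) <= max_items


transactions = [[1, 2, 3], [1, 3, 4], [2, 3], [1, 2, 4]]
print(prefix_projected_db(transactions, (1,)))
print(fits_in_memory(transactions, (1,), max_items=4))
print(fits_in_memory(transactions, (1, 2), max_items=4))
```

Under these assumptions, a prefix whose projection exceeds the budget would stay in the distributed Apriori phase, while a prefix whose projection fits would be handed to a local in-memory miner.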

Notes

1. In our work, in order to generate each candidate only once, we use a prefix-based join operation. More precisely, given two sets of frequent \(k\)-itemsets \(\mathcal{L}_{k}\) and \(\mathcal{L}'_{k}\), the join of \(\mathcal{L}_{k}\) and \(\mathcal{L}'_{k}\) is defined by: \(\mathcal{L}_{k} \bowtie \mathcal{L}'_{k} = \{ (i_1, \dots, i_k, i_{k+1}) \mid (i_1, \dots, i_{k-1}, i_{k}) \in \mathcal{L}_{k} \wedge (i_1, \dots, i_{k-1}, i_{k+1}) \in \mathcal{L}'_{k} \wedge i_1 < \dots < i_{k} < i_{k+1} \}\).
2. In our configuration, there is no real performance difference between Hadoop 1.2.1 and Hadoop 2.7.3.
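The prefix-based join of note 1 can be sketched in a few lines of Python. This is an illustrative reading of the definition, not code from the chapter: the name `prefix_join`, the representation of itemsets as sorted tuples, and the dictionary index are assumptions.

```python
def prefix_join(lk, lk_prime):
    """Join two collections of k-itemsets given as sorted tuples.

    Two itemsets are joined when they share their first k-1 items and the
    last item of the first is smaller than the last item of the second,
    so each (k+1)-candidate is produced from exactly one pair.
    """
    # Index the second collection by its (k-1)-item prefix.
    last_items_by_prefix = {}
    for itemset in lk_prime:
        last_items_by_prefix.setdefault(itemset[:-1], []).append(itemset[-1])

    candidates = set()
    for itemset in lk:
        prefix, last = itemset[:-1], itemset[-1]
        for other_last in last_items_by_prefix.get(prefix, []):
            if last < other_last:  # enforces i_k < i_{k+1}
                candidates.add(itemset + (other_last,))
    return candidates


frequent_2 = {(1, 2), (1, 3), (2, 3)}
print(prefix_join(frequent_2, frequent_2))
```

Because the ordering condition admits only one pair per candidate, the candidate (1, 2, 3) arises only from (1, 2) and (1, 3), never from (1, 3) and (1, 2).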

References

Aggarwal, C.C., Han, J.: Frequent Pattern Mining. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07821-2
Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of VLDB 1994, vol. 1215, pp. 487–499 (1994)
Al Hajj Hassan, M., Bamha, M.: Towards scalability and data skew handling in groupby-joins using MapReduce model. In: Proceedings of ICCS 2015, pp. 70–79 (2015)
Al Hajj Hassan, M., Bamha, M., Loulergue, F.: Handling data-skew effects in join operations using MapReduce. In: Proceedings of ICCS 2014, pp. 145–158. IEEE (2014)
Beedkar, K., Berberich, K., Gemulla, R., Miliaraki, I.: Closing the gap: sequence mining at scale. ACM Trans. Database Syst. 40(2), 8:1–8:44 (2015)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Duong, K.-C., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C.: MapFIM: memory aware parallelized frequent itemset mining in very large datasets. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10438, pp. 478–495. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64468-4_36
Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E.Y.: PFP: parallel FP-growth for query recommendation. In: Proceedings of RecSys 2008, pp. 107–114. ACM (2008)
Li, N., Zeng, L., He, Q., Shi, Z.: Parallel implementation of apriori algorithm based on MapReduce. In: Proceedings of SNDP 2012, pp. 236–241. IEEE (2012)
Lin, M.-Y., Lee, P.-Y., Hsueh, S.-C.: Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of ICUIMC 2012, pp. 76:1–76:8 (2012)
Lucchese, C., Orlando, S., Perego, R., Silvestri, F.: WebDocs: a real-life huge transactional dataset. In: FIMI, vol. 126 (2004)
Apache Mahout: Scalable machine learning and data mining (2012)
Makanju, A., Farzanyar, Z., An, A., Cercone, N., Hu, Z.Z., Hu, Y.: Deep parallelization of parallel FP-growth using parent-child MapReduce. In: Proceedings of BigData 2016, pp. 1422–1431. IEEE (2016)
Miliaraki, I., Berberich, K., Gemulla, R., Zoupanos, S.: Mind the gap: large-scale frequent sequence mining. In: Proceedings of SIGMOD 2013, pp. 797–808. ACM (2013)
Moens, S., Aksehirli, E., Goethals, B.: Frequent itemset mining for big data. In: Proceedings of BigData 2013, pp. 111–118. IEEE (2013)
Salah, S., Akbarinia, R., Masseglia, F.: Data partitioning for fast mining of frequent itemsets in massively distributed environments. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds.) DEXA 2015. LNCS, vol. 9261, pp. 303–318. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22849-5_21
Salah, S., Akbarinia, R., Masseglia, F.: Optimizing the data-process relationship for fast mining of frequent itemsets in MapReduce. In: Perner, P. (ed.) MLDM 2015. LNCS (LNAI), vol. 9166, pp. 217–231. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21024-7_15
Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: Proceedings of VLDB 1995, pp. 432–444. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995)
Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM: an efficient algorithm for enumerating frequent closed item sets. In: FIMI, vol. 90. Citeseer (2003)
Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., et al.: New algorithms for fast discovery of association rules. In: KDD, vol. 97, pp. 283–286 (1997)
Zhou, L., Zhong, Z., Chang, J., Li, J., Huang, J.Z., Feng, S.: Balanced parallel FP-growth with MapReduce. In: Proceedings of YC-ICT 2010, pp. 243–246 (2010)

Acknowledgement

This work is partly supported by the GIRAFON project funded by Centre-Val de Loire region (France).

Author information

Authors and Affiliations

Université de Tours, LIFAT EA 6300, Blois, France
Khanh-Chuong Duong, Arnaud Giacometti, Dominique Li & Arnaud Soulet
Université d’Orléans, INSA Centre Val de Loire, LIFO EA 4022, Orléans, France
Khanh-Chuong Duong, Mostafa Bamha & Christel Vrain

Corresponding author

Correspondence to Mostafa Bamha.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Roland Wagner
IUT, University Lyon 1, Villeurbanne Cedex, France
Djamal Benslimane
University of Milan, Crema, Italy
Ernesto Damiani
University of Michigan-Dearborn, Dearborn, MI, USA
William I. Grosky

About this chapter

Cite this chapter

Duong, KC., Bamha, M., Giacometti, A., Li, D., Soulet, A., Vrain, C. (2018). MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets. In: Hameurlain, A., Wagner, R., Benslimane, D., Damiani, E., Grosky, W. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science, vol 11310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58415-6_7

DOI: https://doi.org/10.1007/978-3-662-58415-6_7
Published: 23 November 2018
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58414-9
Online ISBN: 978-3-662-58415-6
eBook Packages: Computer Science, Computer Science (R0)
