Abstract
Despite crucial recent advances, frequent itemset mining still faces major challenges, particularly when (i) the mining process must be massively distributed and (ii) the minimum support (MinSup) is very low. In this paper, we study how specific data placement strategies can be leveraged to improve parallel frequent itemset mining (PFIM) performance in MapReduce, a highly distributed computation framework. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the effectiveness of itemset discovery does not depend only on the deployed algorithms. We propose ODPR (Optimal Data-Process Relationship), a solution for fast mining of frequent itemsets in MapReduce. Our method allows discovering itemsets from massive datasets, where standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated using real-world datasets, and the results show a significant scale-up with very low MinSup, which confirms the effectiveness of our approach.
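To make the setting concrete, the following is a minimal sketch of support counting in a map/reduce style. It is not ODPR and not any of the algorithms evaluated in the paper; it only illustrates what "frequent itemset" and MinSup mean when the data is split across mappers. The function names, the toy transactions, and the choice to express MinSup as a fraction of all transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(split, k):
    """Emit (itemset, 1) for every k-item subset of each transaction in one split."""
    for transaction in split:
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def reduce_phase(pairs):
    """Sum the emitted counts per itemset, as a reducer would after the shuffle."""
    counts = Counter()
    for itemset, count in pairs:
        counts[itemset] += count
    return counts


def frequent_itemsets(splits, k, min_sup):
    """Return the k-itemsets whose relative support is at least min_sup."""
    total = sum(len(split) for split in splits)
    pairs = (pair for split in splits for pair in map_phase(split, k))
    counts = reduce_phase(pairs)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_sup
    }


# Toy data (hypothetical): two splits standing in for data placed on two mappers.
splits = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "b", "d"], ["c", "d"], ["a", "c"]],
]
result = frequent_itemsets(splits, k=2, min_sup=0.5)
print(result)
```

Lowering `min_sup` in this sketch makes many more candidate itemsets survive the filter, which hints at why very low MinSup values are costly at scale.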
Saber Salah: This work has been partially supported by the Inria Project Lab Hemera.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Salah, S., Akbarinia, R., Masseglia, F. (2015). Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_15
DOI: https://doi.org/10.1007/978-3-319-21024-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)