Abstract
Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the important recent advances in the data mining literature, few solutions, whether standard or improved, scale. This is particularly the case when (1) the quantity of data is very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By combining a careful data placement with a suitable organization of the extraction algorithms, we show that the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). The P2S algorithm discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. Its mining process consists of only one parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead on a distributed computational platform. Our proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the significant scale-up obtained with very low minimum supports compared to other alternatives.
Acknowledgements
Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).
Appendix: Two rounds of closed itemset mining
Closed itemsets are a compact representation of the whole set of frequent itemsets. Very efficient solutions have been proposed for their extraction, with performance improvements of orders of magnitude. Therefore, one would be tempted to apply frequent closed itemset mining algorithms as a local solution for pattern extraction in the first step of a 2-job schema mining algorithm such as P2S or PATD. However, frequent closed itemsets have very constraining characteristics that may prevent their use in a 2-job approach. Let us consider the illustration given by Example 5, which shows that a global frequent closed itemset might never be a local frequent closed itemset (in any partition).
Example 5
Table 2 presents a database \(\mathcal{D}\) with 12 transactions. Suppose we divide \(\mathcal{D}\) into four data partitions \(P_{1}=\{T_{1}, T_{2}, T_{3}\}\), \(P_{2}=\{T_{4}, T_{5}, T_{6}\}\), \(P_{3}=\{T_{7}, T_{8}, T_{9}\}\) and \(P_{4}=\{T_{10}, T_{11}, T_{12}\}\), where \(\mathcal{D}=P_{1} \cup P_{2} \cup P_{3} \cup P_{4}\).
The set of global frequent closed itemsets on \(\mathcal{D}\), along with their supports, is: \(\{(C:12), (A,C:8), (C,D:8), (C,E:8)\}\). In this set, the only itemset that is also a local frequent closed itemset is (C, D), because it is supported by 3 transactions in \(P_3\) and it has no superset with the same support. All the remaining global frequent closed itemsets are either infrequent or not closed in the partitions. Let us consider, for instance, the itemset (C). In partition \(P_1\), it has the same support as (A, C, E). In partition \(P_2\), it has the same support as (C, F). In partition \(P_3\), it has the same support as (C, D). In partition \(P_4\), it has the same support as (B, C). Hence (C) has a superset with the same support in every partition, so it is not closed in any of them.
This simple counterexample shows that, even though closed frequent itemsets are an appealing research track for distributed environments, their usage calls for particular care. It would be interesting to investigate their properties in a distributed scheme like PATD, but it calls for proofs, or counterexamples, of their compatibility with such a scheme.
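The same phenomenon can be reproduced on a much smaller database. The following Python sketch is only an illustration under stated assumptions: the four-transaction database, its split into two partitions, the 50% relative minimum support and the helper names `support` and `closed_frequent_itemsets` are all hypothetical and are not taken from Table 2 or from the P2S and PATD implementations. It enumerates frequent closed itemsets by brute force, globally and per partition, and lists the global ones that are closed in no partition.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_count):
    """Brute-force enumeration of frequent closed itemsets (toy sizes only)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_count:
                frequent[candidate] = count
    # An itemset is closed if no proper superset has the same support.
    # Such a superset would itself be frequent, so checking `frequent` suffices.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }


# Hypothetical toy database (NOT the database of Table 2).
database = [frozenset("AC"), frozenset("AC"), frozenset("BC"), frozenset("BC")]
partitions = [database[:2], database[2:]]
MIN_SUPPORT = 0.5  # assumed relative minimum support

global_closed = closed_frequent_itemsets(database, MIN_SUPPORT * len(database))
local_closed = [
    closed_frequent_itemsets(p, MIN_SUPPORT * len(p)) for p in partitions
]

# Global frequent closed itemsets that are closed in no partition.
lost = [s for s in global_closed if not any(s in lc for lc in local_closed)]
print("global closed:", {tuple(sorted(s)): c for s, c in global_closed.items()})
print("closed in no partition:", [tuple(sorted(s)) for s in lost])
```

In this toy setting, the itemset (C) is globally closed, yet in each partition it has the same support as a superset, (A, C) or (B, C), so a first job that only emits local closed itemsets would never report it.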
Cite this article
Salah, S., Akbarinia, R. & Masseglia, F. Data placement in massively distributed environments for fast parallel mining of frequent itemsets. Knowl Inf Syst 53, 207–237 (2017). https://doi.org/10.1007/s10115-017-1041-5
DOI: https://doi.org/10.1007/s10115-017-1041-5