Data placement in massively distributed environments for fast parallel mining of frequent itemsets

Salah, Saber; Akbarinia, Reza; Masseglia, Florent

doi:10.1007/s10115-017-1041-5

Regular Paper
Published: 24 March 2017

Knowledge and Information Systems, volume 53, pages 207–237 (2017)

Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the significant recent advances in the data mining literature, few solutions, whether standard or improved, scale. This is particularly the case when (1) the quantity of data is very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). The P2S algorithm discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. Its mining process consists of a single parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead on a distributed computational platform. Our proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the significant scale-up obtained with very low minimum supports compared to other alternatives.

Acknowledgements

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Author information

Authors and Affiliations

Inria and Lirmm, Montpellier, France
Saber Salah, Reza Akbarinia & Florent Masseglia

Corresponding author

Correspondence to Saber Salah.

Appendix: Two rounds of closed itemset mining

Table 2 Database \(\mathcal{D}\)

Closed itemsets provide a compact representation of the whole set of frequent itemsets. Very efficient solutions have been proposed for their extraction, with performance improvements of orders of magnitude. One would therefore be tempted to apply a frequent closed itemset mining algorithm as the local solution for pattern extraction in the first step of a 2-job mining scheme such as P2S or PATD. However, frequent closed itemsets have very constraining characteristics that may prevent their use in a 2-job approach. Let us consider the illustration given by Example 5, which shows that a global frequent closed itemset might not be a local frequent closed itemset in any partition.

Example 5

Table 2 presents a database \(\mathcal{D}\) with 12 transactions. Suppose we divide \(\mathcal{D}\) into four data partitions \(P_{1}=\{T_{1}, T_{2}, T_{3}\}\), \(P_{2}=\{T_{4}, T_{5}, T_{6}\}\), \(P_{3}=\{T_{7}, T_{8}, T_{9}\}\) and \(P_{4}=\{T_{10}, T_{11}, T_{12}\}\), where \(\mathcal{D}=P_{1} \cup P_{2} \cup P_{3} \cup P_{4}\).

The set of global frequent closed itemsets on \(\mathcal{D}\), along with their supports, is \(\{(C:12), (A,C:8), (C,D:8), (C,E:8)\}\). In this set, the only itemset that is also a local frequent closed itemset is (C, D), because it is supported by 3 transactions in \(P_3\) and has no superset with the same support in that partition. All the remaining global frequent closed itemsets are either infrequent or not closed in the partitions. Consider, for instance, the itemset (C). In partition \(P_1\), it has the same support as (A, C, E). In partition \(P_2\), it has the same support as (C, F). In partition \(P_3\), it has the same support as (C, D). In partition \(P_4\), it has the same support as (B, C).

This simple counterexample shows that, even though closed frequent itemsets are an appealing research track for distributed environments, their usage calls for particular care. It would be interesting to investigate their properties in a distributed scheme like PATD, but doing so requires proofs, or counterexamples, of their compatibility with such a scheme.
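
The effect described in Example 5 can be reproduced on a much smaller database. The following Python sketch is only an illustration under stated assumptions: the four-transaction database, its split into two partitions, the 50% relative minimum support and the helper names `support` and `closed_frequent_itemsets` are all hypothetical and are not taken from Table 2 or from the P2S and PATD implementations. It enumerates itemsets by brute force, which is exponential in the number of items and only suitable for toy inputs.

```python
from itertools import combinations
from math import ceil


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_ratio):
    """Brute-force closed frequent itemsets of a small transaction list.

    `min_ratio` is a relative minimum support in (0, 1].
    Returns a dict mapping each closed frequent itemset to its support.
    """
    min_count = ceil(min_ratio * len(transactions))
    items = sorted(set().union(*transactions))

    # Enumerate every candidate itemset and keep the frequent ones.
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_count:
                frequent[candidate] = count

    # An itemset is closed if no proper superset has the same support.
    # A superset with equal support is itself frequent, so checking
    # against the frequent itemsets is sufficient.
    return {
        x: sx
        for x, sx in frequent.items()
        if not any(x < y and sx == sy for y, sy in frequent.items())
    }


# Hypothetical toy database (NOT Table 2), split into two partitions.
partitions = [
    [frozenset("AC"), frozenset("AC")],
    [frozenset("BC"), frozenset("BC")],
]
database = [t for p in partitions for t in p]

global_closed = closed_frequent_itemsets(database, 0.5)
local_closed = [closed_frequent_itemsets(p, 0.5) for p in partitions]

print("global:", {"".join(sorted(x)): s for x, s in global_closed.items()})
for i, lc in enumerate(local_closed, start=1):
    print(f"P{i}:", {"".join(sorted(x)): s for x, s in lc.items()})
```

In this toy setting, (C) is a global frequent closed itemset with support 4, yet it is closed in neither partition: in the first partition it has the same support as (A, C), and in the second the same support as (B, C). A first job that reports only local closed itemsets would therefore never emit (C) as a candidate.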

About this article

Cite this article

Salah, S., Akbarinia, R. & Masseglia, F. Data placement in massively distributed environments for fast parallel mining of frequent itemsets. Knowl Inf Syst 53, 207–237 (2017). https://doi.org/10.1007/s10115-017-1041-5

Received: 25 November 2015
Accepted: 11 February 2017
Published: 24 March 2017
Issue Date: October 2017
DOI: https://doi.org/10.1007/s10115-017-1041-5
