Data placement in massively distributed environments for fast parallel mining of frequent itemsets

  • Regular Paper
Knowledge and Information Systems

Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the crucial recent advances in the data mining literature, few of the standard or improved solutions scale. This is particularly the case when (1) the quantity of data tends to be very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). The P2S algorithm discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. Its mining process consists of only one parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead in a distributed computational platform. Our proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the significant scale-up obtained with very low minimum supports compared to other alternatives.



Acknowledgements

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Author information

Corresponding author

Correspondence to Saber Salah.

Appendix: Two rounds of closed itemset mining

Table 2 Database \(\mathcal{D}\)

Closed itemsets provide a compact representation of the whole set of frequent itemsets. Very efficient solutions have been proposed for their extraction, with performance improvements of orders of magnitude. One would therefore be tempted to apply frequent closed itemset mining algorithms as a local solution for pattern extraction in the first step of a 2-job mining scheme such as P2S or PATD. However, frequent closed itemsets have very constraining characteristics that may prevent their use in a 2-job approach. The illustration given in Example 5 shows that a global frequent closed itemset might not be a local frequent closed itemset in any partition.

Example 5

Table 2 presents a database \(\mathcal{D}\) with 12 transactions. Suppose we divide \(\mathcal{D}\) into four data partitions \(P_{1}=\{T_{1}, T_{2}, T_{3}\}\), \(P_{2}=\{T_{4}, T_{5}, T_{6}\}\), \(P_{3}=\{T_{7}, T_{8}, T_{9}\}\) and \(P_{4}=\{T_{10}, T_{11}, T_{12}\}\), where \(\mathcal{D}=P_{1} \cup P_{2} \cup P_{3} \cup P_{4}\).

The set of global frequent closed itemsets on \(\mathcal{D}\), along with their supports, is: \(\{(C:12) (A,C:8) (C,D:8) (C,E:8)\}\). In this set, the only itemset that is also a local frequent closed itemset is (CD), because it is supported by 3 transactions in \(P_3\) and has no superset with the same support. All the remaining global frequent closed itemsets are either infrequent or not closed in the partitions. Consider, for instance, the itemset (C). In partition \(P_1\), it has the same support as (ACE). In partition \(P_2\), it has the same support as (CF). In partition \(P_3\), it has the same support as (CD). In partition \(P_4\), it has the same support as (BC).

This simple counterexample shows that, even though closed frequent itemsets are an appealing research track for distributed environments, their usage calls for particular care. It would be interesting to investigate their properties in a distributed scheme like PATD, but it calls for proofs, or counterexamples, of their compatibility with such a scheme.
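The effect described in Example 5 can be reproduced with a short brute-force sketch. The code below does not use the contents of Table 2; it assumes a much smaller, hypothetical toy database of four transactions split into two partitions, and the function name `closed_frequent_itemsets` and the relative threshold `min_support` are illustrative choices rather than part of P2S or PATD. It enumerates all itemsets, so it is only suitable for tiny inputs.

```python
import math
from itertools import combinations


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force closed frequent itemsets (illustrative only, tiny inputs).

    transactions: list of frozensets of items.
    min_support: relative threshold in (0, 1].
    Returns a dict mapping each closed frequent itemset to its support count.
    """
    min_count = math.ceil(min_support * len(transactions))
    items = sorted(set().union(*transactions))

    # Count the support of every candidate itemset and keep the frequent ones.
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                frequent[itemset] = count

    # An itemset is closed if no proper superset has the same support.
    # Such a superset would itself be frequent, so checking `frequent` suffices.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }


# Hypothetical toy database (NOT Table 2 of the paper), split into two partitions.
partitions = [
    [frozenset("AC"), frozenset("AC")],
    [frozenset("BC"), frozenset("BC")],
]
database = [t for part in partitions for t in part]

global_closed = closed_frequent_itemsets(database, min_support=0.5)
local_closed = [closed_frequent_itemsets(p, min_support=0.5) for p in partitions]

# Global closed itemsets that are not closed frequent in any single partition.
missing = [s for s in global_closed if all(s not in lc for lc in local_closed)]
print(sorted("".join(sorted(s)) for s in missing))
```

In this toy setting, (C) is globally closed but is absorbed by (AC) in the first partition and by (BC) in the second, so a first job that emits only local closed itemsets would never propose it as a candidate.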

Cite this article

Salah, S., Akbarinia, R. & Masseglia, F. Data placement in massively distributed environments for fast parallel mining of frequent itemsets. Knowl Inf Syst 53, 207–237 (2017). https://doi.org/10.1007/s10115-017-1041-5

  • DOI: https://doi.org/10.1007/s10115-017-1041-5
