
Knowledge and Information Systems, Volume 53, Issue 1, pp 207–237

Data placement in massively distributed environments for fast parallel mining of frequent itemsets

  • Saber Salah
  • Reza Akbarinia
  • Florent Masseglia
Regular Paper

Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the crucial recent advances in the data mining literature, few solutions, whether standard or improved, scale. This is particularly the case when (1) the quantity of data is very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By combining a clever data placement with an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). P2S discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact: its mining process consists of a single parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead on a distributed computational platform. Our approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals, which scale up substantially better than the alternatives at very low minimum supports.
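To make the notions of minimum support and a two-job mining process concrete, the following is a minimal single-machine sketch of a generic two-pass, partition-based scheme. It is not the authors' P2S or PATD implementation, which the abstract does not detail. The function names `local_frequent` and `global_frequent`, the brute-force local miner, and the use of a relative support ratio `min_ratio` are all assumptions made for illustration. The first pass mines each partition independently and keeps every locally frequent itemset as a candidate; the second pass counts the candidates over all partitions and keeps the globally frequent ones.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def local_frequent(partition, min_ratio):
    """Pass 1, run per partition: itemsets frequent within this partition.

    Brute-force enumeration, chosen for brevity only (hypothetical helper);
    it is exponential in transaction length.
    """
    threshold = ceil(min_ratio * len(partition))
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset for itemset, n in counts.items() if n >= threshold}


def global_frequent(partitions, min_ratio):
    """Two passes over partitioned data; returns {itemset: global count}."""
    # Pass 1: the union of locally frequent itemsets forms the candidate set.
    # Any globally frequent itemset is locally frequent in some partition.
    candidates = set()
    for partition in partitions:
        candidates |= local_frequent(partition, min_ratio)

    # Pass 2: count every candidate over the whole database.
    total = sum(len(partition) for partition in partitions)
    threshold = ceil(min_ratio * total)
    counts = Counter()
    for partition in partitions:
        for transaction in partition:
            items = set(transaction)
            for candidate in candidates:
                if items.issuperset(candidate):
                    counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Two partitions of two transactions each (toy data).
    data = [
        [["a", "b"], ["a", "c"]],
        [["a", "b"], ["b", "c"]],
    ]
    print(global_frequent(data, 0.5))
```

In a distributed setting, each call to `local_frequent` would correspond to an independent task of the first job and the counting loop to the second job. The sketch also hints at why data placement matters: the way transactions are split into partitions determines how many candidates the first pass produces.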

Keywords

Frequent itemsets, Massive distribution, Data placement, MapReduce

Notes

Acknowledgements

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).


Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  • Saber Salah (1)
  • Reza Akbarinia (1)
  • Florent Masseglia (1)
  1. Inria and Lirmm, Montpellier, France
