Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

  • Saber Salah
  • Reza Akbarinia
  • Florent Masseglia (email author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9261)

Abstract

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when (i) the amount of data is very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead on a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.
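
To make the idea of mining item-based partitions independently against an absolute threshold more concrete, the following is a minimal single-machine sketch in Python. It is not the PATD or IBDP procedure from the paper, whose details the abstract does not give. It assumes a simplified scheme in which every item is owned by exactly one partition, each transaction is copied to every partition owning at least one of its items, and a partition reports only itemsets whose smallest item it owns. Under these assumptions the local count of a reported itemset equals its global support, so an absolute threshold can be applied locally. All function names and the toy data are hypothetical, and the brute-force enumeration is only suitable for tiny examples.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil


def absolute_min_support(relative_min_sup, n_transactions):
    """Convert a relative threshold (a fraction) into an absolute count."""
    return ceil(relative_min_sup * n_transactions)


def partition_by_item(transactions, item_groups):
    """Copy each transaction to every partition owning at least one of its items."""
    partitions = [[] for _ in item_groups]
    for t in transactions:
        for i, group in enumerate(item_groups):
            if t & group:
                partitions[i].append(t)
    return partitions


def mine_partition(partition, owned_items, abs_min_sup):
    """Brute-force mining of one partition (assumed scheme, not the paper's).

    Only itemsets whose smallest item is owned by this partition are counted.
    Every transaction containing such an itemset contains that item, so it is
    present in this partition and the local count is the global support.
    """
    counts = defaultdict(int)
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                if itemset[0] in owned_items:
                    counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= abs_min_sup}


def mine_all(transactions, item_groups, relative_min_sup):
    """Mine every partition independently and merge the disjoint results."""
    threshold = absolute_min_support(relative_min_sup, len(transactions))
    partitions = partition_by_item(transactions, item_groups)
    frequent = {}
    for part, owned in zip(partitions, item_groups):
        frequent.update(mine_partition(part, owned, threshold))
    return frequent


# Toy data: six transactions, items split into two hypothetical groups.
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b", "c"}, {"a", "b", "c"}, {"d"},
]
item_groups = [{"a", "d"}, {"b", "c"}]
result = mine_all(transactions, item_groups, relative_min_sup=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

In this sketch no partition needs to see results from any other partition, which is the property that lets the partitions be processed in a single parallel pass; how the paper achieves this at scale is not described in the abstract.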

Keywords

Machine learning, Data mining, Frequent itemset, Big data, MapReduce

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Saber Salah (1)
  • Reza Akbarinia (1)
  • Florent Masseglia (1, email author)
  1. Zenith Team, INRIA and LIRMM, University of Montpellier, Montpellier, France
