Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9261))

Abstract

Frequent itemset mining (FIM) is one of the cornerstones of data mining. Although the FIM problem has been thoroughly studied, few solutions, whether standard or improved, scale. This is mainly the case when (i) the amount of data is very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). PATD makes the mining of very large databases (up to terabytes of data) simple and compact. Its mining process consists of a single parallel job, which dramatically reduces the mining runtime, the communication cost, and the energy consumption overhead on a distributed computational platform. Based on an efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), PATD mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated on real-world data sets. Our experimental results suggest that PATD is significantly more efficient and scalable than alternative approaches.
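
The abstract does not spell out how IBDP builds its partitions, so the following is only a toy sketch of the general idea of mining partitions independently against an absolute support count. It rests on assumptions that are not taken from the paper: items are split into hypothetical groups, each transaction is copied to every partition whose group it intersects, and each partition reports only the itemsets whose smallest item belongs to its group. Under those assumptions, every transaction containing an itemset lands in the partition that owns it, so the local count equals the global count and the absolute threshold can be applied locally. The function names (`partition_by_item`, `mine_partition`, `mine_all`) are illustrative, and the brute-force enumeration is suitable only for tiny inputs.

```python
from collections import Counter
from itertools import combinations


def partition_by_item(transactions, item_groups):
    """Copy each transaction to every partition whose item group it intersects.

    Assumption for illustration only; this is not the paper's IBDP procedure.
    """
    partitions = [[] for _ in item_groups]
    for transaction in transactions:
        items = set(transaction)
        for idx, group in enumerate(item_groups):
            if items & group:
                partitions[idx].append(items)
    return partitions


def mine_partition(partition, group, abs_min_sup):
    """Brute-force count of itemsets whose smallest item is in `group`.

    Keeps itemsets whose count reaches the absolute threshold `abs_min_sup`.
    """
    counts = Counter()
    for items in partition:
        ordered = sorted(items)
        for size in range(1, len(ordered) + 1):
            for itemset in combinations(ordered, size):
                # The smallest item decides which partition owns the itemset.
                if itemset[0] in group:
                    counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= abs_min_sup}


def mine_all(transactions, item_groups, abs_min_sup):
    """Mine every partition independently and merge the results."""
    partitions = partition_by_item(transactions, item_groups)
    frequent = {}
    for partition, group in zip(partitions, item_groups):
        frequent.update(mine_partition(partition, group, abs_min_sup))
    return frequent


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"]]
    groups = [{"a"}, {"b", "c"}]
    for itemset, count in sorted(mine_all(data, groups, abs_min_sup=2).items()):
        print(itemset, count)
```

Because each itemset is owned by exactly one partition, the merge step needs no cross-partition aggregation, which is the property that makes a single independent pass per partition possible in this sketch.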

S. Salah: This work has been partially supported by the Inria Project Lab Hemera.

Author information

Corresponding author

Correspondence to Florent Masseglia.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Salah, S., Akbarinia, R., Masseglia, F. (2015). Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_21

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science (R0)
