PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique

Raj, Shashi; Ramesh, Dharavath

doi:10.1007/s10586-022-03673-5

PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique

Published: 03 August 2022

Volume 25, pages 4463–4480, (2022)
Cite this article

Cluster Computing Aims and scope Submit manuscript

233 Accesses
1 Citation
Explore all metrics

Abstract

Frequent itemset mining (FIM) is one of the prominent techniques to extract knowledge from transactional databases. Finding frequent itemsets becomes tiresome while dealing with large-scale datasets due to the enormous demand for computational resources. Cluster-based versions of FIM algorithms become a natural choice to make FIM algorithms efficient and scalable for large-scale datasets and to meet the ever-growing data needs. A variety of MapReduce-based algorithms on the Hadoop cluster has been developed to meet the expectations. Due to the iterative nature of the FIM algorithms, they are still unable to perform adequately. Bottlenecks associated with MapReduce-based FIMs include challenges originating from adapting FIM algorithms in parallel and distributed contexts, as well as reading and publishing intermediate results to HDFS, which necessitates significant disc and communication I/O. Many FIM algorithms have been redesigned on Spark to achieve efficiency for large-scale datasets by utilizing its in-memory processing capabilities. However, Spark-based FIM adaptations still face the same challenges except achieving memory efficiency. The limitations associated with basic FIM algorithms, such as repeated scanning of input data, massive candidate generation, and maintaining large conditional trees in memory, still need to be addressed. Also, tuning of Spark’s shuffle behavior becomes important to control the communication overhead. In this paper, we propose a Spark-based algorithm, namely PartEclat. It uses the Eclat method in combination with the partitioning technique to gain efficiency and scalability. Vertical data format helps to avoid repeated scanning of input data to calculate individual support values. It utilizes the benefits of the partitioning technique to limit the movement of key-value pairs across the cluster nodes (shuffle overheads) during iterations. Also, it helps to deal efficiently with the memory requirements to handle large Transaction ID (TID) sets and computational overhead imposed in computing the intersection of these large TID sets. In-depth experiments with benchmark datasets have been conducted to gain insight into the efficiency and scalability performance of the algorithm. Experimental results show that the proposed scheme outperforms other Spark-based Eclat variants by reducing network and computing loads by approx. 25–40%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Applying Partition Method to Adopt Spark-Based Eclat Algorithm for Large Transactional Datasets

A Spark-based Apriori algorithm with reduced shuffle overhead

Article 27 March 2020

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Article 07 April 2020

References

Aggarwal, C.C., Bhuiyan, M.A., Hasan, M.A.: Frequent pattern mining algorithms A survey. Frequent pattern mining Springer, Cham (2014)
Book Google Scholar
Han, J., Kamber, M., Pei, J.: Data mining concepts and techniques third edition. Morgan Kaufmann Series Data Manag Sys 5(4), 83–124 (2011)
Google Scholar
Zhao, Qiankun, and Sourav S. Bhowmick (2003) Association rule mining: A survey. Nanyang Technological University, Singapore 135
Agrawal, R., Srikant, R.: VLDB 1215, 487–499 (1994)
Google Scholar
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29(2), 1–12 (2000)
Article Google Scholar
Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372–390 (2000)
Article Google Scholar
Fan, W., Bifet, A.: Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsl 14(2), 1–5 (2013)
Article Google Scholar
Wu, X., Zhu, X., Gong-Qing, Wu., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)
Google Scholar
Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurr. 7(4), 14–25 (1999)
Article Google Scholar
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
Article Google Scholar
The Apache Software Foundation. “Apache Spark: Lightning-fast cluster computing”, March 2021, http://spark.apache.org/. Spark1.6.0
Li, Ning, Li Zeng, Qing He, and Zhongzhi Shi. Parallel implementation of apriori algorithm based on mapreduce. In 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing, pp. 236–241. IEEE, 2012.
Singh, Sudhakar, Rakhi Garg, and P. K. Mishra (2017) Review of apriori based algorithms on mapreduce framework." arXiv preprint arXiv:1702.06284
Luna, J.M., Padillo, F., Pechenizkiy, M., Ventura, S.: Apriori versions based on mapreduce for mining frequent patterns on big data. IEEE Transact Cybern 48(10), 2851–2865 (2017)
Article Google Scholar
Raj, S., Ramesh, D., Sethi, K.K.: A Spark-based Apriori algorithm with reduced shuffle overhead. J Supercomput 77, 133–151 (2021)
Article Google Scholar
Castro, E.P.S., Maia, T.D., Pereira, M.R., Esmin, A.A.A., Pereira, D.A.: Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review (2018). https://doi.org/10.1017/S0269888918000127
Article Google Scholar
Li, Haoyuan, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang (2008). Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 107–114
Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of frequent item sets. J. Parallel Distributed Comput. 61(3), 350–371 (2001)
Article MATH Google Scholar
Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using fp-trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005)
Article Google Scholar
Zhou, Le, Zhiyong Zhong, Jin Chang, Junjie Li, Joshua Zhexue Huang, and Shengzhong Feng (2010) Balanced parallel fp-growth with mapreduce. In 2010 IEEE youth conference on information, computing and telecommunications, pp. 243–246. IEEE
Deng, Lingling, and Yuansheng Lou (2015) Improvement and research of FP-growth algorithm based on distributed spark." In 2015 International Conference on Cloud Computing and Big Data (CCBD), pp. 105–108. IEEE
Zaki, Mohammed J., and Karam Gouda (2003) Fast vertical mining using diffsets." In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 326–335
Liu, J., Yongsheng, Wu., Zhou, Q., Fung, B., Chen, F., Binxiao, Yu.: Parallel eclat for opportunistic mining of frequent itemsets. In Database and Expert Systems Applications, Springer, Cham (2015)
Book Google Scholar
Singh, P., Sudhakar Singh, P.K., Mishra, and Rakhi Garg,: RDD-Eclat: approaches to parallelize Eclat algorithm on spark RDD framework. In: International Conference on Computer Networks and Inventive Communication Technologies, pp. 755–768. Springer, Cham (2019)
Google Scholar
Zhang, C., Tian, P., Zhang, X., Jiang, Z.L., Yao, L., Wang, X.: Fast eclat algorithms based on minwise hashing for large scale transactions. IEEE Internet Things J. 6(2), 3948–3961 (2018)
Article Google Scholar
Lin, Xueyan (2014) Mr-apriori Association rules algorithm based on mapreduce. In 2014 IEEE 5th international conference on software engineering and service science, pp. 141–144. IEEE
Lin, Ming-Yen, Pei-Yu Lee, and Sue-Chen Hsueh (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of the 6th international conference on ubiquitous information management and communication, pp. 1–8
Chon, K.-W., Kim, M.-S.: BIGMiner: a fast and scalable distributed frequent pattern miner for big data. Clust. Comput. 21(3), 1507–1520 (2018)
Article Google Scholar
Chen, Zhuang, Shibang Cai, Qiulin Song, and Chonglai Zhu (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 1908–1911. IEEE,
Xun, Y., Zhang, J., Qin, X.: Fidoop: Parallel mining of frequent itemsets using mapreduce. IEEE transac Sys, Man, and Cybernetics 46(3), 313–325 (2015)
Article Google Scholar
Moens, Sandy, Emin Aksehirli, and Bart Goethals (2013) Frequent itemset mining for big data. In 2013 IEEE international conference on big data, 111–118. IEEE
Zhang, F., Liu, M., Gui, F., Shen, W., Shami, A., Ma, Y.: A distributed frequent itemset mining algorithm using spark for big data analytics. Clust. Comput. 18(4), 1493–1501 (2015)
Article Google Scholar
Qiu, Hongjian, Rong Gu, Chunfeng Yuan, and Yihua Huang (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pp. 1664–1671. IEEE
Rathee, Sanjay, Manohar Kaul, and Arti Kashyap (2015) R-Apriori: an efficient apriori based algorithm on spark. In Proceedings of the 8th workshop on Ph. D. Workshop in information and knowledge management, pp. 27–34
Sethi, K.K., Ramesh, D.: HFIM a Spark-based hybrid frequent itemset mining algorithm for big data processing. J. Supercomput. 73(8), 3652–3668 (2017)
Article Google Scholar
Rathee, S., Kashyap, A.: Adaptive-Miner: an efficient distributed association rule mining algorithm on Spark. J. Big Data 5(1), 1–17 (2018)
Article Google Scholar
Raj, S., Dharavath Ramesh, M., Sreenu, and Krishan Kumar Sethi,: EAFIM Efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowledge Informat Syst 62(9), 3565–3583 (2020)
Article Google Scholar
Shi, Xiujin, Shaozong Chen, and Hui Yang (2017) DFPS: Distributed FP-growth algorithm based on Spark." In 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp. 1725–1731. IEEE
Xiao, W., Juan, Hu.: SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming. J. Supercomput. 76(10), 7619–7634 (2020)
Article Google Scholar
Yang, Shaosong, Guoyan Xu, Zhijian Wang, and Fachao Zhou (2015) The parallel improved Apriori algorithm research based on Spark In 2015 Ninth International Conference on Frontier of Computer Science and Technology, pp. 354–359. IEEE
Frequent Itemset Mining Dataset repository. February, 2021, http://fimi.uantwerpen.be/data/
SPMF: An Open-Source Data Mining Library, February, 2021. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
UCI Machine Learning Repository, February, 2021. https://archive.ics.uci.edu/ml/datasets.php

Download references

Acknowledgements

The authors would like to express their heartiest thanks to the Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for their research support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, 826004, Jharkhand, India
Shashi Raj & Dharavath Ramesh

Authors

Shashi Raj
View author publications
You can also search for this author in PubMed Google Scholar
Dharavath Ramesh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dharavath Ramesh.

Ethics declarations

Conflict of interest

The authors declare that they do not have any commercial or associative interest representing a conflict of interest connected with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Raj, S., Ramesh, D. PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique. Cluster Comput 25, 4463–4480 (2022). https://doi.org/10.1007/s10586-022-03673-5

Download citation

Received: 12 December 2021
Revised: 10 June 2022
Accepted: 21 June 2022
Published: 03 August 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10586-022-03673-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique

Abstract

Access this article

Similar content being viewed by others

Applying Partition Method to Adopt Spark-Based Eclat Algorithm for Large Transactional Datasets

A Spark-based Apriori algorithm with reduced shuffle overhead

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique

Abstract

Access this article

Similar content being viewed by others

Applying Partition Method to Adopt Spark-Based Eclat Algorithm for Large Transactional Datasets

A Spark-based Apriori algorithm with reduced shuffle overhead

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation