HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Sethi, Krishan Kumar; Ramesh, Dharavath

doi:10.1007/s11227-017-1963-4

HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Published: 30 January 2017

Volume 73, pages 3652–3668, (2017)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

1369 Accesses
47 Citations
4 Altmetric
Explore all metrics

Abstract

Frequent itemset mining is one of the data mining techniques applied to discover frequent patterns, used in prediction, association rule mining, classification, etc. Apriori algorithm is an iterative algorithm, which is used to find frequent itemsets from transactional dataset. It scans complete dataset in each iteration to generate the large frequent itemsets of different cardinality, which seems better for small data but not feasible for big data. The MapReduce framework provides the distributed environment to run the Apriori on big transactional data. However, MapReduce is not suitable for iterative process and declines the performance. We introduce a novel algorithm named Hybrid Frequent Itemset Mining (HFIM), which utilizes the vertical layout of dataset to solve the problem of scanning the dataset in each iteration. Vertical dataset carries information to find support of each itemsets. Moreover, we also include some enhancements to reduce number of candidate itemsets. The proposed algorithm is implemented over Spark framework, which incorporates the concept of resilient distributed datasets and performs in-memory processing to optimize the execution time of operation. We compare the performance of HFIM with another Spark-based implementation of Apriori algorithm for various datasets. Experimental results show that the HFIM performs better in terms of execution time and space consumption.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Article 07 April 2020

A Spark-based Apriori algorithm with reduced shuffle overhead

Article 27 March 2020

A Review of Scalable Algorithms for Frequent Itemset Mining for Big Data Using Hadoop and Spark

References

Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Infor Syst 47:98–115
Article Google Scholar
Philip Chen CL, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275:314–347
Article Google Scholar
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Elsevier, New York
MATH Google Scholar
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding VLDB ’94 of 20th International Conference Very Large Data Bases, vol 1215, pp 487–499
Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Record 27(2):85–93
Article Google Scholar
Pacheco PS (1997) Parallel programming with MPI. Morgan Kaufmann, San Francisco
MATH Google Scholar
Apache Hadoop [Online] Available: http://hadoop.apache.org. Accessed 22 Feb 2015
Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: Distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS Operative System Review pp 59–72
Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media, Inc
Apache Spark [Online]. Available: http://spark.apache.org/
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Article Google Scholar
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Stoica I (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association
http://www.openstack.org
http://cassandra.apache.org
Luper D, Cameron D, Miller J, Arabnia HR (2007) Spatial and Temporal Target Association through Semantic Analysis and GPS Data Mining. In: IKE (vol 7, pp 25–28)
Jafri R, Ali SA, Arabnia HR, Fatima S (2014) Computer vision-based object recognition for the visually impaired in an indoors environment: a survey. Vis Comput 30(11):1197–1222
Article Google Scholar
Arabnia HR, Fang WC, Lee C, Zhang Y (2010) Context-aware middleware and intelligent agents for smart environments. IEEE Intell Syst 25(2):10–11
Article Google Scholar
Ter Mors A, Valk J, Witteveen C, Arabnia HR, Mun Y (2004) Coordinating autonomous planners
Jafri R, Arabnia HR (2008) Fusion of face and gait for automatic human recognition. In: IEEE Fifth International Conference on Information Technology: New Generations, ITNG 2008 (pp 167–173)
Rahbarinia B, Pedram MM, Arabnia HR, Alavi Z (2010) A multi-objective scheme to hide sequential patterns. In: IEEE the 2nd International Conference on Computer and Automation Engineering (ICCAE), 2010 (vol 1, pp 153–158)
Jafri R, Ali SA, Arabnia HR (2013) Computer vision-based object recognition for the visually impaired using visual tags. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013: Las Vegas, USA), pp 400–406
Ye Y, Chiang CC (2006) A parallel Apriori algorithm for frequent itemsets mining. In Proceedings of Fourth International Conference Software Engineering Research Management and applications SERA 2006:87–94
Google Scholar
Bodon F (2010) A fast apriori implementation. In: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), vol 90
Bodon F (2004) Surprising Results of Trie-based FIM Algorithms. FIMI
Lin MY, Lee PY, Hsueh SC (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of 6th International Conference Ubiquitous Information Management Communication–ICUIMC ’12. 1
Li N, Zeng L, He Q, Shi Z (2012) Parallel Implementation of Apriori Algorithm Based on MapReduce. In: ACIS International Conference Software Engineering, Artificial Intelligence Networking and Parallel/Distributed Computing, pp 236–241
Yu Run-Ming et al (2014) An efficient Frequent Patterns Mining Algorithm based on MapReduce Framework, Software Intelligence Technologies and Applications & International Conference on Frontiers of Internet of Things
Moens S, Aksehirli E, Goethals B (2013) Frequent Itemset Mining for Big Data, 2013 IEEE International Conference Big Data, pp 111–118. doi:10.1109/BigData.6691742
Lin X (2014) MR-Apriori: Association Rules Algorithm Based on MapReduce. In: 5th IEEE International Conference on Software Engineering and Service Science (ICSESS)
Yang XY, Liu Z, Fu Y (2010) MapReduce as a programming model for association rules algorithm on Hadoop. In: IEEE 3rd International Conference on Information Sciences and Interaction Sciences (ICIS)
Qiu H, Gu R, Yuan C, Huang Y (2014) YAFIM: A parallel frequent itemset mining algorithm with Spark. In: Proceedings of International Parallel Distribution Process of Symposium IPDPS, pp 1664–1671
Yang S, Xu G, Wang Z, Zhou F (2015) The Parallel Improved Apriori Algorithm Research Based on Spark. In: Proceedings of 2015 9th International Conference Frontier of Computer Science and Technology FCST 2015, pp 354–359
Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient Apriori based algorithm on Spark. In: Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management. ACM, pp 27–34
Gui F, Ma Y, Zhang F, Liu M, Li F, Shen W, Bai H (2015) A distributed frequent itemset mining algorithm based on Spark. In: IEEE 19th International Conference Computer Supported Cooperative Work Design, vol 18, pp 271–275
Zaki MJ, et al (1997) Parallel algorithms for discovery of association rules. In: Data Mining and Knowledge Discovery 1.4, pp 343–373
Asuncion A, Newman D (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml/. Accessed 4 May 2015
Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center. http://www.almaden.ibm.com/software/quest/Resources/index.shtml. Accessed 4 Nov 2015
Brijs T (2013) Retail market basket data set. In: Workshop on Frequent Itemset Mining Implementations (FIMI’03). http://fimi.ua.ac.be/data/retail.dat. Accessed 12 Nov 2015
Dharavath R et al (2014) An Apriori-Based Vertical Fragmentation Technique for Heterogeneous Distributed Database Transactions. Intelligent Computing, Networking, and Informatics. Springer India, pp 687–695

Download references

Acknowledgements

The authors wish to express their appreciation to the editor and anonymous referees for many helpful suggestions that significantly improve this paper. This research work is supported by Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India. The authors would also like to express their gratitude and heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for providing their research support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, Jharkhand, 826004, India
Krishan Kumar Sethi & Dharavath Ramesh

Authors

Krishan Kumar Sethi
View author publications
You can also search for this author in PubMed Google Scholar
Dharavath Ramesh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dharavath Ramesh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with respect to the above research work.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sethi, K.K., Ramesh, D. HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing. J Supercomput 73, 3652–3668 (2017). https://doi.org/10.1007/s11227-017-1963-4

Download citation

Published: 30 January 2017
Issue Date: August 2017
DOI: https://doi.org/10.1007/s11227-017-1963-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Abstract

Access this article

Similar content being viewed by others

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

A Spark-based Apriori algorithm with reduced shuffle overhead

A Review of Scalable Algorithms for Frequent Itemset Mining for Big Data Using Hadoop and Spark

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Keywords

Navigation

HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Abstract

Access this article

Similar content being viewed by others

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

A Spark-based Apriori algorithm with reduced shuffle overhead

A Review of Scalable Algorithms for Frequent Itemset Mining for Big Data Using Hadoop and Spark

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation