A distributed frequent itemset mining algorithm using Spark for Big Data analytics

Zhang, Feng; Liu, Min; Gui, Feng; Shen, Weiming; Shami, Abdallah; Ma, Yunlong

doi:10.1007/s10586-015-0477-1

Published: 28 October 2015

Cluster Computing, Volume 18, pages 1493–1501 (2015)

Feng Zhang¹,³, Min Liu¹, Feng Gui¹, Weiming Shen², Abdallah Shami³ & Yunlong Ma¹

2250 Accesses
59 Citations

Abstract

Frequent itemset mining is an essential step in association rule mining. In the big data era, conventional approaches to mining frequent itemsets face significant challenges when computing power and memory are limited. This paper proposes an efficient distributed frequent itemset mining algorithm (DFIMA) that significantly reduces the number of candidate itemsets by applying a matrix-based pruning approach. The algorithm is implemented on Spark to further improve the efficiency of iterative computation. Numerical experiments on standard benchmark datasets, comparing DFIMA with an existing algorithm, parallel FP-growth, show that DFIMA achieves better efficiency and scalability. In addition, a case study validates the feasibility of DFIMA.
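The general idea behind matrix-based candidate pruning can be illustrated with a minimal single-machine sketch. This is not DFIMA itself: the abstract does not give the paper's matrix layout, pruning rules, or Spark implementation, so the code below assumes a plain Boolean item-by-transaction matrix and the standard Apriori rule that a candidate is kept only if all of its subsets are frequent. The function names `build_boolean_matrix`, `support`, and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def build_boolean_matrix(transactions, items):
    """Assumed layout: one row per item, one column per transaction."""
    return {item: [item in t for t in transactions] for item in items}


def support(rows):
    """Count the columns (transactions) in which every given row is True."""
    return sum(all(column) for column in zip(*rows))


def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    matrix = build_boolean_matrix(transactions, items)

    # Level 1: keep only the item rows that meet the support threshold.
    current = {}
    for item in items:
        count = support([matrix[item]])
        if count >= min_support:
            current[(item,)] = count
    result = dict(current)

    k = 2
    while current:
        previous = set(current)
        # Candidate generation with Apriori pruning: a k-itemset survives
        # only if every one of its (k-1)-subsets is already frequent.
        candidates = set()
        for a, b in combinations(sorted(previous), 2):
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k and all(
                subset in previous for subset in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: AND the item rows column by column.
        current = {}
        for candidate in candidates:
            count = support([matrix[item] for item in candidate])
            if count >= min_support:
                current[candidate] = count
        result.update(current)
        k += 1
    return result


if __name__ == "__main__":
    demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, count in sorted(frequent_itemsets(demo, min_support=3).items()):
        print(itemset, count)
```

In a distributed setting, the support-counting step is the natural part to parallelize, since each partition of transactions can count its own columns and the partial counts can then be summed. How the paper actually partitions the matrix on Spark is not stated in the abstract.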



Acknowledgments

The research work presented in this paper is partially supported by the Scientific Research Projects of the NSFC (Grant No. 61173015) and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, China
Feng Zhang, Min Liu, Feng Gui & Yunlong Ma
The Key Laboratory of Embedded System and Service Computing, Tongji University, Shanghai, China
Weiming Shen
Department of Electrical and Computer Engineering, Western University, London, ON, N6A 5B9, Canada
Feng Zhang & Abdallah Shami


Corresponding author

Correspondence to Yunlong Ma.


About this article

Cite this article

Zhang, F., Liu, M., Gui, F. et al. A distributed frequent itemset mining algorithm using Spark for Big Data analytics. Cluster Comput 18, 1493–1501 (2015). https://doi.org/10.1007/s10586-015-0477-1


Received: 01 June 2015
Revised: 07 July 2015
Accepted: 17 August 2015
Published: 28 October 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10586-015-0477-1
