Abstract
High utility itemset (HUI) mining is a data mining technique for discovering profitable patterns. The utility of an item is computed from two measures: its quantity and its per-unit profit. Existing HUI mining algorithms consider a single value of external utility (per-unit profit) for the entire database. In many applications, however, the per-unit profit of an item fluctuates over time. This research introduces three novel strategies for incorporating multiple external utilities of items as input to an HUI mining algorithm. Traditional HUI mining algorithms were also developed for standalone systems and are ill-suited to big data processing because of limited computing resources (CPU, memory). Big data are processed efficiently on distributed frameworks such as Apache Hadoop and Apache Spark. This paper introduces a distributed HUI mining algorithm named the Spark-based top-k high utility itemset (k-SHUI) miner. We also propose a fair load distribution strategy that divides the search space evenly among the cluster nodes. k-SHUI produces the top-k HUIs without requiring a minimum utility threshold. We conducted extensive experiments on six real-life datasets to compare the performance of the proposed algorithm with that of existing algorithms. The experimental results show that the proposed algorithm outperforms them.
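To make the utility computation concrete, the following is a minimal sketch and not the k-SHUI algorithm itself. It assumes one possible way of supplying multiple external utilities, namely a separate per-unit profit table for each time period; the abstract does not say which three strategies the paper uses. The data, the period labels `t1` and `t2`, and the function names are all hypothetical. The top-k search is a brute-force enumeration on a single machine, with no pruning and no Spark distribution.

```python
from itertools import combinations
import heapq

# Hypothetical toy data: each transaction is (period, {item: quantity}).
transactions = [
    ("t1", {"a": 2, "b": 1}),
    ("t1", {"a": 1, "c": 3}),
    ("t2", {"a": 1, "b": 2, "c": 1}),
]

# Assumed representation of multiple external utilities:
# one per-unit profit table per time period.
profit = {
    "t1": {"a": 5, "b": 2, "c": 1},
    "t2": {"a": 4, "b": 3, "c": 1},
}


def itemset_utility(itemset, transactions, profit):
    """Sum quantity * per-unit profit over every transaction containing the itemset."""
    total = 0
    for period, quantities in transactions:
        if all(item in quantities for item in itemset):
            total += sum(quantities[i] * profit[period][i] for i in itemset)
    return total


def top_k_itemsets(k, transactions, profit):
    """Return the k highest-utility itemsets by exhaustive enumeration.

    No minimum utility threshold is needed: the k best candidates are kept.
    This is exponential in the number of items and only suits toy inputs.
    """
    items = sorted({i for _, quantities in transactions for i in quantities})
    candidates = (
        c
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
    )
    scored = ((itemset_utility(c, transactions, profit), list(c)) for c in candidates)
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    for utility, itemset in top_k_itemsets(3, transactions, profit):
        print(itemset, utility)
```

In this toy input, item `a` contributes 5 per unit in period `t1` but 4 per unit in `t2`, so the same purchased quantity yields a different utility depending on when the transaction occurred. A practical miner would replace the exhaustive enumeration with pruning and would partition the search space across cluster nodes.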
Acknowledgements
This research work was supported by the Indian Institute of Technology (Indian School of Mines), Dhanbad, Government of India. The authors wish to express their gratitude and heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for their research support.
Ethics declarations
Conflict of interest
The authors do not have any conflict of interest.
About this article
Cite this article
Sethi, K.K., Ramesh, D. & Trivedi, M.C. A Spark-based high utility itemset mining with multiple external utilities. Cluster Comput 25, 889–909 (2022). https://doi.org/10.1007/s10586-021-03442-w
DOI: https://doi.org/10.1007/s10586-021-03442-w