Skip to main content
Log in

A Spark-based high utility itemset mining with multiple external utilities

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

High utility itemset (HUI) mining is a powerful data mining technique to discover profitable patterns. The utility of an item is computed by using two measures named quantity and per-unit profit. All existing HUI mining algorithms consider a single value of external utility (per unit profit) for the entire database. However, the per-unit profit of items might fluctuate over time in many applications. This research introduces three novel strategies to comprise the external utilities of items as input for the HUI mining algorithm. Traditional HUI mining algorithms have been developed for the standalone system and do not fit for big data processing due to the limited computing resources (CPU, memory). Big data are efficiently processed on distributed frameworks like Apache Hadoop, Spark, etc. This paper introduces a distributed HUI mining algorithm named Spark-based Top-k high utility itemset (k-SHUI) miner. We also propose a fair load distribution strategy to divide the search space equally among the cluster nodes. The k-SHUI produces top-k HUIs without the requirement of the minimum utility threshold. We conducted extensive experiments on six real-life datasets to compare the proposed algorithm's performance with the existing algorithm. The experimental results demonstrate that the proposed algorithm outperforms the existing algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise of big data on cloud computing review and open research issues. Inf. Syst. 47, 98–115 (2015)

    Article  Google Scholar 

  2. Chen, C.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)

    Article  Google Scholar 

  3. Arora, S., Bala, A.: A survey: ICT enabled energy efficiency techniques for big data applications. Clust. Comput. 23(2), 775–796 (2020)

    Article  Google Scholar 

  4. Pacheco, P.: Parallel Programming with MPI. Morgan Kaufmann, San Francisco (1997)

    MATH  Google Scholar 

  5. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pp. 59–72 (2007)

  6. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc., Beijing (2012)

    Google Scholar 

  7. Chan, R., Yang, Q., Shen, Y.-D.: Mining high utility itemsets. In: Third IEEE International Conference on Data Mining, 2003. ICDM 2003, pp. 19–26. IEEE (2003)

  8. Yao, H., Hamilton, H.J., Butz, C.J.: A foundational approach to mining itemset utilities from databases. In: Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 482–486. SIAM (2004)

  9. Zhang, C., Han, M., Sun, R., Du, S., Shen, M.: A survey of key technologies for high utility patterns mining. IEEE Access 8, 55798–55814 (2020)

    Article  Google Scholar 

  10. Liu, Y., Liao, W.-K., Choudhary, A.: A two-phase algorithm for fast discovery of high utility itemsets. In: Pacific–Asia Conference on Knowledge Discovery and Data Mining, pp. 689–695. Springer (2005)

  11. Li, Y.-C., Yeh, J.-S., Chang, C.-C.: Isolated items discarding strategy for discovering high utility itemsets. Data Knowl. Eng. 64(1), 198–217 (2008)

    Article  Google Scholar 

  12. Ahmed, C.F., Tanbeer, S.K., Jeong, B.-S., Lee, Y.-K.: Efficient tree structures for high utility pattern mining in incremental databases. IEEE Trans. Knowl. Data Eng. 21(12), 1708–1721 (2009)

    Article  Google Scholar 

  13. Tseng, V.S., Wu, C.-W., Shie, B.-E., Yu, P.S.: Up-growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 253–262. ACM (2010)

  14. Tseng, V.S., Shie, B.-E., Wu, C.-W., Philip, S.Y.: Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Trans. Knowl. Data Eng. 25(8), 1772–1786 (2012)

    Article  Google Scholar 

  15. Yun, U., Ryang, H., Ryu, K.H.: High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates. Expert Syst. Appl. 41(8), 3861–3878 (2014)

    Article  Google Scholar 

  16. Liu, M., Qu, J.: Mining high utility itemsets without candidate generation. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 55–64. ACM (2012)

  17. Fournier-Viger, P., Wu, C.-W., Zida, S., Tseng, V.S.: FHM: faster high utility itemset mining using estimated utility co-occurrence pruning. In: International Symposium on Methodologies for Intelligent Systems, pp. 83–92. Springer (2014)

  18. Krishnamoorthy, S.: Pruning strategies for mining high utility itemsets. Expert Syst. Appl. 42(5), 2371–2381 (2015)

    Article  Google Scholar 

  19. Zida, S., Fournier-Viger, P., Lin, J.C.-W., Wu, C.-W., Tseng, V.S.: EFIM: a highly efficient algorithm for high-utility itemset mining. In: Mexican International Conference on Artificial Intelligence, pp. 530–546. Springer (2015)

  20. Krishnamoorthy, S.: Hminer: Efficiently mining high utility itemsets. Expert Syst. Appl. 90, 168–183 (2017)

    Article  Google Scholar 

  21. Chu, C.-J., Tseng, V.S., Liang, T.: An efficient algorithm for mining high utility itemsets with negative item values in large databases. Appl. Math. Comput. 215(2), 767–778 (2009)

    MATH  Google Scholar 

  22. Lan, G.-C., Hong, T.-P., Huang, J.-P., Tseng, V.S.: On-shelf utility mining with negative item values. Expert Syst. Appl. 41(7), 3450–3459 (2014)

    Article  Google Scholar 

  23. Lin, J.C.-W., Fournier-Viger, P., Gan, W.: FHN: an efficient algorithm for mining high-utility itemsets with negative unit profits. Knowl. Based Syst. 111, 283–298 (2016)

    Article  Google Scholar 

  24. Fournier-Viger, P., Zida, S.: FOSHU: faster on-shelf high utility itemset mining—with or without negative unit profit. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp. 857–864 (2015)

  25. Wu, C.W., Shie, B.-E., Tseng, V.S., Yu, P.S.: Mining top-k high utility itemsets. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 78–86 (2012)

  26. Ryang, H., Yun, U.: Top-k high utility pattern mining with effective threshold raising strategies. Knowl. Based Syst. 76, 109–126 (2015)

    Article  Google Scholar 

  27. Tseng, V.S., Wu, C.-W., Fournier-Viger, P., Philip, S.Y.: Efficient algorithms for mining top-k high utility itemsets. IEEE Trans. Knowl. Data Eng. 28(1), 54–67 (2015)

    Article  Google Scholar 

  28. Duong, Q.-H., Liao, B., Fournier-Viger, P., Dam, T.-L.: An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies. Knowl. Based Syst. 104, 106–122 (2016)

    Article  Google Scholar 

  29. Krishnamoorthy, S.: A Comparative Study of Top-K High Utility Itemset Mining Methods, pp. 47–74. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-04921-8

    Book  Google Scholar 

  30. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., Beijing (2015)

    Google Scholar 

  31. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as Part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12), pp. 15–28 (2012)

  32. Lin, Y.C., Wu, C.-W., Tseng, V.S.: Mining high utility itemsets in big data. In: Pacific–Asia Conference on Knowledge Discovery and Data Mining, pp. 649–661. Springer (2015)

  33. Chen, Y., An, A.: Approximate parallel high utility itemset mining. Big Data Res. 6, 26–42 (2016)

    Article  Google Scholar 

  34. Sethi, K.K., Ramesh, D., Sreenu, M.: Parallel high average-utility itemset mining using better search space division approach. In: International Conference on Distributed Computing and Internet Technology, pp. 108–124. Springer (2019)

  35. Sethi, K.K., Ramesh, D., Edla, D.R.: P-fhm+: Parallel high utility itemset mining algorithm for big data processing. Procedia Comput. Sci. 132, 918–927 (2018)

    Article  Google Scholar 

  36. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  37. Man Jr., E.C., Garey, M., Johnson, D.: Approximation algorithms for bin packing: a survey. In: Approximation Algorithms for NP-Hard Problems, pp. 46–93 (1996)

  38. Rymon, R.: Search Through Systematic Set Enumeration, pp 539–550. University of Pennsylvania (1992)

  39. Fournier-Viger, P., Gomariz, A., Gueniche, T., Soltani, A., Wu, C.-W., Tseng, V.S.: SPMF: a Java open-source pattern mining library. J. Mach. Learn. Res. 15(1), 3389–3393 (2014)

    MATH  Google Scholar 

Download references

Acknowledgements

This research work was supported by the Indian Institute of Technology (Indian School of Mines), Dhanbad, Government of India. The authors wish to express their gratitude and heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for their research support.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dharavath Ramesh.

Ethics declarations

Conflict of interest

The authors do not have any conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sethi, K.K., Ramesh, D. & Trivedi, M.C. A Spark-based high utility itemset mining with multiple external utilities. Cluster Comput 25, 889–909 (2022). https://doi.org/10.1007/s10586-021-03442-w

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-021-03442-w

Keywords

Navigation