On Efficient Mining of Frequent Itemsets from Big Uncertain Databases

  • Ahsan Shah
  • Zahid Halim

Abstract

In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, big data poses many challenges, with data uncertainty being a key concern of modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to assist decision makers in understanding non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues, so there is a need for efficient frequent pattern mining algorithms. This work presents three techniques to address this issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database only once and stores the support information of each item, and its association with other items, within the 3D array. In the tree-based method, a 1D array is associated with each node of the tree, comprising the support information of the database items and their associations with other items. The average probability-based approach computes the average probability factor and uses it to map the uncertain database to a tree. The proposal addresses attribute uncertainty as well as tuple uncertainty when mapping large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets.
The proposal is compared with four recent works in this domain for uncertain data, namely the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed using four benchmark datasets. The results suggest that the three techniques presented here perform better, consuming 60% less execution time.
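
To make the cost of possible-world enumeration concrete, the following Python sketch computes the support of an itemset in two ways. It is only an illustration under stated assumptions, not the authors' 3D linked array or tree structures: the abstract does not define its support measure, so the sketch assumes the expected-support model commonly used for attribute-uncertain data, with independent item probabilities. The toy database `DB` and both function names are hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical toy database (an assumption, not data from the paper):
# each transaction maps an item to its existence probability.
DB = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 0.8},
    {"a": 1.0, "b": 0.2, "c": 0.5},
]


def expected_support(db, itemset):
    """Closed form: sum over transactions of the product of item probabilities.

    Assumes item occurrences are independent. One pass over the database.
    """
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db)


def expected_support_by_worlds(db, itemset):
    """Brute force: enumerate every possible world and weight its support.

    The number of worlds is 2 ** (number of uncertain entries), which is
    why this route becomes impractical as the database grows.
    """
    entries = [(tid, item, p) for tid, t in enumerate(db) for item, p in t.items()]
    wanted = set(itemset)
    total = 0.0
    for mask in product((False, True), repeat=len(entries)):
        weight = 1.0
        world = [set() for _ in db]
        for present, (tid, item, p) in zip(mask, entries):
            weight *= p if present else 1.0 - p
            if present:
                world[tid].add(item)
        support = sum(1 for t in world if wanted <= t)
        total += weight * support
    return total


if __name__ == "__main__":
    for itemset in (("a",), ("a", "b"), ("a", "c")):
        print(itemset, expected_support(DB, itemset),
              expected_support_by_worlds(DB, itemset))
```

Both functions agree on this toy input, but the brute-force version already visits 128 worlds for seven uncertain entries, while the closed form needs a single scan. Single-scan structures such as those described in the abstract aim for the same property at scale.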

Keywords

  • Frequent itemsets mining
  • Efficient data structures
  • Uncertain databases

Notes

Acknowledgments

The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under GA-1 scheme.

Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Department of Computer Science, National University of Computer and Emerging Sciences, Karachi, Pakistan
  2. The Machine Intelligence Research Group (MInG), Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan
