On Efficient Mining of Frequent Itemsets from Big Uncertain Databases
In the current era of information, communication, and technology the data is being generated at an exponential rate. This provides machine learning and data mining algorithms an opportunity to learn from huge data repositories. However, at the same time, the big data poses many challenges. Data uncertainty being the key concern of the modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to assist the decision makers in understanding the non-trivial data trends. The usual technique utilized to find frequent itemsets from uncertain databases is known as the Possible Word Semantics (PWS). However, as the database size increases, PWS suffers from performance issues. Therefore, there is a need for efficient frequent pattern mining algorithms. This work presents three techniques to address the issue at hand, namely: 3D linked array-based strategy, connected tree technique, and average probability-based setup with the support of a tree data structure. The objective here is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database only once and stores the support information of the item and its association with other items within the 3D array. For the tree-based method, 1D array is associated with each node of the tree, comprising of support information of the database items and their associations with other items. The average probability-based approach computes the average probability factor and utilizes it to map the uncertain database to a tree. The current proposal addresses attribute uncertainty as well as the tuple uncertainty to map large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets. The proposal is compared with four recent works done in this domain for uncertain data, namely, mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), prepost + , and uncertain apriori (UApriori). Experiments are performed utilizing four benchmark datasets. The results obtained suggest better performance of the three techniques presented here, while consuming 60% less execution time.
KeywordsFrequent itemsets mining Efficient data structures Uncertain databases
Unable to display preview. Download preview PDF.
The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under GA-1 scheme.
- 3.Cheng, R., Kalashnikov, D.V., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 551–562 (2003)Google Scholar
- 4.Cormode, G., Garofalakis, M.: Sketching probabilistic data streams. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 281–292 (2007)Google Scholar
- 5.Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 47–58 (2007)Google Scholar
- 7.Deshpande, A., Guestrin, C., Madden, S.R., Hellerstein, J.M., Hong, W.: Model-driven data acquisition in sensor networks. In: Proceedings of the Thirtieth international conference on Very large data bases-Volume, vol. 30, pp. 588–599 (2004)Google Scholar
- 14.Huang, J., Antova, L., Koch, C., Olteanu, D.: MayBMS: a probabilistic database management system. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 1071–1074 (2009)Google Scholar
- 16.Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C., Haas, P.J.: MCDB: A Monte Carlo Approach to managing uncertain data. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 687–700 (2008)Google Scholar
- 19.Leung, C.K.S., MacKinnon, R.K.: Fast algorithms for frequent itemset mining from uncertain data. In: IEEE International Conference on Data Mining (ICDM), pp. 893–898 (2014)Google Scholar
- 20.Leung, C.K.S., Mateo, M.A.F., Brajczuk, D.A.: A tree-based approach for frequent pattern mining from uncertain data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 653–661 (2008)Google Scholar
- 21.Li, H., Zhang, N.: Probabilistic maximal frequent itemset mining over uncertain databases. In: International Conference on Database Systems for Advanced Applications, pp. 149–163 (2016)Google Scholar
- 26.Ren, J., Lee, S.D., Chen, X., Kao, B., Cheng, R., Cheung, D.: Naive bayes classification of uncertain data. In: Ninth IEEE International Conference on Data Mining, 2009. ICDM’09, pp. 944–949 (2009)Google Scholar
- 28.Sistla, A.P., Wolfson, O., Chamberlain, S., Dao, S.: Querying the uncertain position of moving objects. In: Temporal databases: research and practice, pp. 310–337 (1998)Google Scholar
- 30.Sun, X., Lim, L., Wang, S.: An approximation algorithm of mining frequent itemsets from uncertain dataset. Int. J. Adv. Comput. Technol. 4(3), 42–49 (2012)Google Scholar
- 31.Swami, D., Sahoo, B.: Storage Size Estimation for Schemaless Big Data Applications: A JSON-based Overview. In: Intelligent Communication and Computational Technologies, pp. 315–323 (2018)Google Scholar
- 32.Tong, W., Leung, C.K., Liu, D., Yu, J.: Probabilistic frequent pattern mining by PUH-mine. In: Asia-Pacific Web Conference, pp. 768–780 (2015)Google Scholar
- 33.van Rijsbergen, C.J.: Information retrieval butterworth (1979)Google Scholar
- 35.Yang, J., Zhang, Y., Wei, Y.: An improved vertical algorithm for frequent itemset mining from uncertain database. In: Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, pp. 355–358 (2017)Google Scholar