Abstract
Frequent Itemset Mining can be performed with sequential algorithms such as Apriori on a standalone system, or with parallel algorithms such as Count Distribution on a distributed system. Because of the communication overhead of parallel algorithms and exponential candidate generation, many algorithms have been developed to compute frequent itemsets over either certain or uncertain databases. However, no algorithm developed so far generates frequent itemsets over a combination of both kinds of database. We previously proposed the MasterApriori algorithm, which computes approximate frequent items for a combination of certain and uncertain databases, using Apriori for the certain database and expected-support-based UApriori for the uncertain database. In this paper, we extend that work by using Poisson and Normal distribution based UApriori for the uncertain database. In the proposed algorithms, the sites where the data is distributed communicate only once, which reduces the communication overhead. The scalability and efficiency of the proposed algorithms are evaluated on standard and synthetic databases. Performance is measured by comparing the time taken and the number of frequent items generated by each algorithm.
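The three support measures named above for the uncertain database can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each uncertain transaction maps items to independent existence probabilities, that the Poisson variant uses the expected support as its rate, and that the Normal variant uses the mean and variance of the Poisson binomial support with a continuity correction. The function names and the sample database are hypothetical.

```python
import math

def transaction_probs(db, itemset):
    """Probability that each transaction contains every item of `itemset`,
    assuming items occur independently within a transaction."""
    probs = []
    for txn in db:
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)  # an absent item has probability 0
        probs.append(p)
    return probs

def expected_support(probs):
    """Expected support: the sum of the per-transaction probabilities."""
    return sum(probs)

def poisson_frequentness(probs, min_count):
    """Approximate P(support >= min_count) with a Poisson distribution
    whose rate is the expected support (an assumption of this sketch)."""
    lam = sum(probs)
    term = math.exp(-lam)  # P(support = 0)
    cdf = 0.0
    for k in range(min_count):
        cdf += term              # add P(support = k)
        term *= lam / (k + 1)    # advance to P(support = k + 1)
    return 1.0 - cdf

def normal_frequentness(probs, min_count):
    """Approximate P(support >= min_count) with a Normal distribution,
    using a continuity correction of 0.5."""
    mu = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    if var == 0.0:
        # Degenerate case: the support is deterministic.
        return 1.0 if mu >= min_count else 0.0
    z = (min_count - 0.5 - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)

# Hypothetical uncertain database: item -> existence probability.
db = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "b": 0.6},
    {"a": 1.0},
    {"a": 0.4, "b": 1.0},
]
probs = transaction_probs(db, ("a", "b"))
```

Under the expected-support measure, an itemset would be reported as frequent when `expected_support(probs)` reaches the minimum support count. Under the two probabilistic measures, it would be reported when the approximated frequentness probability reaches a chosen probability threshold.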
About this article
Cite this article
Wazir, S., Beg, M.M.S. & Ahmad, T. Comprehensive mining of frequent itemsets for a combination of certain and uncertain databases. Int. j. inf. tecnol. 12, 1205–1216 (2020). https://doi.org/10.1007/s41870-019-00310-0
DOI: https://doi.org/10.1007/s41870-019-00310-0