Data Mining and Knowledge Discovery

, Volume 30, Issue 4, pp 928–963 | Cite as

Mining significant association rules from uncertain data

Article

Abstract

In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

Keywords

Pattern discovery Association rules Statistical evaluation Uncertain data 

References

  1. Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38Google Scholar
  2. Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216Google Scholar
  3. Agresti A (1992) A survey of exact inference for contingency tables. Stat Sci 7(1):131–153MathSciNetCrossRefMATHGoogle Scholar
  4. Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986Google Scholar
  5. Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246CrossRefMATHGoogle Scholar
  6. Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240CrossRefGoogle Scholar
  7. Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New YorkMATHGoogle Scholar
  8. Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of StatisticsGoogle Scholar
  9. Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276Google Scholar
  10. Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754Google Scholar
  11. Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404Google Scholar
  12. Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75Google Scholar
  13. Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58Google Scholar
  14. Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201CrossRefGoogle Scholar
  15. Fosu GB (2001) Evaluation of population census data through demographic analysis. In: Symposium on global review of 2000 round of population and housing censuses: mid-decade assessment and future prospects. http://unstats.un.org/unsd/demographic/meetings/egm/symposium2001/docs/symposium_11.htm#_Toc7406238. Accessed 22 July 2015
  16. Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143Google Scholar
  17. Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414CrossRefGoogle Scholar
  18. International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1Google Scholar
  19. Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, EurostatGoogle Scholar
  20. Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17CrossRefGoogle Scholar
  21. McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, BaltimoreGoogle Scholar
  22. Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584CrossRefGoogle Scholar
  23. Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134Google Scholar
  24. Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334Google Scholar
  25. Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78Google Scholar
  26. Olson CE (2008) Is 80% accuracy good enough? In: Proceedings of 17th William T. pecora memorial remote sensing symposium. http://www.asprs.org/a/publications/proceedings/pecora17/0026.pdf. Accessed 27 Feb 2014
  27. Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248Google Scholar
  28. Penrose R (1955) A generalized inverse for matrices. Math Proc Cambridge Philos 51:406–413MathSciNetCrossRefMATHGoogle Scholar
  29. Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620Google Scholar
  30. Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349CrossRefGoogle Scholar
  31. Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419Google Scholar
  32. Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571CrossRefGoogle Scholar
  33. Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282Google Scholar
  34. Taussky O (1949) A recurring theorem on determinants. Am Math Mon 56(10):672–676MathSciNetCrossRefMATHGoogle Scholar
  35. The Executive Office for Administration and Finance, Commonwealth of Massachusetts (2012) MassGIS datalayers. http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/layerlist.html. Accessed 26 Sept 2013
  36. Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New YorkGoogle Scholar
  37. Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281Google Scholar
  38. Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33CrossRefGoogle Scholar
  39. Webb GI, Zhang S (2005) \(K\)-optimal rule discovery. Data Min Knowl Disc 10(1):39–79MathSciNetCrossRefGoogle Scholar
  40. Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422CrossRefGoogle Scholar
  41. Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43Google Scholar
  42. Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383Google Scholar
  43. Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274Google Scholar
  44. Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383Google Scholar

Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. 1.Department of Land Surveying and Geo-InformaticsThe Hong Kong Polytechnic UniversityHung Hom, KowloonPeople’s Republic of China
  2. 2.Faculty of Information TechnologyMonash UniversityMelbourneAustralia

Personalised recommendations