Machine Learning, Volume 68, Issue 1, pp 1–33

Discovering Significant Patterns

Abstract

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Because so many patterns are considered, these techniques suffer from an extreme risk of type-1 error, that is, of finding patterns that satisfy the constraints on the sample data due to chance alone. This paper proposes techniques that overcome this problem by applying well-established statistical practices, allowing the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. The studies also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
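
To make the multiple-testing issue concrete, the following is a minimal sketch (not the algorithm proposed in the paper) of what happens when every candidate rule mined from purely random data is tested for significance: at an uncorrected significance level, many rules appear "significant" by chance, whereas a Bonferroni-style adjustment, one well-established way of bounding the experimentwise error rate, suppresses nearly all of them. The data size, number of items, significance level, and the use of Fisher's exact test here are illustrative assumptions, not details taken from the paper.

    # Illustrative sketch of the multiple-comparisons problem in rule discovery.
    # Values such as n_rows, n_items and alpha are arbitrary choices for this demo.
    import itertools
    import numpy as np
    from scipy.stats import fisher_exact

    rng = np.random.default_rng(0)
    n_rows, n_items, alpha = 1000, 12, 0.05
    # Purely random binary "transactions": no true associations exist.
    data = rng.integers(0, 2, size=(n_rows, n_items)).astype(bool)

    # Candidate rules: single-item antecedent -> single-item consequent.
    candidates = list(itertools.permutations(range(n_items), 2))
    p_values = []
    for a, c in candidates:
        # 2x2 contingency table of antecedent presence vs consequent presence.
        table = np.array([
            [np.sum(data[:, a] & data[:, c]), np.sum(data[:, a] & ~data[:, c])],
            [np.sum(~data[:, a] & data[:, c]), np.sum(~data[:, a] & ~data[:, c])],
        ])
        _, p = fisher_exact(table)
        p_values.append(p)

    raw_hits = sum(p < alpha for p in p_values)
    # Bonferroni adjustment: divide alpha by the number of hypotheses tested,
    # which bounds the probability of any false discovery (experimentwise error).
    adjusted_hits = sum(p < alpha / len(candidates) for p in p_values)
    print(f"{len(candidates)} candidate rules tested on random data")
    print(f"'significant' at uncorrected alpha={alpha}: {raw_hits}")
    print(f"'significant' after Bonferroni correction: {adjusted_hits}")

Run on random data, the uncorrected test typically flags several spurious rules (roughly alpha times the number of candidates), while the corrected test flags few or none.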

Keywords

Pattern discovery, Statistical evaluation, Association rules

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Geoffrey I. Webb, Faculty of Information Technology, Monash University, Clayton, Australia
