Discovering Significant Patterns

An Erratum to this article is available

Abstract

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy user-specified constraints. Because so many patterns are considered, these techniques run an extreme risk of type-1 error, that is, of finding patterns that satisfy the constraints on the sample data due to chance alone. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, yield large numbers of patterns that are rejected when subjected to sound statistical evaluation. The studies also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
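To make the idea of bounding experimentwise error concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not say which tests or corrections are used, so the one-sided Fisher exact test and the Bonferroni and Holm adjustments below are assumptions chosen because they are standard; all function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability, under independence with fixed margins,
    of a top-left count at least as large as the observed `a`.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


def bonferroni(p_values, alpha=0.05, n_tests=None):
    """Accept a pattern only if its p-value is at most alpha / n_tests.

    `n_tests` should count every pattern considered during the search,
    not only those reported; it defaults to len(p_values).
    """
    m = n_tests if n_tests is not None else len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; bounds experimentwise error by alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    accepted = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as smaller p-values are accepted.
        if p_values[i] > alpha / (m - rank):
            break  # every larger p-value is rejected as well
        accepted[i] = True
    return accepted


# Hypothetical p-values for four candidate patterns.
p_values = [0.001, 0.015, 0.04, 0.30]
print(bonferroni(p_values))  # [True, False, False, False]
print(holm(p_values))        # [True, True, False, False]
```

In this illustration Holm accepts one more pattern than Bonferroni at the same experimentwise level, which is one example of how a choice about the testing procedure can change power.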

Author information

Correspondence to Geoffrey I. Webb.

Additional information

Editor: Johannes Fürnkranz.

An erratum to this article can be found at http://dx.doi.org/10.1007/s10994-008-5045-y

About this article

Cite this article

Webb, G.I. Discovering Significant Patterns. Mach Learn 68, 1–33 (2007). https://doi.org/10.1007/s10994-007-5006-x

Keywords

  • Pattern discovery
  • Statistical evaluation
  • Association rules