Multiple Hypothesis Testing in Pattern Discovery

  • Sami Hanhijärvi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6926)

Abstract

The problem of multiple hypothesis testing arises when more than one hypothesis is tested simultaneously for statistical significance. This situation is very common in data mining applications. For instance, simultaneously assessing the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework so that it can be used in a generic data mining setting. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive). We show the power of our solution on real data.
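
To make the FWER notion concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes independent tests and uses the classical Bonferroni correction as a stand-in baseline. The function names and the example p-values are hypothetical and chosen only for illustration.

```python
def fwer_if_uncorrected(alpha: float, m: int) -> float:
    """FWER when m true null hypotheses are each tested at level alpha.

    Assumes the m tests are independent (an assumption of this sketch).
    """
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni baseline: reject hypothesis i iff p_i <= alpha / m.

    This controls the FWER at level alpha, but it is a generic textbook
    correction, not the procedure described in the abstract.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values, e.g. one per mined pattern.
p_values = [0.0001, 0.004, 0.03, 0.2]

# With 100 uncorrected tests at alpha = 0.05, a false positive is almost certain.
print(round(fwer_if_uncorrected(0.05, 100), 3))

# Per-hypothesis decisions after Bonferroni correction.
print(bonferroni_reject(p_values))
```

The first call illustrates why a correction is needed when many patterns are tested at once; the second shows the per-hypothesis threshold shrinking as the number of hypotheses grows.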

Keywords

multiple hypothesis testing, randomization, significance test, pattern mining

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sami Hanhijärvi (1)
  1. Department of Information and Computer Science, Aalto University, Finland
