Data Mining and Knowledge Discovery, Volume 28, Issue 1, pp 238–263

A statistical significance testing approach to mining the most informative set of patterns

  • Jefrey Lijffijt
  • Panagiotis Papapetrou
  • Kai Puolamäki

Abstract

Hypothesis testing with constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains the most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks, which we demonstrate on several examples. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data, and to formulate data mining problems in terms of statistical significance.
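
The abstract does not spell out how a p value is obtained from a null model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a toy null model (the rows in which one item occurs are shuffled while its frequency stays fixed), a co-occurrence count as the test statistic, and the common (r + 1) / (n + 1) estimator for an empirical p value. All function names are hypothetical.

```python
import random


def empirical_p_value(observed, null_samples):
    """One-sided empirical p value, (r + 1) / (n + 1), where r is the number
    of null samples whose statistic is at least as large as the observed one."""
    r = sum(1 for s in null_samples if s >= observed)
    return (r + 1) / (len(null_samples) + 1)


def cooccurrence(col_a, col_b):
    """Test statistic: number of rows in which both items are present."""
    return sum(a and b for a, b in zip(col_a, col_b))


def permutation_test(col_a, col_b, n_samples=999, seed=0):
    """Assumed null model for this sketch: the frequency of item B is fixed,
    but the rows in which it occurs are shuffled independently of item A."""
    rng = random.Random(seed)
    observed = cooccurrence(col_a, col_b)
    shuffled = list(col_b)
    null_stats = []
    for _ in range(n_samples):
        rng.shuffle(shuffled)
        null_stats.append(cooccurrence(col_a, shuffled))
    return empirical_p_value(observed, null_stats)


if __name__ == "__main__":
    # Two items that always occur together in a 20-row binary data set.
    item_a = [1] * 10 + [0] * 10
    item_b = [1] * 10 + [0] * 10
    print(permutation_test(item_a, item_b))
```

A small p value here indicates that the observed co-occurrence is unlikely under the assumed null model. Constraining the null model with patterns that are already known would, in the spirit of the abstract, make those patterns unsurprising and raise the p value.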

Keywords

Data mining algorithms, Pattern mining, Statistical significance testing

Notes

Acknowledgments

We thank Mikael Fortelius and Jussi Eronen for useful discussions and the anonymous reviewers for their helpful feedback. This work was supported by the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Academy of Finland (Project 129282).

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Jefrey Lijffijt (1)
  • Panagiotis Papapetrou (1, 2)
  • Kai Puolamäki (1, 3)

  1. Department of Information and Computer Science, Aalto University, Aalto, Finland
  2. Department of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London, UK
  3. Finnish Institute of Occupational Health, Topeliuksenkatu, Helsinki, Finland
