Abstract
Hypothesis testing with constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data, and to formulate data mining problems in terms of statistical significance.
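As a rough illustration of the two ingredients named in the abstract (a p value estimated from a null model, and a greedy choice of patterns), the following minimal sketch may help. It is not the authors' implementation. The function names, the toy scoring function, and the choice to greedily add the pattern that raises the global p value the most (so that the data looks least surprising once the chosen patterns are taken as known) are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence


def empirical_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """One-sided empirical p value with the common +1 correction.

    `null_stats` holds the test statistic computed on samples drawn
    from the (constrained) null model.
    """
    at_least_as_extreme = sum(1 for s in null_stats if s >= observed)
    return (at_least_as_extreme + 1) / (len(null_stats) + 1)


def greedy_select(
    candidates: Sequence[str],
    p_value: Callable[[FrozenSet[str]], float],
    k: int,
) -> List[str]:
    """Pick up to k patterns, one at a time.

    Assumption: at each step we add the candidate that maximizes the
    global p value of the data under a null model constrained by the
    patterns chosen so far plus that candidate.
    """
    chosen: List[str] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: p_value(frozenset(chosen + [c])))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical stand-in for "sample from the constrained null model and
# compute a p value": each pattern contributes a fixed, made-up amount.
TOY_GAIN = {"a": 0.1, "b": 0.3, "c": 0.2}


def toy_p_value(patterns: FrozenSet[str]) -> float:
    return min(1.0, 0.05 + sum(TOY_GAIN[p] for p in patterns))


selected = greedy_select(["a", "b", "c"], toy_p_value, k=2)
```

In a real setting, `toy_p_value` would be replaced by a routine that draws randomized data sets satisfying the chosen constraints and passes the resulting statistics to `empirical_p_value`.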
Acknowledgments
We thank Mikael Fortelius and Jussi Eronen for useful discussions and the anonymous reviewers for their helpful feedback. This work was supported by the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Academy of Finland (Project 129282).
Additional information
Communicated by Bart Goethals.
About this article
Cite this article
Lijffijt, J., Papapetrou, P. & Puolamäki, K. A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Disc 28, 238–263 (2014). https://doi.org/10.1007/s10618-012-0298-2
DOI: https://doi.org/10.1007/s10618-012-0298-2