A statistical significance testing approach to mining the most informative set of patterns

Lijffijt, Jefrey; Papapetrou, Panagiotis; Puolamäki, Kai

doi:10.1007/s10618-012-0298-2

Published: 19 December 2012

Data Mining and Knowledge Discovery, volume 28, pages 238–263 (2014)


Abstract

Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.
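The idea of greedily choosing patterns until the data is no longer significant under a constrained null model can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the article: the data is a list of numbers, a "pattern" is a set of positions whose observed values are held fixed, the null model shuffles the remaining positions, the test statistic is a simple trend score, and the names `trend_statistic`, `sample_null`, `empirical_p_value` and `greedy_select` are hypothetical. The p value uses the common add-one Monte Carlo estimate.

```python
import random


def trend_statistic(values):
    """Toy test statistic (an assumption): large when big values sit late."""
    return sum(i * v for i, v in enumerate(values))


def sample_null(data, fixed, rng):
    """Draw one sample from the constrained null model.

    Positions in `fixed` keep their observed values; all other
    entries are shuffled among the remaining positions.
    """
    free = [i for i in range(len(data)) if i not in fixed]
    shuffled = [data[i] for i in free]
    rng.shuffle(shuffled)
    sample = list(data)
    for i, v in zip(free, shuffled):
        sample[i] = v
    return sample


def empirical_p_value(data, fixed, n_samples=999, seed=0):
    """Add-one Monte Carlo estimate of the global p value."""
    rng = random.Random(seed)
    observed = trend_statistic(data)
    hits = sum(
        trend_statistic(sample_null(data, fixed, rng)) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)


def greedy_select(data, candidates, alpha=0.05):
    """Add the pattern that raises the p value most until p >= alpha."""
    selected, fixed = [], frozenset()
    remaining = list(candidates)
    p = empirical_p_value(data, fixed)
    while remaining and p < alpha:
        best = max(remaining, key=lambda c: empirical_p_value(data, fixed | c))
        remaining.remove(best)
        selected.append(best)
        fixed = fixed | best
        p = empirical_p_value(data, fixed)
    return selected, p


if __name__ == "__main__":
    data = list(range(1, 9))  # perfectly increasing, so the trend is extreme
    patterns = [frozenset(range(4)), frozenset(range(4, 8))]
    chosen, p = greedy_select(data, patterns)
    print(len(chosen), p)
```

In this sketch a pattern "explains" the data to the extent that conditioning the null model on it makes the observed statistic unremarkable; the loop stops as soon as the estimated p value reaches the chosen threshold `alpha` or the candidates run out.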



Acknowledgments

We thank Mikael Fortelius and Jussi Eronen for useful discussions and the anonymous reviewers for their helpful feedback. This work was supported by the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Academy of Finland (Project 129282).

Author information

Authors and Affiliations

Department of Information and Computer Science, Aalto University, P.O. Box 15400, 00076 Aalto, Finland
Jefrey Lijffijt, Panagiotis Papapetrou & Kai Puolamäki
Department of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
Panagiotis Papapetrou
Finnish Institute of Occupational Health, Topeliuksenkatu 41 a A, FI-00025 Helsinki, Finland
Kai Puolamäki

Corresponding author

Correspondence to Jefrey Lijffijt.

Additional information

Communicated by Bart Goethals.

About this article

Cite this article

Lijffijt, J., Papapetrou, P. & Puolamäki, K. A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Disc 28, 238–263 (2014). https://doi.org/10.1007/s10618-012-0298-2

Received: 29 December 2011
Accepted: 21 November 2012
Published: 19 December 2012
Issue Date: January 2014
DOI: https://doi.org/10.1007/s10618-012-0298-2
