Data Mining and Knowledge Discovery, Volume 28, Issue 1, pp 238–263

A statistical significance testing approach to mining the most informative set of patterns

  • Jefrey Lijffijt
  • Panagiotis Papapetrou
  • Kai Puolamäki
Article

Abstract

Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains the most about the data in terms of a global p value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.

Keywords

  • Data mining algorithms
  • Pattern mining
  • Statistical significance testing

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Jefrey Lijffijt (1)
  • Panagiotis Papapetrou (1, 2)
  • Kai Puolamäki (1, 3)
  1. Department of Information and Computer Science, Aalto University, Aalto, Finland
  2. Department of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London, UK
  3. Finnish Institute of Occupational Health, Topeliuksenkatu, Helsinki, Finland
