A Tutorial on Statistically Sound Pattern Discovery

Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries -- patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and summary of the state-of-the-art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.


Introduction
Statistically sound pattern discovery is an emerging trend in data mining with strong emphasis on statistical validity. Traditional pattern discovery can be defined as a process of exploring a space of possible patterns to determine which are present in a set of reference data (Agrawal et al, 1993;Cooley et al, 1997;Rigoutsos and Floratos, 1998;Kim et al, 2010). The emphasis has been on efficient search algorithms and computationally well-behaving pattern types, and little attention has been paid to the statistical validity of patterns. Conversely, in statistically sound pattern discovery, the first priority is to find genuine patterns that are likely to reflect properties of the underlying population and hold also in future data. Often the pattern types are also different, because they have been dictated by the needs of application fields rather than computational properties.
The problem of statistically sound pattern discovery is illustrated in Fig. 1. Usually, the analyst has a sample of data drawn from some population of interest. This sample is typically only a very small proportion of the total population of interest and may contain noise. The pattern discovery tool is applied to this sample, finding some set of patterns. It is unrealistic to expect this set of discovered patterns to directly match the ideal patterns that would be found by direct analysis of the real population rather than a sample thereof. Indeed, it is clear that in at least some cases, the application of naive techniques results in the majority of patterns found being only spurious artifacts. An extreme example of this problem arises with the popular minimum support and minimum confidence technique (Agrawal et al, 1993) when applied to the well-known Covtype benchmark dataset from the UCI repository (Lichman, 2013). The minimum support and minimum confidence technique seeks to find the frequent positive associations in data. For the Covtype dataset, the top 197,183,686 rules found by minimum support and minimum confidence are in fact negative associations (Webb, 2006). This gives rise to the suggestion that the oft cited problem of pattern discovery finding unmanageably large numbers of patterns is largely due to standard techniques returning results that are dominated by spurious patterns (Webb, 2011).
There is a rapidly growing body of pattern discovery techniques being developed in the data science community that utilize statistics to control the risk of such spurious discoveries. In this tutorial paper, which grew out of tutorials presented at ECMLPKDD-2014 and KDD-14, we introduce the relevant statistical theory and key techniques for statistically sound pattern discovery. We concentrate on pattern types that express statistical dependencies between categorical attributes, such as dependency rules (dependencies between condition and consequent parts) and dependency sets (mutual dependencies between set elements). However, the same principles of statistical significance testing apply to other pattern types as well. In addition, there are many applications that involve statistical dependencies between other pattern types, like dependencies between frequent episodes or between subgraphs and classes.
To keep the scope manageable, we do not describe actual search algorithms but merely the statistical techniques that are employed during the search. However, we provide several new results that help to detect spurious and superfluous patterns and facilitate efficient search.
We aim at a generic presentation that is not bound to any specific search method, pattern type or school of statistics. Instead, we try to clarify alternative interpretations of statistical dependence and the underlying assumptions on the origins of data, because they often lead to different statistical methods.
In the paper, we have tried to use terminology that is consistent with statistics to make it easy to consult external sources. For this reason, we have avoided some special terms that originated in pattern discovery but have another meaning in statistics or may easily become confused in this context (see Appendix B).
The rest of the paper is organized as follows. In Section 2, we give definitions of various types of statistical dependence and introduce the main principles and approaches of statistical significance testing. In Section 3, we investigate how statistical significance of dependency rules is evaluated under different assumptions. Especially, we contrast two alternative interpretations of dependency rules that are called variable-based and value-based interpretations and introduce appropriate tests in different sampling models. In Section 4, we discuss how to evaluate statistical significance of the improvement of one rule over another one. In Section 5, we survey the key techniques that have been developed for finding different types of statistically significant dependency sets. In Section 6, we discuss the problem of multiple hypothesis testing. We describe the main principles and popular correction methods and then introduce some special techniques for increasing power in the pattern discovery context. Finally, in Section 7, we summarize the main points and present conclusions. The mathematical notations used in this paper are given in Table 1.
Table 1 Mathematical notation used in this paper.
Notation : Meaning
ℕ : natural numbers (including 0)
N_X, N_A, N_XA, N : random variables in ℕ
A, B, C, ... : variables
Dom(A) : domain of variable A
a_1, a_2, a_3, ... ∈ Dom(A) : values of variable A
A_1=a_1, A_2=a_2 : value assignment, where A_1=a_1 and A_2=a_2
X, Y, Z, Q : variable sets (vector-valued variables)
|X| = m : number of variables in set X (its cardinality)
x = (a_1, ..., a_m) : vector of variable values
(X=x) = ((A_1=a_1), ..., (A_m=a_m)) : value assignment of set X={A_1, ..., A_m}, also interpreted as a conjunction of assignments A_i=a_i
I_{X=x} : an indicator variable for X=x; I_{X=x}=1 if X=x, and I_{X=x}=0 otherwise
A, ¬A : shorthand notations for A=1 and A=0, when A is binary
X = ((A_1=1), ..., (A_m=1)) : shorthand notation for a conjunction of positive-valued assignments
¬X = ((A_1=0) ∨ ... ∨ (A_m=0)) : and its complement when X is a set of binary variables
r = ((A_1=a_1), ..., (A_k=a_k)) : a row, record or tuple of data
D = (r_1, ..., r_n) : data set
|D| = n : number of elements in D
P(X=x) : probability of event X=x
fr(X=x) : absolute frequency of event X=x; number of data elements where X=x
φ(X=x → A=a) = P(A=a|X=x) : precision of rule X=x → A=a
γ(X=x, A=a) = P(X=x, A=a) / (P(X=x)P(A=a)) : lift of rule X=x → A=a
δ(X=x, A=a) = P(X=x, A=a) − P(X=x)P(A=a) : leverage of rule X=x → A=a
∆, Γ : random variables for the leverage and lift
H_0, H_A : binary variables for the null hypothesis and an alternative hypothesis
{H_1, ..., H_m} : a set of null hypotheses
p : probability value or a p-value
p_i : observed p-value of H_i
P_i : random variable for the p-value of hypothesis H_i
p_F : p-value defined by Fisher's exact test
p_A, p_X, p_XA, p_A|X : parameter values of discrete probability distributions
M : statistical model, a 'sampling scheme'
T : test statistic
t_i : observed value of the test statistic of H_i
T_i : random variable for the value of the test statistic of H_i
L(·) : likelihood function
MI(·) : mutual information
z : z-score

Preliminaries
In this section, we give definitions of various types of statistical dependence and introduce the main principles and approaches of statistical significance testing.

Statistical dependence
The notion of statistical dependence is equivocal and even the simplest case, dependence between two events, is subject to alternative interpretations. Interpretations of statistical dependence between more than two events or variables are even more varied. In the following, we consider the problem of defining and measuring statistical dependence and introduce the main types of independence and corresponding dependence patterns.

Dependence between two events
Definitions of statistical dependence are usually based on the classical notion of statistical independence between two events. Since events express states of affairs, we express them as variable-value combinations, A=a and B=b.
Definition 1 (Statistical independence between two events) Let A=a and B=b be two events, P(A=a) and P(B=b) their marginal probabilities, and P(A=a, B=b) their joint probability. Events (A=a) and (B=b) are statistically independent, if
$$P(A=a, B=b) = P(A=a)P(B=b). \tag{1}$$
Statistical dependence is seldom defined formally, but in practice, there are two approaches. If dependence is considered as a Boolean property, then any departure from complete independence (Equation (1)) is defined as dependence. Another approach, prevalent in statistical data analysis, is to consider dependence as a continuous property ranging from complete independence to complete dependence. Complete dependence itself is an ambiguous term, but usually it refers to equivalence of events $P(A=a, B=b) = P(A=a) = P(B=b)$ (perfect positive dependence) or mutual exclusion of events $P(A=a, B \neq b) = P(A=a) = P(B \neq b)$ (perfect negative dependence).
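The strength of dependence between two events can be evaluated with several alternative measures. In pattern discovery, two of the most popular measures are leverage and lift.
Leverage measures the difference between the joint probability and its expectation under independence:
$$\delta(A=a, B=b) = P(A=a, B=b) - P(A=a)P(B=b).$$
We note that this is the same as the covariance between binary variables A and B.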
Lift has also been called 'interest' (Brin et al, 1997), 'dependence' (Wu et al, 2004), and 'degree of independence' (Yao and Zhong, 1999). It measures the ratio of the joint probability and its expectation under independence:
$$\gamma(A=a, B=b) = \frac{P(A=a, B=b)}{P(A=a)P(B=b)}.$$
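As a small illustration, leverage (the joint probability minus its expectation under independence) and lift (their ratio) can be estimated from absolute frequencies. The sketch below is only an example: the function name and the counts are hypothetical, and probabilities are approximated by relative frequencies.

```python
def leverage_and_lift(n_ab, n_a, n_b, n):
    """Estimate leverage and lift of events A=a and B=b from absolute frequencies.

    n_ab: rows where both events hold; n_a, n_b: rows where each event holds;
    n: total number of rows. Probabilities are approximated by relative frequencies.
    """
    p_ab, p_a, p_b = n_ab / n, n_a / n, n_b / n
    leverage = p_ab - p_a * p_b    # delta: difference from the expectation under independence
    lift = p_ab / (p_a * p_b)      # gamma: ratio to the expectation under independence
    return leverage, lift


# Hypothetical counts: 1000 rows, A=a in 300, B=b in 400, both in 180.
delta, gamma = leverage_and_lift(n_ab=180, n_a=300, n_b=400, n=1000)
print(f"leverage = {delta:.3f}, lift = {gamma:.2f}")  # leverage = 0.060, lift = 1.50
```

Under exact independence the same counts would give fr(A=a, B=b) = 120, leverage 0 and lift 1.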
If the real probabilities of events were known, the strength of dependence could be determined accurately. However, in practice, the probabilities are estimated from the data. The most common method is to approximate the real probabilities with relative frequencies (maximum likelihood estimates) but other estimation methods are also possible. The accuracy of these estimates depends on how representative and error-free the data is. The size of the data also affects precision, because continuous probabilities are approximated with discrete frequencies. Therefore, it is quite possible that two independent events express some degree of dependence in the data (i.e., $\hat{P}(A=a, B=b) \neq \hat{P}(A=a)\hat{P}(B=b)$, where $\hat{P}$ is the estimated probability, even if P(A=a, B=b) = P(A=a)P(B=b) in the population). In the worst case, two events always co-occur in the data, indicating maximal dependence, even if they are actually independent. To some extent, the probability of such false discoveries can be controlled by statistical significance testing, which is discussed in Subsection 2.2. In the other extreme, two dependent events may appear independent in the data (i.e., $\hat{P}(A=a, B=b) = \hat{P}(A=a)\hat{P}(B=b)$). However, this is not possible if the actual dependence is sufficiently strong (i.e., P(A=a, B=b) = P(A=a) or P(A=a, B=b) = P(B=b)), assuming that the data is error-free. Such missed discoveries are harder to detect, but to some extent, the problem can be alleviated by using powerful methods in significance testing (Subsection 2.2).
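The effect of sample size on estimated dependence can be explored with a short simulation. The following sketch draws two events independently, so the true leverage is zero, and reports the average absolute leverage estimated from relative frequencies. The probabilities 0.3 and 0.4, the sample sizes, and the function names are illustrative assumptions.

```python
import random


def estimated_leverage(n, p_a, p_b, rng):
    """Draw n rows with A and B generated independently; return the estimated leverage."""
    n_a = n_b = n_ab = 0
    for _ in range(n):
        a = rng.random() < p_a   # A and B are drawn independently,
        b = rng.random() < p_b   # so the true leverage is exactly 0
        n_a += a
        n_b += b
        n_ab += a and b
    return n_ab / n - (n_a / n) * (n_b / n)


def mean_abs_leverage(n, repeats, rng):
    """Average absolute estimated leverage over repeated samples of size n."""
    total = sum(abs(estimated_leverage(n, 0.3, 0.4, rng)) for _ in range(repeats))
    return total / repeats


rng = random.Random(0)  # fixed seed only to make the run repeatable
for n in (20, 200, 2000):
    print(n, round(mean_abs_leverage(n, 100, rng), 4))
```

In such runs the apparent dependence shrinks as the sample grows, but it rarely vanishes exactly, which is why a significance test is needed rather than a simple check for non-zero leverage.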

Dependence between two variables
For each variable, we can define several events, which describe its values. If the variable is categorical, it is natural to consider each variable-value combination as a possible event. Then, the independence between two categorical variables can be defined as follows:
Definition 2 (Statistical independence between two variables) Let A and B be two categorical variables, whose domains are Dom(A) and Dom(B). A and B are statistically independent, if for all a ∈ Dom(A) and b ∈ Dom(B)
$$P(A=a, B=b) = P(A=a)P(B=b).$$
Once again, dependence can be defined either as a Boolean property (lack of independence) or a continuous property. However, there is no standard way to measure the strength of dependence between variables. In practice, the measure is selected according to data and modelling purposes. Two commonly used measures are the χ²-measure and mutual information
$$MI(A, B) = \sum_{a \in Dom(A)} \sum_{b \in Dom(B)} P(A=a, B=b) \log \frac{P(A=a, B=b)}{P(A=a)P(B=b)}. \tag{3}$$
If the variables are binary, the notions of independence between variables and the corresponding events coincide. Now independence between any of the four value combinations AB, A¬B, ¬AB, ¬A¬B means independence between variables A and B and vice versa. In addition, the absolute value of leverage is the same for all value combinations and can be used to measure the strength of dependence between binary variables. This is shown in the corresponding contingency table (Table 2). Unfortunately, this handy property does not hold for multivalued variables. Table 3 shows an example contingency table for two three-valued variables where some value combinations are independent and others dependent.
Table 3 An example contingency table where some value combinations of A and B express independence and others dependence. The frequencies are expressed using leverage, δ = δ(A=a_1, B=b_2).
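The mutual information measure and the leverage property of binary variables can be checked numerically. The sketch below assumes a joint distribution given as a dictionary from value pairs to probabilities; the function names, the base-2 logarithm, and the example probabilities are illustrative choices, not part of the original text.

```python
import math


def marginals(joint):
    """joint: dict mapping (a, b) -> probability. Returns marginal dicts for A and B."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return p_a, p_b


def mutual_information(joint):
    """MI(A, B) in bits (base 2 is an assumption); zero cells contribute nothing."""
    p_a, p_b = marginals(joint)
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)


def leverages(joint):
    """Leverage of every value combination (a, b)."""
    p_a, p_b = marginals(joint)
    return {(a, b): joint.get((a, b), 0.0) - p_a[a] * p_b[b]
            for a in p_a for b in p_b}


# Hypothetical joint distribution of two binary variables.
binary = {(1, 1): 0.18, (1, 0): 0.12, (0, 1): 0.22, (0, 0): 0.48}
print(leverages(binary))           # all four values share the same absolute value
print(mutual_information(binary))  # positive, since A and B are dependent
```

For this example the four leverages are 0.06 in absolute value (up to rounding), positive for AB and ¬A¬B and negative for the two mixed combinations.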

Dependence between many events or variables
The notion of statistical independence can be generalized to three or more events or variables in several ways. The most common types of independence are mutual independence, bipartition independence, and conditional independence (see e.g., (Agresti, 2002, p. 318)). In the following, we give general definitions for these three types of independence.
In statistics and probability theory, mutual independence of a set of events is classically defined as follows (see e.g., (Feller, 1968)[p. 128]):
Definition 3 (Mutual independence) Let X = {A_1, ..., A_m} be a set of variables, whose domains are Dom(A_i), i = 1, ..., m. Let a_i ∈ Dom(A_i) notate a value of A_i. A set of events (A_1=a_1, ..., A_m=a_m) is called mutually independent if for all {i_1, ..., i_{m'}} ⊆ {1, ..., m} holds
$$P(A_{i_1}=a_{i_1}, \ldots, A_{i_{m'}}=a_{i_{m'}}) = \prod_{j=1}^{m'} P(A_{i_j}=a_{i_j}).$$
If variables A_i ∈ X are binary, the conjunction of true-valued events (A_1=1, ..., A_m=1) can be expressed as A_1, ..., A_m and the condition for mutual independence reduces to
$$\forall Y \subseteq X: \; P(Y) = \prod_{A_i \in Y} P(A_i).$$
An equivalent condition is to require that for all truth value combinations (a_1, ..., a_m) ∈ {0, 1}^m holds $P(A_1=a_1, \ldots, A_m=a_m) = \prod_{i=1}^{m} P(A_i=a_i)$ (Feller, 1968)[p. 128]. We note that in data mining, this property has sometimes been called independence of binary variables and independence of events has referred to a weaker condition $P(X) = \prod_{A_i \in X} P(A_i)$ (i.e., without requiring independence in $Y \subsetneq X$, e.g., (Silverstein et al, 1998)). Both definitions have been used as a starting point to define interesting set-formed dependency patterns (e.g., (Webb, 2010;Silverstein et al, 1998)). In this paper, we will call this type of pattern dependency sets (Section 5).
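The difference between mutual independence and the weaker product condition on the full set can be seen in a small constructed example. The sketch below uses a hypothetical uniform sample space of eight outcomes and three binary events chosen so that the product rule holds for the full set but fails for a pair. All names are illustrative, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def prob_all_true(joint, subset):
    """P(Y): probability that every variable whose index is in `subset` equals 1."""
    return sum(p for values, p in joint.items() if all(values[i] for i in subset))


def product_rule_holds(joint, subset):
    """Check P(Y) = product of P(A_i) over the variables in `subset`."""
    return prob_all_true(joint, subset) == prod(prob_all_true(joint, (i,)) for i in subset)


def mutually_independent(joint, m):
    """Check the product rule for every subset with at least two of the m variables."""
    return all(product_rule_holds(joint, subset)
               for size in range(2, m + 1)
               for subset in combinations(range(m), size))


# Hypothetical example: outcomes 1..8 are equally likely; A, B, C are events.
A, B, C = {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 6, 7, 8}
joint = {}
for outcome in range(1, 9):
    key = (outcome in A, outcome in B, outcome in C)
    joint[key] = joint.get(key, 0) + Fraction(1, 8)

print(product_rule_holds(joint, (0, 1, 2)))  # True: P(ABC) = 1/8 = P(A)P(B)P(C)
print(mutually_independent(joint, 3))        # False: P(AB) = 3/8, not 1/4
```

Checking every subset is exponential in the number of variables, so a check of this kind is only practical for small sets.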
In addition to mutual independence, a set of events or variables can express independence between different partitions of the set. The only difference to the basic definition of statistical independence is that now single events or variables have been replaced by sets of events or variables. In this paper, we call this type of independence bipartition independence.
Definition 4 (Bipartition independence) Let X be a set of variables. For any partition X=Y ∪ Z, where Y ∩ Z=∅, possible value combinations are notated by y ∈ Dom(Y) and z ∈ Dom(Z).
(i) Event Y=y is independent of event Z=z, if P(Y=y, Z=z) = P(Y=y)P(Z=z).
(ii) Set of variables Y is independent of Z, if P(Y=y, Z=z) = P(Y=y)P(Z=z) for all y ∈ Dom(Y) and z ∈ Dom(Z).
Now one can derive a large number of different dependence patterns from a single set X or event X=x. There are $2^{m-1} - 1$ ways to partition set X, |X| = m, into two subsets Y and Z = X \ Y ($|Y| = 1, \ldots, \lceil \frac{m-1}{2} \rceil$). In data mining, patterns expressing bipartition dependence between sets of events are often expressed as dependency rules Y=y → Z=z. Because both the rule antecedent and consequent are binary conditions, the rule can be interpreted as dependence between two new binary (indicator) variables I_{Y=y} and I_{Z=z} (I_{Y=y}=1 if Y=y and I_{Y=y}=0 otherwise). In statistical terms, this is the same as collapsing a multidimensional contingency table into a simple 2 × 2 table. In addition to statistical dependence, dependency rules are often required to fulfil other criteria like sufficient frequency, strength of dependency or statistical significance. Corresponding patterns between sets of variables are less often studied, because the search is computationally much more demanding. In addition, collapsed contingency tables can reveal interesting and statistically significant dependencies between composed events, when no significant dependencies could be found between variables.
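The counting argument and the collapse to a 2 × 2 table can be illustrated with a short sketch. Here data rows are assumed to be dictionaries from variable names to values; the function names and the example rows are hypothetical.

```python
from itertools import combinations


def bipartitions(variables):
    """Yield each unordered split of `variables` into two non-empty sets (Y, Z)."""
    variables = list(variables)
    first, rest = variables[0], variables[1:]
    # Keeping the first variable in Y avoids producing both (Y, Z) and (Z, Y).
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            y = {first, *chosen}
            z = set(variables) - y
            if z:
                yield y, z


def collapse_to_2x2(rows, y_assignment, z_assignment):
    """Count rows by the indicator variables (I_{Y=y}, I_{Z=z}).

    rows: list of dicts (variable -> value);
    y_assignment, z_assignment: dicts (variable -> required value).
    """
    table = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for row in rows:
        i_y = int(all(row[v] == val for v, val in y_assignment.items()))
        i_z = int(all(row[v] == val for v, val in z_assignment.items()))
        table[(i_y, i_z)] += 1
    return table


print(sum(1 for _ in bipartitions("ABCD")))  # 7, i.e. 2**(4-1) - 1

# Hypothetical data for the rule (A=1, B=1) -> (C=1).
rows = [{"A": 1, "B": 1, "C": 1}, {"A": 1, "B": 0, "C": 1}, {"A": 0, "B": 1, "C": 0}]
print(collapse_to_2x2(rows, {"A": 1, "B": 1}, {"C": 1}))
```

The resulting four counts are all that a significance test for the rule needs, regardless of how many variables appear in the antecedent.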
The third main type of independence is conditional independence between events or variables:
Definition 5 (Conditional independence) Let X be a set of variables. For any partition X = Y ∪ Z ∪ Q, where Y, Z, and Q are pairwise disjoint, possible value combinations are notated by y ∈ Dom(Y), z ∈ Dom(Z), and q ∈ Dom(Q).
(i) Events Y=y and Z=z are conditionally independent given Q=q, if P(Y=y, Z=z | Q=q) = P(Y=y | Q=q)P(Z=z | Q=q).
(ii) Sets of variables Y and Z are conditionally independent given Q, if P(Y=y, Z=z | Q=q) = P(Y=y | Q=q)P(Z=z | Q=q) for all y ∈ Dom(Y), z ∈ Dom(Z), and q ∈ Dom(Q).
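A small constructed example shows that conditional independence does not imply unconditional independence. In the sketch below, Y, Z and Q are single binary variables, and the joint distribution is built (as an assumption for illustration) so that Y and Z are independent within each value of Q. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob(joint, **fixed):
    """Sum the joint over all cells consistent with the fixed values (y=..., z=..., q=...)."""
    names = ("y", "z", "q")
    return sum(p for cell, p in joint.items()
               if all(cell[names.index(k)] == v for k, v in fixed.items()))


def cond_independent_events(joint, y, z, q):
    """Check P(Y=y, Z=z | Q=q) = P(Y=y | Q=q) P(Z=z | Q=q)."""
    p_q = prob(joint, q=q)
    left = prob(joint, y=y, z=z, q=q) / p_q
    right = (prob(joint, y=y, q=q) / p_q) * (prob(joint, z=z, q=q) / p_q)
    return left == right


# Hypothetical model: P(Q=1) = 1/2; given Q=q, Y and Z are independent with
# P(Y=1 | Q=1) = P(Z=1 | Q=1) = 9/10 and P(Y=1 | Q=0) = P(Z=1 | Q=0) = 1/10.
joint = {}
for y, z, q in product((0, 1), repeat=3):
    p1 = Fraction(9, 10) if q == 1 else Fraction(1, 10)
    p_y = p1 if y == 1 else 1 - p1
    p_z = p1 if z == 1 else 1 - p1
    joint[(y, z, q)] = Fraction(1, 2) * p_y * p_z

print(all(cond_independent_events(joint, y, z, q)
          for y, z, q in product((0, 1), repeat=3)))          # True
print(prob(joint, y=1, z=1), prob(joint, y=1) * prob(joint, z=1))  # 41/100 versus 1/4
```

Here Y and Z are conditionally independent given Q, yet clearly dependent when Q is ignored.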
Conditional independence can be defined also for more than two sets of events or variables, given a third one. For example, in set {A, B,C, D}, we can find four conditional independences given D: A ⊥ BC, B ⊥ AC, C ⊥ AB, and A ⊥ B ⊥ C. However, these types of independence are seldom needed in practice. In pattern discovery, notions of conditional independence and dependence between events are used for inspecting improvement of a dependency rule YQ → C over its generalization Y → C (Section 4). In machine learning, conditional independence between variables or sets of variables is an important property for constructing full probability models, like Bayesian networks or log-linear models.

Statistical significance testing
The main idea of statistical significance testing is to estimate the probability that the observed discovery would have occurred by chance. If the probability is very small, we can assume that the discovery is genuine. Otherwise, it is considered spurious and discarded. The probability can be estimated either analytically or empirically. The analytical approach is used in the traditional significance testing, while randomization tests estimate the probability empirically. Traditional significance testing can be further divided into two main classes: the frequentist and Bayesian approaches. These main approaches to statistical significance testing are shown in Figure 2.
The frequentist approach of significance testing is the most commonly used and best studied (see e.g., (Freedman et al, 2007, Ch. 26) or (Lindgren, 1993, Ch. 10.1)). The approach is actually divided into two opposing schools, Fisherian and Neyman-Pearsonian, but most textbooks present a kind of synthesis (see e.g., (Hubbard and Bayarri, 2003)). The main idea is to estimate the probability of the observed or a more extreme phenomenon O under some null hypothesis, H_0. In general, the null hypothesis is a statement on some parameter or parameters, which define the assumed null distribution. For example, when the objective is to test the significance of dependency rule X → A, the null hypothesis H_0 is the independence assumption: N_XA = nP(X)P(A), where N_XA is a random variable for the absolute frequency of XA. (Equivalently, H_0 could be ∆ = 0 or Γ = 1, where ∆ and Γ are random variables for the leverage and lift.) In independence testing, the null hypothesis is usually an equivalence statement, S = s_0 (nondirectional hypothesis), but in other contexts it can also be of the form S ≤ s_0 or S ≥ s_0 (directional hypothesis). Often, one also defines an explicit alternative hypothesis, H_A, which can be either directional or nondirectional. For example, if X → A expresses positive dependence, then it is natural to form a directional hypothesis H_A: N_XA > nP(X)P(A) (or ∆ > 0 or Γ > 1).
When the null hypothesis has been defined, one should select a test statistic T (possibly S itself) and define its distribution (null distribution) under H_0. The p-value is defined from this distribution as the probability of the observed or a more extreme T-value, P(T ≥ t | H_0), P(T ≤ t | H_0), or P(T ≤ −t or T ≥ t | H_0). In the case of independence testing, possible test statistics are, for example, leverage and lift. The distribution under independence is defined according to the selected sampling model, which we will introduce in Section 3. The probability of observing positive dependence whose strength is at least δ(X, A) is P_M(∆ ≥ δ(X, A) | H_0), where P_M is the complementary cumulative distribution function for the assumed sampling model M.
When the p-value is evaluated, one should pay careful attention to whether to perform a one-tailed or a two-tailed test. Unfortunately, there is both confusion and disagreement on their definitions and use. If the null distribution is symmetric and the alternative hypothesis is non-directional, then it is agreed that one should use a two-tailed test (i.e., evaluate P(T ≤ −t or T ≥ t | H_0)). Similarly, if the distribution is two-tailed but the alternative hypothesis is directional, one should perform a one-tailed test (P(T ≥ t | H_0) or P(T ≤ t | H_0)). The problematic case is when the distribution is asymmetric and has only one tail, but the test statistic is non-directional. An example is the χ²-test, which can be called either one-tailed or two-tailed. If the alternative hypothesis is directional (like positive dependence between two events), it is often suggested to halve the corresponding p-value (Yates, 1984).
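For a symmetric null distribution, the three p-value definitions can be compared directly. The sketch below assumes, purely for illustration, that the test statistic is a z-score with a standard normal null distribution; the function names are hypothetical and only the standard library is used.

```python
import math


def normal_sf(z):
    """P(Z >= z) for a standard normal variable (complementary CDF)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def p_values(z):
    """One-tailed and two-tailed p-values for an observed z-score.

    Assumes the test statistic is standard normal under the null hypothesis.
    """
    return {
        "upper": normal_sf(z),                 # P(T >= t | H0)
        "lower": normal_sf(-z),                # P(T <= t | H0)
        "two_tailed": 2 * normal_sf(abs(z)),   # P(T <= -|t| or T >= |t| | H0)
    }


print(p_values(1.96))
```

For z = 1.96 the upper-tail value is about 0.025 and the two-tailed value about 0.05, so the same observation can fall on either side of a fixed threshold depending on which test is chosen.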
Up to this point, all frequentist approaches are more or less in agreement. The differences appear only when the p-values are interpreted. In the classical (Neyman-Pearsonian) hypothesis testing, the p-value is compared to some predefined threshold α. If p ≤ α, the null hypothesis is rejected and the discovery is called significant at level α. Parameter α (also known as the test size) defines the probability of committing a type I error, i.e., accepting a spurious pattern (and rejecting a correct null hypothesis). Another parameter, β, is used to define the probability of committing a type II error, i.e., rejecting a genuine pattern as non-significant (and keeping a false null hypothesis). The complement 1 − β defines the power of the test, i.e., the probability that a genuine pattern passes the test. Ideally, one would like to minimize the test size and maximize its power. Unfortunately, this is not possible, because β increases when α decreases and vice versa. As a solution, it has been recommended (e.g., (Lehmann and Romano, 2005)[p. 57]) to select an appropriate α and then to check that the power is acceptable given the sample size. However, the power analysis can be difficult and all too often it is skipped altogether.
The most controversial problem in hypothesis testing is how to select an appropriate significance level. A common convention is always to use the same standard levels, like α=0.05 or α=0.01. However, these values are quite arbitrary and widely criticized (see e.g., (Lehmann and Romano, 2005)[p. 57], (Lecoutre et al, 2001;Johnson, 1999)). Especially in large data sets, the p-values tend to be very small and hypotheses get too easily rejected with conventional thresholds. A simple alternative is to report only p-values, as advocated by Fisher and also by many recent statisticians (e.g., (Lehmann and Romano, 2005)[pp. 63-65] and (Hubbard and Bayarri, 2003)). Sometimes, this is called 'significance testing' in distinction from 'hypothesis testing' (with fixed αs), but the terms are not used systematically. Reporting only p-values may often be sufficient, but there are still situations where one should make concrete decisions and a binary judgement is needed.
Deciding on threshold α is even harder in data mining, where numerous patterns are tested. For example, if we use threshold α=0.05, then there is a 5% chance that a spurious pattern passes the significance test. If we test 10 000 spurious patterns, we can expect 500 of them to pass the test erroneously. This so-called multiple testing problem is inherent in knowledge discovery, where one often performs an exhaustive search over all possible patterns. We will return to this problem in Section 6.
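The arithmetic behind this example can be checked with a short simulation. The sketch assumes that the p-value of each spurious pattern is uniformly distributed on [0, 1], which holds for a continuous test statistic under a true null hypothesis; the function name and the seed are illustrative.

```python
import random


def count_false_discoveries(num_patterns, alpha, rng):
    """Number of spurious patterns whose p-value is at most alpha.

    Assumes each null p-value is uniform on [0, 1] (continuous test statistic,
    true null hypothesis) and that the tests are independent.
    """
    return sum(rng.random() <= alpha for _ in range(num_patterns))


rng = random.Random(1)  # fixed seed only to make the run repeatable
print("expected:", 10_000 * 0.05)
print("simulated:", count_false_discoveries(10_000, 0.05, rng))
```

The simulated count fluctuates around the expected 500 from run to run, and none of the accepted patterns is genuine.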
The idea of Bayesian significance testing (see e.g., (Lee, 2012, Ch. 4), (Albert, 1997;Jamil et al, 2017)) is quite similar to the frequentist approach, but now we assign some prior probabilities P(H_0) and P(H_A) to the null hypothesis H_0 and the alternative research hypothesis H_A. Next, the conditional probabilities, P(O | H_0) and P(O | H_A), of the observed or a more extreme phenomenon O under H_0 and H_A are estimated from the data. Finally, the probabilities of both hypotheses are updated by Bayes' rule and the acceptance or rejection of H_0 is decided by comparing posterior probabilities P(H_0 | O) and P(H_A | O). The resulting conditional probabilities P(H_0 | O) are asymptotically similar (under some assumptions even identical) to the traditional p-values, but Bayesian testing is sensitive to the selected prior probabilities (Agresti and Min, 2005). One attractive feature of the Bayesian approach is that it allows one to quantify the evidence for and against the null hypothesis. However, the procedure tends to be more complicated than the frequentist one; specifying prior distributions may require a plethora of parameters and the posterior probabilities cannot always be evaluated analytically (Agresti and Hitchcock, 2005;Jamil et al, 2017).
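The update step and its sensitivity to the prior can be sketched for the simplest case of two point hypotheses. The function name and the numbers below are hypothetical; real applications typically involve prior distributions over parameters rather than two fixed likelihood values.

```python
def posterior_null(prior_h0, likelihood_h0, likelihood_ha):
    """P(H0 | O) by Bayes' rule when H0 and HA are the only two hypotheses.

    prior_h0: P(H0); likelihood_h0: P(O | H0); likelihood_ha: P(O | HA).
    """
    prior_ha = 1.0 - prior_h0
    evidence = prior_h0 * likelihood_h0 + prior_ha * likelihood_ha
    return prior_h0 * likelihood_h0 / evidence


# Hypothetical values: the observation is 30 times more likely under HA.
for prior in (0.5, 0.9, 0.99):
    print(prior, round(posterior_null(prior, 0.01, 0.30), 3))
```

With an even prior the posterior probability of the null hypothesis is about 0.03, whereas with a prior of 0.99 it remains about 0.77, which illustrates how strongly the conclusion can depend on the chosen prior.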
A major problem with traditional significance testing is the underlying assumption that the data is a random or otherwise representative sample from the whole population (Smith, 1983). Often, the tests are used even if this assumption is violated (e.g., all available data is analyzed), but in this case, the discoveries describe only the available data set and there are no guarantees that they would hold in the whole population (Finch, 1979). This is especially typical for exploratory data analysis and data mining, in general. In addition, there are problems where random sampling is not possible even in principle, because the population is infinite (e.g., the researcher is interested in the dependence between employees' smoking habits and work efficiency in general, including already dead and still unborn people) (Edgington, 1995, 7-8).
Another restriction concerning many approaches to traditional significance testing is the parametric assumptions on the underlying data distribution. The problem is that the p-value cannot be estimated, unless one has made some distributional assumptions under the null hypothesis. Typically, one selects the distribution from some parametric family like binomial or normal distributions and estimates the missing parameters from the data. In large data sets, the exact form of the distribution is less critical, because sampling distributions often tend to normal, and also the parameters can be estimated more accurately. However, in dependence testing, the most important factor is the assumed distribution of absolute frequencies (random variables N_XA, N_X¬A, N_¬XA, and N_¬X¬A). If fr(X) or fr(A) is small, there are no guarantees that the corresponding random variable would be normally distributed or that the frequency estimate would be accurate.
Both of the above mentioned problems can be solved by randomization testing (see e.g., (Edgington, 1995)). The main idea of randomization tests is to estimate the significance of the discovery empirically, by testing how often the discovery occurs in other data sets that are randomly generated under the null hypothesis. For this purpose, one should design a permutation scheme that implements the null hypothesis but keeps some data properties fixed.
If only a single dependency set X is tested, it is enough to generate random data sets D_1, ..., D_b containing attributes A_i ∈ X. Since the null hypothesis is mutual independence, the null distribution is simulated by permuting values of A_i randomly. Usually, it is required that all marginal probabilities P(A_i) remain the same as in the original data, but there may be additional constraints. Test statistic T that evaluates goodness of the pattern is calculated in each random data set. For simplicity, we assume that the test statistic T is increasing by goodness (a higher value indicates a better pattern). If the original data set produced T-value t_0 and b random data sets produced T-values t_1, ..., t_b, the empirical p-value of the observed pattern is
$$p = \frac{|\{D_j \mid t_j \geq t_0,\ j = 1, \ldots, b\}| + 1}{b + 1}.$$
If the data set is relatively small and X is simple, it is possible to enumerate all possible permutations, where the marginal probabilities hold. This leads to an exact permutation test, which gives an exact p-value. On the other hand, if the data set is large and/or X is more complex, all possibilities cannot be checked, and the empirical p-value is less accurate. In this case, the test is called a random permutation test or an approximate permutation test. There are also some special cases, like testing a single dependency rule, where it is possible to express the permutation test in a closed form that is easy to evaluate exactly (see Fisher's exact test in Subsection 3.2).
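A random permutation test for the dependence between two binary attributes can be sketched as follows. The test statistic is the joint frequency fr(A=1, B=1), which increases with the strength of positive dependence when the marginals are fixed. The empirical p-value adds one to both the count and the number of permutations, a common convention that is assumed here; the function names and the data are hypothetical.

```python
import random


def joint_frequency(col_a, col_b):
    """fr(A=1, B=1): number of rows where both binary attributes are 1."""
    return sum(a and b for a, b in zip(col_a, col_b))


def permutation_p_value(col_a, col_b, num_permutations, rng):
    """Empirical p-value for positive dependence between two 0/1 columns."""
    t0 = joint_frequency(col_a, col_b)
    shuffled = list(col_b)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)   # keeps fr(B) fixed; column A is left untouched
        if joint_frequency(col_a, shuffled) >= t0:
            at_least_as_extreme += 1
    # Assumed convention: count the observed data set itself as one sample.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


# Hypothetical data: 40 rows, fr(A)=20, fr(B)=20, fr(AB)=16 (10 expected under independence).
col_a = [1] * 20 + [0] * 20
col_b = [1] * 16 + [0] * 4 + [1] * 4 + [0] * 16
print(permutation_p_value(col_a, col_b, 999, random.Random(0)))
```

With 999 permutations the smallest attainable p-value is 0.001, so the number of permutations limits the resolution of the empirical p-value.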
The advantage of randomization tests is that the data does not have to be a random sample from a finite population. The reason is that the permutation scheme defines a population, which is randomly sampled, when new data sets are generated. Therefore, the discoveries can be assumed to hold in the population defined by the permutation scheme. In addition, the randomization test approach does not assume any underlying parametric distribution. Still, the classical parametric test statistics can be used to estimate the p-values (Edgington, 1995). With randomization one can also test such null hypotheses for which no closed form test exists. The only critical problem is that it is not always clear how the data should be permuted. The number of random permutations also plays an important role in testing. The more random permutations are performed, the more accurate the empirical p-values are, but in practice, extensive permuting can be too time consuming. Computational costs also restrict the use of randomization testing in search algorithms, especially in large data sets.
The idea of randomization tests can be extended for estimating the overall significance of all mining results or even for tackling the multiple testing problem. For example, Gionis et al (2007) tested the significance of the number of all frequent sets (given a minimum frequency threshold) and the number of all pair-wise correlations among the most frequent attributes (measured by Pearson's correlation coefficient, given a minimum correlation threshold) using randomization tests. In this case, it is necessary to generate complete data sets randomly for testing. The difficulty is to decide what properties of the original data set should be maintained. As a solution, Gionis et al. kept both column marginals ($fr(A_i)$s) and row marginals (numbers of 1s on each row) fixed, and new data sets were generated by swap randomization (Cobb and Chen, 2003). A prerequisite for this method is that the attributes are semantically similar (e.g. occurrence or absence of species) and it is sensible to swap their values. In addition, there are some pathological cases, where no or only a few permutations exist with the given row and column marginals, resulting in a poor p-value, even if the original data set contains a significant pattern (Gionis et al, 2007).

3 Statistical significance of dependency rules
Dependency rules are a famous pattern type that expresses bipartition dependence between the rule antecedent and the consequent. In this section, we discuss how statistical significance of dependency rules is evaluated under different assumptions. Especially, we contrast two alternative interpretations of dependency rules that are called variable-based and value-based interpretations and introduce appropriate tests in different sampling models.

Dependency rules
Dependency rules are maybe the simplest type of statistical dependency patterns. As a result it has been possible to develop efficient exhaustive search algorithms, which makes dependency rule analysis an attractive starting point for any data mining task. Still, dependency rules can reveal arbitrarily complex pairwise dependencies from categorical or discretized numerical data, without any additional assumptions. In medical science, for example, an important task is to search for statistical dependencies between gene alleles, environmental factors, and diseases. We recall that statistical dependencies are not necessarily causal relationships, but still they can help to form causal hypotheses and reveal which factors predispose or prevent diseases. Interesting dependencies do not necessarily have to be strong or frequent, but instead, they should be statistically valid, i.e., genuine dependencies that are likely to hold also in future data. In addition, it is often required that the patterns should not contain any superfluous variables which would only obscure the real dependencies. Based on these considerations, we will first give a general definition of dependency rules and then discuss important aspects of genuine dependencies.
Definition 6 (Dependency rule) Let $R$ be a set of categorical variables, $X \subseteq R$, and $Y \subseteq R \setminus X$. Let us denote value vectors of $X$ and $Y$ by $x \in Dom(X)$ and $y \in Dom(Y)$. Rule $X=x \rightarrow Y=y$ is a dependency rule, if $P(X=x, Y=y) \neq P(X=x)P(Y=y)$.
The dependency is (i) positive, if P(X=x, Y=y) > P(X=x)P(Y=y), and (ii) negative, if P(X=x, Y=y) < P(X=x)P(Y=y). Otherwise, the rule is called an independence rule.
We note that the direction of the rule is a matter of convention, because statistical dependence is a symmetric relation. Often, the rule is expressed in the order where the precision of the rule ($\phi(X=x \rightarrow Y=y) = P(Y=y \mid X=x)$ or $\phi(Y=y \rightarrow X=x) = P(X=x \mid Y=y)$) is maximal.
For simplicity, we will concentrate on a common special case of dependency rules, where 1) all variables are binary, 2) the consequent $Y=y$ consists of a single variable-value combination, $A=i$, $i \in \{0, 1\}$, and 3) the antecedent $X=x$ is a conjunction of true-valued attributes, i.e., $X=x \equiv (A_1=1, \ldots, A_l=1)$, where $X = \{A_1, \ldots, A_l\}$. With these restrictions, the resulting rules can be expressed in a simpler form $X \rightarrow A=i$, where $i \in \{0, 1\}$, or $X \rightarrow A$ and $X \rightarrow \neg A$. Allowing negated consequents means that it is sufficient to represent only positive dependencies (a positive dependency between $X$ and $\neg A$ is the same as a negative dependency between $X$ and $A$). We note that this restriction is purely representational and the following theory is easily extended to general dependency rules as well. Furthermore, we recall that this simpler form of rules can still represent all dependency rules after suitable data transformations (i.e., creating new binary variables for all values of the original variables).
Finally, we note that dependency rules deviate from traditional association rules (Agrawal et al, 1993) in their requirement of statistical dependence. Traditional association rules do not necessarily express any statistical dependence but relations between frequently occurring attribute sets. However, there has been research on association rules where the requirement of minimum frequency has been replaced by requirements of statistical dependence (see e.g., Webb and Zhang, 2005; Webb, 2008, 2007; Hämäläinen, 2012, 2010b; Li, 2006; Morishita and Sese, 2000; Nijssen and Kok, 2006; Nijssen et al, 2009). For clarity, we will use here the term 'dependency rule' for all rule type patterns expressing statistical dependencies, even if they had been called association or classification rules in the original publications.
Statistical dependence is a necessary requirement of a dependency rule, but in addition, it is frequently useful to impose further constraints like that of statistical significance and absence of superfluous variables. The following example demonstrates these two requirements.
Example 1 Let us consider a medical database which contains information on patients, their lifestyle, diseases, medical measurements, and occurrence of certain gene alleles. Table 4 lists some dependency rules related to atherosclerosis from this data. The first two rules are examples of simple positive and negative dependencies (predisposing and protecting factors). Rules 3 and 4 demonstrate the non-monotonicity of statistical dependence. The combination of FH-disease (familial hypercholesterolaemia) and ABCA1-R219K allele protects from atherosclerosis, even if FH-disease alone increases the risk and ABCA1-R219K allele status alone does not protect from atherosclerosis (according to current knowledge, there is either weak positive dependence or independence).
Rule 5 is an example of a spurious rule, which is statistically insignificant and likely due to chance. The database contains only one person who uses spruce sprout extract regularly and who does not have atherosclerosis (the seller of the product himself). On the other hand, rule 6 is statistically significant and likely a genuine rule. Eating dark chocolate is quite common, but it is surprising how healthy these chocoholics are compared to other people.
The last rules illustrate the problem of superfluous variables. In rule 7, neither of condition attributes is superfluous, because the dependency is stronger and more significant than simpler dependencies involving only stress or only smoking. However, rules 8-10 demonstrate three types of superfluous rules, where extra factors i) have no effect on the dependency, ii) weaken it or iii) apparently improve it but not significantly. Rule 8 is superfluous, because coffee has no effect on the dependency between smoking and atherosclerosis. Rule 9 is superfluous, because sports weakens the dependency between high cholesterol and atherosclerosis. Rule 10 is the most difficult to judge, because the dependency is stronger than either of simpler dependencies involving only male gender or male pattern baldness. However, the improvement is not significant, because the data contains only one woman with male pattern baldness and atherosclerosis.
In the previous example, we did not give any measures for the strength or statistical significance of dependencies. The reason is that the selection of these measures as well as superfluousness depend on the interpretation of dependency rules. In principle, there are two alternative interpretations for rule $X \rightarrow A=i$, $i \in \{0, 1\}$: either it can represent a dependency between events $X$ (or $I_X=1$) and $A=i$ or between variables $I_X$ and $A$, where $I_X$ is an indicator variable for event $X$. These two interpretations have sometimes been called value-based and variable-based semantics (Blanchard et al, 2005) of the rule. Unfortunately, researchers have often forgotten to mention explicitly which interpretation they follow. This has caused much confusion and, in the worst case, led to missed or inappropriate discoveries. The following example demonstrates how variable- and value-based interpretations can lead to different results.
Example 2 An apple database describes 100 apples, including their colour (green or red), size (big or small), and taste (sweet or bitter). Let us notate A=sweet, ¬A=bitter, Y=red, ¬Y=green, Q=big, ¬Q=small.
We would like to find dependency rules which are able to classify apples by taste. The simplest such rule is red → sweet (Y → A) with P(A|Y)=0.92, P(¬A|¬Y)=1.0, δ(Y, A)=0.22, and γ(Y, A)=1.67. So, with this rule we can divide the apples into two baskets according to colour (Figure 3). The first basket contains 60 red apples, 55 of which are sweet, and the second basket contains 40 green apples, which are all bitter. This is quite a good rule if the goal is to classify well both sweet apples (for eating) and bitter apples (for juice and cider). However, if we would like to predict sweetness better (e.g., get a basket of sweet apples for our guests), we may prefer a more complex rule: red and big → sweet (YQ → A) with P(A|YQ)=1.0, P(¬A|¬(YQ))=0.75, δ(YQ, A)=0.18, γ(YQ, A)=1.82. This rule produces a basket of 40 big, red apples, all of them sweet, and another basket of 60 green or small apples, 45 of them bitter. So, depending on our modelling purposes, we may want to choose the variable-based interpretation and select the first rule, or the value-based interpretation and select the latter rule. This decision affects also which goodness measure should be used. Leverage suits the variable-based interpretation, because its absolute value is the same for all event combinations, but it may miss interesting dependencies between events. Lift, on the other hand, suits the value-based interpretation, because it favours rules where events are strongly dependent. However, it is not a reliable measure alone, because it ranks well also coincidental 'noise rules' (e.g., apple maggot → bitter). Therefore, it has to be accompanied with statistical significance tests.
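The leverage and lift values quoted in the apple example can be recomputed directly from the basket counts. The following is a minimal sketch; the function name `leverage_and_lift` and its argument names are hypothetical, and it assumes the usual definitions $\delta(X, A) = P(XA) - P(X)P(A)$ and $\gamma(X, A) = P(XA) / (P(X)P(A))$.

```python
def leverage_and_lift(n, fr_x, fr_a, fr_xa):
    """Return (leverage, lift) of rule X -> A from absolute frequencies."""
    p_x, p_a, p_xa = fr_x / n, fr_a / n, fr_xa / n
    leverage = p_xa - p_x * p_a
    lift = p_xa / (p_x * p_a)
    return leverage, lift

# Counts taken from Example 2: 100 apples, 55 of them sweet.
# red -> sweet: 60 red apples, 55 of them sweet.
red_rule = leverage_and_lift(n=100, fr_x=60, fr_a=55, fr_xa=55)
# red and big -> sweet: 40 big red apples, all of them sweet.
red_big_rule = leverage_and_lift(n=100, fr_x=40, fr_a=55, fr_xa=40)
```

The first rule has the larger leverage and the second the larger lift, which is the trade-off between the variable-based and the value-based reading of the same data.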
In general, the variable-based interpretation tends to produce more reliable patterns, in the sense that the discovered dependencies hold well in future data (see e.g., Hämäläinen, 2010a, Ch. 5). However, there are applications where the value-based interpretation may better identify interesting dependency rules. One example is gene-disease data, where many alleles or allele combinations ($X$) are rare. Still, they may have strong and statistically significant dependencies with some diseases ($D$). Medical scientists would certainly want to find such dependencies $X \rightarrow D$, even if the overall dependency between variables $X$ and $D$ would be weak or insignificant.
In the following sections, we will examine how statistical significance is tested in the variable-based and value-based interpretations.

Sampling models for the variable-based interpretation
In the variable-based interpretation, the significance of dependency rule $X \rightarrow A$ is determined by classical independence tests. The task is to estimate the probability of the observed or a more 'extreme' contingency table, assuming that variables $I_X$ and $A$ were actually independent. There is no consensus on how the extremeness relation should be defined, but intuitively, contingency table $\tau_i$ is more extreme than table $\tau_j$, if the dependence between $X$ and $A$ is stronger in $\tau_i$ than in $\tau_j$. So, any measure for the strength of dependence between variables can be used as a discrepancy measure to order contingency tables. The simplest such measure is leverage, but the odds ratio is also commonly used. We note that the odds ratio is not defined when $N_{X\neg A} N_{\neg XA} = 0$, and some special policy is needed for these cases. In the following, we will denote the relation "table $\tau_i$ is equally or more extreme than table $\tau_j$" by $\tau_i \succeq \tau_j$.
The probability of each contingency table $\tau_i$ depends on the assumed statistical model $M$. Model $M$ defines the space of all possible contingency tables $T_M$ (under the model assumptions) and the probability $P(\tau_i \mid M)$ of each table $\tau_i \in T_M$. Because the task is to test independence, the assumed model should satisfy the independence assumption $P(XA)=P(X)P(A)$ in some form. For the probabilities $P(\tau_i \mid M)$ holds
$$\sum_{\tau_i \in T_M} P(\tau_i \mid M) = 1.$$
The p-value of an observed table $\tau_o$ is then the total probability of all tables that are at least as extreme, $p = \sum_{\tau_i \succeq \tau_o} P(\tau_i \mid M)$.
Classically, statistical models for independence testing have been divided into three main categories (sampling schemes) (Barnard, 1947; Pearson, 1947), which we call multinomial, double binomial, and hypergeometric models. In the statistics literature (e.g., Barnard, 1947; Upton, 1982), the corresponding sampling schemes are called double dichotomy, 2 × 2 comparative trial, and 2 × 2 independence trial.
In the following, we describe the three models using the classical urn metaphor. However, because there are two binary variables of interest, I X and A, we cannot use the basic urn model with white and black balls. Instead, we will use an apple basket model, with red and green, sweet and bitter apples, like in Example 2.

Multinomial model
In the multinomial model, it is assumed that the real probabilities of sweet red apples, bitter red apples, sweet green apples, and bitter green apples are defined by parameters $p_{XA}$, $p_{X\neg A}$, $p_{\neg XA}$, and $p_{\neg X\neg A}$. The probability of red apples is $p_X$ and of green apples $1 - p_X$. Similarly, the probability of sweet apples is $p_A$ and of bitter apples $1 - p_A$. According to the independence assumption, $p_{XA} = p_X p_A$, $p_{X\neg A} = p_X(1 - p_A)$, $p_{\neg XA} = (1 - p_X)p_A$, and $p_{\neg X\neg A} = (1 - p_X)(1 - p_A)$. A sample of $n$ apples is taken randomly from an infinite basket (or from a finite basket with replacement). Now the probability of obtaining $N_{XA}$ sweet red apples, $N_{X\neg A}$ bitter red apples, $N_{\neg XA}$ sweet green apples, and $N_{\neg X\neg A}$ bitter green apples is defined by the multinomial probability
$$P(N_{XA}, N_{X\neg A}, N_{\neg XA}, N_{\neg X\neg A} \mid n, p_X, p_A) = \binom{n}{N_{XA},\, N_{X\neg A},\, N_{\neg XA},\, N_{\neg X\neg A}} p_X^{N_X}(1-p_X)^{n-N_X}\, p_A^{N_A}(1-p_A)^{n-N_A}. \qquad (6)$$
Since data size $n$ is given, the contingency tables can be defined by triplets $\langle N_{XA}, N_{X\neg A}, N_{\neg XA} \rangle$ or, equivalently, triplets $\langle N_X, N_A, N_{XA} \rangle$. Therefore, the space of all possible contingency tables is
$$T_M = \{\langle N_{XA}, N_{X\neg A}, N_{\neg XA} \rangle \mid N_{XA}, N_{X\neg A}, N_{\neg XA} \in \{0, \ldots, n\},\ N_{XA} + N_{X\neg A} + N_{\neg XA} \leq n\}.$$
For estimating the p-value with Equation (5), we should still solve two problems. First, the parameters $p_X$ and $p_A$ are unknown. The most common solution is to estimate them by the observed relative frequencies (maximum likelihood estimates). Second, we should decide when a contingency table $\tau_i$ is equally or more extreme than the observed contingency table $\tau_o$. For this purpose, we have to select the discrepancy measure, which evaluates the overall dependence in a contingency table, when only the data size $n$ is fixed. Examples of such measures are leverage and the odds ratio.
In practice, the multinomial test is seldom used, but the multinomial model is an important theoretical model, from which other models can be derived as special cases.

Double binomial model
In the double binomial model, it is assumed that we have two infinite baskets, one for red and one for green apples. Let us call these the red and the green basket. In the red basket, the probability of sweet apples is $p_{A|X}$ and of bitter apples $1 - p_{A|X}$, and in the green basket the probabilities are $p_{A|\neg X}$ and $1 - p_{A|\neg X}$. According to the independence assumption, the probability of sweet apples is the same in both baskets: $p_A = p_{A|X} = p_{A|\neg X}$. A sample of $fr(X)$ apples is taken randomly from the red basket and another random sample of $fr(\neg X)$ apples is taken from the green basket. The probability of obtaining $N_{XA}$ sweet apples among the selected $fr(X)$ red apples is defined by the binomial probability
$$P(N_{XA} \mid fr(X), p_A) = \binom{fr(X)}{N_{XA}} p_A^{N_{XA}} (1-p_A)^{fr(X)-N_{XA}}.$$
Similarly, the probability of obtaining $N_{\neg XA}$ sweet apples among the selected green apples is
$$P(N_{\neg XA} \mid fr(\neg X), p_A) = \binom{fr(\neg X)}{N_{\neg XA}} p_A^{N_{\neg XA}} (1-p_A)^{fr(\neg X)-N_{\neg XA}}.$$
Because the two samples are independent from each other, the probability of obtaining $N_{XA}$ sweet apples from $fr(X)$ red apples and $N_{\neg XA}$ sweet apples from $fr(\neg X)$ green apples is the product of the two binomials
$$P(N_{XA}, N_{\neg XA} \mid n, fr(X), p_A) = \binom{fr(X)}{N_{XA}} \binom{fr(\neg X)}{N_{\neg XA}} p_A^{N_A} (1-p_A)^{n-N_A}, \qquad (7)$$
where $N_A = N_{XA} + N_{\neg XA}$ is the total number of the obtained sweet apples. (Here $fr(\neg X)$ was dropped from the condition, because $n$ is given.) We note that the double binomial probability is not exchangeable with respect to the roles of $X$ and $A$, i.e., generally $P(N_{XA}, N_{\neg XA} \mid n, fr(X), p_A) \neq P(N_{XA}, N_{X\neg A} \mid n, fr(A), p_X)$.
In practice, this means that the probability of obtaining fr(XA) sweet red apples, fr(X¬A) bitter red apples, fr(¬XA) sweet green apples, and fr(¬X¬A) bitter green apples is (nearly always) different in the model of the red and green baskets from the model of the sweet and bitter baskets.
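The asymmetry between the two basket models can be checked numerically. The sketch below evaluates the product of two binomials for one small contingency table, first with red and green baskets and then with sweet and bitter baskets. The function name `double_binomial_prob` and the toy counts are assumptions made for illustration only.

```python
from math import comb

def double_binomial_prob(n_first, n_second, size_first, size_second, p):
    """Probability of n_first successes in the first basket and n_second
    successes in the second basket, when both share success probability p."""
    successes = n_first + n_second
    total = size_first + size_second
    return (comb(size_first, n_first) * comb(size_second, n_second)
            * p ** successes * (1 - p) ** (total - successes))

# Toy table (hypothetical): n = 6, fr(XA) = 1, fr(X~A) = 2,
# fr(~XA) = 0, fr(~X~A) = 3, so fr(X) = 3 and fr(A) = 1.

# Red and green baskets: count sweet apples, p_A estimated as 1/6.
red_green = double_binomial_prob(1, 0, 3, 3, 1 / 6)

# Sweet and bitter baskets: count red apples, p_X estimated as 3/6.
sweet_bitter = double_binomial_prob(1, 2, 1, 5, 0.5)
```

For this table the two models give clearly different probabilities (about 0.20 versus about 0.16), although both describe the same observed counts.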
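Since $fr(X)$ and $fr(\neg X)$ are given, each contingency table is defined as a pair $\langle N_{XA}, N_{\neg XA} \rangle$ or, equivalently, $\langle N_A, N_{XA} \rangle$. The space of all possible contingency tables is
$$T_M = \{\langle N_{XA}, N_{\neg XA} \rangle \mid N_{XA} = 0, \ldots, fr(X);\ N_{\neg XA} = 0, \ldots, fr(\neg X)\}.$$
We note that $N_A$ is not fixed, and therefore $N_A$ is generally not equal to the observed $fr(A)$.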
For estimating the significance with Equation (5), we should once again estimate the unknown parameter $p_A$. The most common solution is to estimate $p_A$ from the data, and set $p_A = P(A)$. For the discrepancy measure, we can use for example leverage or the odds ratio. However, if we also fix $N_A = fr(A)$, it becomes easy to define when a dependency is stronger than the observed one. The positive dependence between red and sweet apples becomes stronger, when $N_{XA}$ increases, and the negative dependence between sweet and green apples becomes stronger, when $N_{\neg XA} = fr(A) - N_{XA}$ decreases. Now the p-value gets a simple expression
$$p = \sum_{i=0}^{J_1} \binom{fr(X)}{fr(XA)+i} \binom{fr(\neg X)}{fr(\neg XA)-i} P(A)^{fr(A)} P(\neg A)^{fr(\neg A)},$$
where $J_1 = \min\{fr(X\neg A), fr(\neg XA)\}$. Another common solution is to approximate the p-value with asymptotic tests, which are discussed later.

Hypergeometric model
In the hypergeometric model, there is no sampling from an infinite basket. Instead, we can assume that we are given a finite basket of $n$ apples, containing exactly $fr(X)$ red apples and $fr(\neg X)$ green apples. We test all $n$ apples and find that $fr(XA)$ of red apples are sweet and $fr(\neg XA)$ of green apples are sweet. The question is how probable our basket is, or the set of all at least equally extreme baskets, among all possible apple baskets with $fr(X)$ red apples, $fr(\neg X)$ green apples, $fr(A)$ sweet apples, and $fr(\neg A)$ bitter apples. Now the baskets correspond to contingency tables. The number of all possible baskets with the fixed totals $fr(X)$, $fr(\neg X)$, $fr(A)$, and $fr(\neg A)$ is
$$\sum_{i=0}^{fr(A)} \binom{fr(X)}{i} \binom{fr(\neg X)}{fr(A)-i} = \binom{n}{fr(A)}.$$
The equality follows from Vandermonde's identity. (We recall that customarily $\binom{m}{l} = 0$, when $l > m$.) Assuming that all baskets with these fixed totals are equally likely, the probability of a basket with $N_{XA}$ sweet red apples is
$$P(N_{XA} \mid n, fr(X), fr(A)) = \frac{\binom{fr(X)}{N_{XA}} \binom{fr(\neg X)}{fr(A)-N_{XA}}}{\binom{n}{fr(A)}}.$$
Because all totals are fixed, the extremeness relation is also easy to define. Positive dependence is stronger than observed, when $N_{XA} > fr(XA)$. For the p-value, it is enough to sum the probabilities of baskets containing at least $fr(XA)$ sweet red apples. The resulting p-value is
$$p_F = \sum_{i=0}^{J_1} \frac{\binom{fr(X)}{fr(XA)+i} \binom{fr(\neg X)}{fr(\neg XA)-i}}{\binom{n}{fr(A)}}, \qquad (8)$$
where $J_1 = \min\{fr(X\neg A), fr(\neg XA)\}$. (Instead of $J_1$, we could give an upper range $fr(A)$, because the zero terms disappear.) This p-value is known as Fisher's p, because it is the quantity calculated in Fisher's exact test, which is an exact permutation test. We give it a special symbol $p_F$, because it will be used later. For negative dependence between red and sweet apples (or positive dependence between green and sweet apples) the p-value is
$$p = \sum_{i=0}^{J_2} \frac{\binom{fr(X)}{fr(XA)-i} \binom{fr(\neg X)}{fr(\neg XA)+i}}{\binom{n}{fr(A)}},$$
where $J_2 = \min\{fr(XA), fr(\neg X\neg A)\}$.
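The one-sided hypergeometric p-value for positive dependence is straightforward to evaluate with exact integer binomial coefficients. The following sketch assumes that the table is given by the four counts $n$, $fr(X)$, $fr(A)$, and $fr(XA)$; the function name `fisher_p` is hypothetical.

```python
from math import comb

def fisher_p(n, fr_x, fr_a, fr_xa):
    """One-sided Fisher's p for positive dependence between X and A.

    Sums the hypergeometric probabilities of all tables with the same
    marginal totals and at least fr_xa rows where X and A co-occur.
    """
    fr_not_x = n - fr_x
    fr_not_x_a = fr_a - fr_xa        # rows with A but without X
    fr_x_not_a = fr_x - fr_xa        # rows with X but without A
    j1 = min(fr_x_not_a, fr_not_x_a)
    favourable = sum(
        comb(fr_x, fr_xa + i) * comb(fr_not_x, fr_not_x_a - i)
        for i in range(j1 + 1)
    )
    return favourable / comb(n, fr_a)

# Marginals of the Figure 4 setting: fr(X) = 50, fr(A) = 30, n = 100.
p_independent = fisher_p(100, 50, 30, 15)   # leverage 0
p_maximal = fisher_p(100, 50, 30, 30)       # leverage 0.15
```

Because Python integers have arbitrary precision, the binomial coefficients are exact and only the final division is rounded; for very large data sets a log-space evaluation would probably be preferable.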

Asymptotic measures
We have seen that the p-values in the multinomial and double binomial models are quite difficult to calculate. However, the p-value can often be approximated easily using asymptotic measures. With certain assumptions, the resulting p-values converge to the correct p-values, when the data size n (or fr(X) and fr(¬X)) tend to infinity. In the following, we introduce two commonly used asymptotic measures for independence testing: the χ 2 -measure and mutual information. In statistics, the latter corresponds to the log likelihood ratio (Neyman and Pearson, 1928). The main idea of asymptotic tests is that instead of estimating the probability of the contingency table as such, we calculate some better behaving test statistic T. If T gets value t, we estimate the probability of P(T ≥ t) (assuming that large T values indicate a strong dependency).
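In the case of the $\chi^2$-test, the test statistic is the $\chi^2$-measure. Now the variables are binary and Equation (2) reduces into a simpler form:
$$\chi^2 = \sum_{i=0}^{1} \sum_{j=0}^{1} \frac{(fr(I_X=i, A=j) - nP(I_X=i)P(A=j))^2}{nP(I_X=i)P(A=j)} = \frac{n(P(X, A) - P(X)P(A))^2}{P(X)P(\neg X)P(A)P(\neg A)} = \frac{n\delta^2(X, A)}{P(X)P(\neg X)P(A)P(\neg A)}.$$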
So, in principle, each term measures how much the observed frequency fr(I X =i, A= j) deviates from its expectation nP(I X =i)P(A= j), under the independence assumption.
If the data size $n$ is sufficiently large and none of the expected frequencies is too small, the $\chi^2$-measure follows approximately the $\chi^2$-distribution with one degree of freedom. This can be derived from the double binomial model as follows: $N_{XA}$ and $N_{\neg XA}$ are binomial variables with expected values $\mu_1 = fr(X)p_A$ and $\mu_2 = fr(\neg X)p_A$ and standard deviations $\sigma_1 = \sqrt{fr(X)p_A(1 - p_A)}$ and $\sigma_2 = \sqrt{fr(\neg X)p_A(1 - p_A)}$. If the unknown parameter $p_A$ is estimated by $P(A)$, we can calculate z-scores for $N_{XA}$ and $N_{\neg XA}$:
$$z_1 = \frac{fr(XA) - fr(X)P(A)}{\sqrt{fr(X)P(A)P(\neg A)}}, \qquad z_2 = \frac{fr(\neg XA) - fr(\neg X)P(A)}{\sqrt{fr(\neg X)P(A)P(\neg A)}}.$$
The z-score measures how many standard deviations the observed frequency deviates from its expectation. If $fr(X)$ and $fr(\neg X)$ are sufficiently large, then both $z_1$ and $z_2$ follow the standard normal distribution $N(0, 1)$. Therefore, their squares and also the sum of the squares $\chi^2 = z_1^2 + z_2^2$ follow the $\chi^2$-distribution. As a classical rule of thumb (Fisher, 1925), the $\chi^2$-measure can be used only if all expected frequencies $nP(I_X=i)P(A=j)$, $i, j \in \{0, 1\}$, are at least 5. However, the approximations can still be poor in some situations, when the underlying binomial distributions are skewed, e.g., if $P(A)$ is near 0 or 1, or if $fr(X)$ and $fr(\neg X)$ are far from each other (Yates, 1984; Agresti, 1992). According to Carriere (2001), this is quite typical for data in medical science.
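The identity between the leverage form of the $\chi^2$-measure and the sum of the two squared z-scores can be verified numerically. This is a small sketch with hypothetical function names; both functions take the same four counts and assume that $p_A$ is estimated by the observed $P(A)$.

```python
from math import sqrt

def chi2_from_leverage(n, fr_x, fr_a, fr_xa):
    """Chi-squared measure of a 2x2 table written in terms of leverage."""
    p_x, p_a = fr_x / n, fr_a / n
    delta = fr_xa / n - p_x * p_a
    return n * delta ** 2 / (p_x * (1 - p_x) * p_a * (1 - p_a))

def chi2_from_z_scores(n, fr_x, fr_a, fr_xa):
    """Same measure as the sum of squared z-scores of the two baskets."""
    p_a = fr_a / n
    fr_not_x = n - fr_x
    fr_not_x_a = fr_a - fr_xa
    z1 = (fr_xa - fr_x * p_a) / sqrt(fr_x * p_a * (1 - p_a))
    z2 = (fr_not_x_a - fr_not_x * p_a) / sqrt(fr_not_x * p_a * (1 - p_a))
    return z1 ** 2 + z2 ** 2

# Hypothetical counts: n = 100, fr(X) = 50, fr(A) = 30, fr(XA) = 20.
chi2_a = chi2_from_leverage(100, 50, 30, 20)
chi2_b = chi2_from_z_scores(100, 50, 30, 20)
```

Both calls return 100/21 (about 4.76) up to rounding. All four expected frequencies are at least 15 here, so the rule of thumb quoted above is satisfied.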
One reason for the inaccuracy of the $\chi^2$-measure is that the original binomial distributions are discrete while the $\chi^2$-distribution is continuous. A common solution is to make a continuity correction and subtract 0.5 from the absolute difference between each observed frequency and its expected frequency (e.g., $|fr(XA) - nP(X)P(A)|$) before squaring. According to Yates (1984), the resulting continuity corrected $\chi^2$-measure can give a good approximation to Fisher's $p_F$, if the underlying hypergeometric distribution is not markedly skewed. However, according to Haber (1980), the resulting $\chi^2$-value can underestimate the significance, while the uncorrected $\chi^2$-value overestimates it.
Mutual information is another popular asymptotic measure, which has been used to test independence. For binary variables, Equation (3) becomes
$$MI = \log \frac{P(XA)^{P(XA)}\, P(X\neg A)^{P(X\neg A)}\, P(\neg XA)^{P(\neg XA)}\, P(\neg X\neg A)^{P(\neg X\neg A)}}{P(X)^{P(X)}\, P(\neg X)^{P(\neg X)}\, P(A)^{P(A)}\, P(\neg A)^{P(\neg A)}}.$$
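Mutual information is actually an information theoretic measure, but in statistics, $2n \cdot MI$ is known as the log likelihood ratio or the G-test of independence. The likelihood ratio can be derived from the multinomial model as follows: The likelihood of the observed counts $fr(XA), \ldots, fr(\neg X\neg A)$ is defined by
$$L(p_{XA}, p_{X\neg A}, p_{\neg XA}, p_{\neg X\neg A}) = p_{XA}^{fr(XA)}\, p_{X\neg A}^{fr(X\neg A)}\, p_{\neg XA}^{fr(\neg XA)}\, p_{\neg X\neg A}^{fr(\neg X\neg A)}.$$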
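The relation between mutual information and the G-statistic can be checked numerically. The sketch below assumes natural logarithms (with another base, the factor $2n$ would need a matching constant) and uses the common form $G = 2 \sum O \ln(O/E)$ over observed and expected cell counts; the function names are hypothetical.

```python
from math import log

def cell_counts(n, fr_x, fr_a, fr_xa):
    """Observed counts keyed by (i, j), meaning I_X = i and A = j."""
    return {(1, 1): fr_xa, (1, 0): fr_x - fr_xa,
            (0, 1): fr_a - fr_xa, (0, 0): n - fr_x - fr_a + fr_xa}

def mutual_information(n, fr_x, fr_a, fr_xa):
    """Mutual information of I_X and A in nats (natural logarithm)."""
    p_x = {1: fr_x / n, 0: 1 - fr_x / n}
    p_a = {1: fr_a / n, 0: 1 - fr_a / n}
    mi = 0.0
    for (i, j), count in cell_counts(n, fr_x, fr_a, fr_xa).items():
        if count > 0:                      # 0 * log 0 is taken as 0
            p = count / n
            mi += p * log(p / (p_x[i] * p_a[j]))
    return mi

def g_statistic(n, fr_x, fr_a, fr_xa):
    """G = 2 * sum(observed * ln(observed / expected))."""
    row = {1: fr_x, 0: n - fr_x}
    col = {1: fr_a, 0: n - fr_a}
    g = 0.0
    for (i, j), count in cell_counts(n, fr_x, fr_a, fr_xa).items():
        if count > 0:
            g += count * log(count * n / (row[i] * col[j]))
    return 2 * g
```

For any table, `2 * n * mutual_information(...)` and `g_statistic(...)` agree up to rounding, and both are zero when the observed counts equal their expectations under independence.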

Comparison of models
The main difference between the classical sampling models is which of the variables $N$, $N_X$, and $N_A$ are considered fixed. In the multinomial model all variables except $N=n$ are randomized. However, if the model is conditioned with $N_X=fr(X)$, it leads to the double binomial model. If the double binomial model is conditioned with $N_A=fr(A)$, it leads to the hypergeometric model. For completeness, we could also consider the Poisson model, where all variables, including $N$, are unfixed Poisson variables. If the Poisson model is conditioned with the given data size, $N=n$, it leads to the multinomial model (Lehmann and Romano, 2005, ch. 4.6-4.7).
Selecting the correct model and defining the extremeness relation is a controversial problem, which statisticians (mostly Fisherian vs. Neyman-Pearsonian schools) have argued about for the last century (see e.g., Yates, 1984; Agresti, 1992; Lehmann, 1993; Upton, 1982; Howard, 1998). Because the multinomial and double binomial models are computationally demanding, the discussion has mainly focused on whether to use the hypergeometric model or asymptotic measures based on the double binomial model. One problem is that even if one or both marginal totals, $N_X$ and/or $N_A$, are kept unfixed, the observed counts $fr(X)$ and/or $fr(A)$ are still used to estimate the unknown parameters. Therefore, Fisher and his followers have suggested that we should always assume both $N_X$ and $N_A$ fixed and use Fisher's exact test or, when it is too heavy to compute, a suitable asymptotic test. Their opponents have pointed out that the results generalize better outside the data set, if some variables are kept unfixed, but, nevertheless, they have also suggested using asymptotic tests, which are conditional on the observed counts.
According to our cross-validation experiments (Hämäläinen, 2012), the $\chi^2$-measure can be quite unreliable, in the sense that the discovered dependency rules may not hold in the test data at all or their lift and leverage values differ significantly between the training and test sets. In contrast, Fisher's p has turned out to be a very robust and reliable measure in the dependency rule search. Mutual information is a good alternative and it often produces the same rules as $p_F$.
Figure 4 shows an example of different p-values as functions of $fr(XA)$, when $fr(X)=50$, $fr(A)=30$, and $n=100$. In all models, the discrepancy measure is leverage. At the beginning of the line, leverage is $\delta=0$, and at the end, it is maximal, $\delta=0.15$. Since the double binomial model is asymmetric with respect to $X$ and $A$, the values are slightly different, when $fr(X)$ and $fr(A)$ are reversed. Generally, Fisher's exact test gives more cautious p-values than the multinomial and double binomial models, when the data size is small and/or the dependence is weak. However, at the end of the line the hypergeometric model actually gives the smallest p-value, although the differences are not visible. In this example, the $\chi^2$-based approximation matches the multinomial p-value well for weaker dependencies. In general, all p-values approach each other, when the data size is increased.

Sampling models for the value-based interpretation
In the value-based interpretation the idea is that we would like to find events $XA$ or $X\neg A$, which express a strong dependency, even if the dependency between variables $I_X$ and $A$ were relatively weak. In this case, the strength of the dependency is usually measured by lift, because leverage has the same absolute value for all events $XA$, $X\neg A$, $\neg XA$, $\neg X\neg A$. However, lift alone is not a reliable measure, because it obtains its maximum value also when $fr(XA=i) = fr(X) = fr(A=i) = 1$ ($i \in \{0, 1\}$), i.e., when the rule occurs on just one row (Hahsler et al, 2006). Such a rule is quite likely due to chance and hardly interesting. Therefore, we should either test the null hypothesis of no positive dependence $H_0: \Gamma \leq 1$ (Benjamini and Leshno, 2005), that the lift is at most some threshold $H_0: \Gamma \leq \gamma_0$, $\gamma_0 > 1$ (i.e., at most weak positive dependence) (Lallich et al, 2007), or simply calculate the probability of observing such a large lift value, if $X$ and $A$ were actually independent (independence testing, $H_0: \Gamma=1$).
The p-value is defined like in the variable-based testing by Equation (5). The only difference is how to define the extremeness relation $\tau_i \succeq \tau_j$. A necessary condition for the extremeness of table $\tau_i$ over $\tau_j$ is that in $\tau_i$ the lift is larger than in $\tau_j$. However, since the lift is largest, when $N_X$ and/or $N_A$ are smallest (and $N_{XA}=N_X$ or $N_{XA}=N_A$), it is sensible to require that also $N_{XA}$ is larger in $\tau_i$ than in $\tau_j$.
If both $N_X$ and $N_A$ are fixed, then the lift is larger than observed if and only if the leverage is larger than observed, and it is enough to consider tables, where $N_{XA} \geq fr(XA)$. However, if either $N_X$, $N_A$, or both are unfixed, then we should always check the lift $\Gamma = \frac{nN_{XA}}{N_X N_A}$, and compare it to the observed lift $\gamma(X, A)$. Next, we will first survey how the value-based significance of dependence has been defined in the previous research. Then we will analyze the problem using the three classical models. Finally, we derive an alternative binomial test and the corresponding asymptotic measure for the value-based significance of dependence.

Value-based significance in the previous research
In the previous research on association rules, some authors (Dehaspe and Toivonen, 2001; Lallich et al, 2007, 2005; Bruzzese and Davino, 2003; Megiddo and Srikant, 1998) have speculated how to test the null hypothesis $\Gamma=1$. For some reason, it has been taken for granted that the exact probability is defined by a binomial model in the part of the data where $X$ is true. This is equivalent to adopting the double binomial model, but inspecting just one of the baskets, the red apples. So, one tries to decide whether there is a dependency between the red colour and sweetness by taking a sample of $fr(X)$ red apples from the red basket. It is assumed that $N_{XA} \sim Bin(fr(X), p_A)$, and the unknown parameter $p_A$ is estimated from the data, as usual. For positive dependence, the p-value is defined as (Dehaspe and Toivonen, 2001)
$$p = \sum_{i=fr(XA)}^{fr(X)} \binom{fr(X)}{i} P(A)^i P(\neg A)^{fr(X)-i}.$$
We see that $N_X=fr(X)$ is the only variable which has to be fixed; even $N$ can be unfixed. In the work by Lallich et al (2005), it was explicitly noted that $N_A$ is unfixed, and therefore, in the positive case, $i$ goes from $fr(XA)$ to $fr(X)$, not to $\min\{fr(X), fr(A)\}$. The idea is that when $N_{XA} \geq fr(XA)$, then $\frac{N_{XA}}{fr(X)} \geq P(A|X)$, and since $p_A=P(A)$ was fixed, then also $\Gamma \geq \gamma(X, A)$. Similarly, in the negative case, $\Gamma \leq \gamma(X, A)$. So, the test checks correctly all cases, where the lift is at least as large (or as small) as observed.
Since the cumulative binomial probability is quite difficult to calculate, it is common to estimate it asymptotically by the z-score. In the case of positive dependence, the binomial variable $N_{XA}$ has expected value $\mu=fr(X)P(A)$ and standard deviation $\sigma=\sqrt{fr(X)P(A)P(\neg A)}$. The corresponding z-score is (Lallich et al, 2005; Bruzzese and Davino, 2003)
$$z(X \rightarrow A) = \frac{fr(XA) - \mu}{\sigma} = \frac{fr(XA) - fr(X)P(A)}{\sqrt{fr(X)P(A)P(\neg A)}}.$$
If $fr(X)$ is sufficiently large and $P(A)$ is not too near to 1 or 0, the z-score follows the standard normal distribution. However, when the expected frequency $fr(X)P(A)$ is low (as a rule of thumb $< 5$), the binomial distribution is positively skewed. This means that the z-score overestimates the significance.
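The single-basket binomial test and its z-score approximation can be sketched as follows. The sketch assumes positive dependence, $N_{XA} \sim Bin(fr(X), P(A))$ with $P(A)$ estimated from the data, and an upper tail that starts at the observed $fr(XA)$; the function names are hypothetical.

```python
from math import comb, sqrt

def binomial_p_positive(fr_x, fr_xa, p_a):
    """Upper-tail probability P(N_XA >= fr_xa) for N_XA ~ Bin(fr_x, p_a)."""
    return sum(
        comb(fr_x, i) * p_a ** i * (1 - p_a) ** (fr_x - i)
        for i in range(fr_xa, fr_x + 1)
    )

def z_score(fr_x, fr_xa, p_a):
    """Number of standard deviations by which fr_xa exceeds fr_x * p_a."""
    return (fr_xa - fr_x * p_a) / sqrt(fr_x * p_a * (1 - p_a))

# Hypothetical rule: X holds on 100 rows, 60 of them also contain A,
# and P(A) = 0.5 in the whole data.
p_exact = binomial_p_positive(100, 60, 0.5)
z = z_score(100, 60, 0.5)
```

Here `z` is exactly 2.0. Note that neither function uses $n$ or $fr(A)$ directly, which reflects the point made above that only $N_X$ is fixed in this model.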
The problem of this z-score (and the original binomial model) is that two rules with different antecedents $X$ are not comparable. So, all rules (with different $X$) are thought to be from different populations and are tested in different parts of the data. We note that this idea of checking just one basket also underlies the J-measure (Smyth and Goodman, 1992):
$$J = P(XA) \log \frac{P(XA)}{P(X)P(A)} + P(X\neg A) \log \frac{P(X\neg A)}{P(X)P(\neg A)},$$
which is often used to evaluate association rules. It is a reduced version of the mutual information, where all terms containing $\neg X$ have been dropped. One solution to the comparison problem is to normalize the z-score and divide it by the maximum value
$$\max\{z(X \rightarrow A)\} = \frac{fr(X)P(\neg A)}{\sqrt{fr(X)P(A)P(\neg A)}}.$$
The result is
$$cf(X \rightarrow A) = \frac{P(XA) - P(X)P(A)}{P(X)P(\neg A)} = \frac{P(A|X) - P(A)}{P(\neg A)},$$
which is the same as Shortliffe's certainty factor (Shortliffe and Buchanan, 1975) for positive dependence. For the negative dependence, the certainty factor is
$$cf(X \rightarrow A) = \frac{P(A|X) - P(A)}{P(A)}.$$
According to Berzal et al (2001), the certainty factor can produce quite accurate results when used for prediction purposes. However, it is very sensitive to spurious and superfluous rules, because all rules with confidence 1.0 gain the maximum score, even if they occur on just one row. In addition, the certainty factor does not suit search purposes, because the upper bound is always the same.
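The sensitivity of the certainty factor to single-row rules is easy to demonstrate. The sketch below assumes the commonly used two-case form of the certainty factor, $(P(A|X) - P(A))/P(\neg A)$ for positive and $(P(A|X) - P(A))/P(A)$ for negative dependence, with $0 < P(A) < 1$; the function name `certainty_factor` is hypothetical.

```python
def certainty_factor(fr_x, fr_xa, p_a):
    """Certainty factor of rule X -> A, assuming 0 < p_a < 1."""
    confidence = fr_xa / fr_x            # P(A | X)
    if confidence >= p_a:                # positive dependence or independence
        return (confidence - p_a) / (1 - p_a)
    return (confidence - p_a) / p_a      # negative dependence

# A rule that occurs on a single row gets the same maximal score as a
# rule that holds on a thousand rows (hypothetical counts).
cf_single_row = certainty_factor(fr_x=1, fr_xa=1, p_a=0.01)
cf_well_supported = certainty_factor(fr_x=1000, fr_xa=1000, p_a=0.5)
```

Both values are 1.0, so the measure by itself cannot separate the two rules; a significance test is needed for that.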
Finally, we note that this single binomial probability (like the double binomial probability) is not an exchangeable measure in the sense that generally $p(X \to A) \neq p(A \to X)$. The same holds for the corresponding z-score, J-measure, and certainty factor. This can be counter-intuitive when the task is to search for statistical dependencies, and these measures should be used with care. In addition, with this binomial model the significance of the positive dependence between X and A is generally not the same as the significance of the negative dependence between X and ¬A. With the corresponding z-score, the significance values are related, and

$$z_{pos}(X \to A) = -z_{neg}(X \to \neg A),$$

where $z_{pos}$ denotes the z-score of positive dependence and $z_{neg}$ the z-score of negative dependence. With the certainty factor and J-measure, the significance of positive dependence between X and A and the significance of negative dependence between X and ¬A are equal.

Value-based significance under the classical models
Let us now analyze the value-based significance of dependency rules using the classical statistical models. For simplicity, we consider only positive dependence. We assume that the extremeness relation is defined by lift $\Gamma$ and frequency $N_{XA}$, i.e., a contingency table is more extreme than the observed contingency table if it has $\Gamma \geq \gamma(X, A)$ and $N_{XA} \geq \mathit{fr}(XA)$.
In the multinomial model, only the data size N = n is fixed. Each contingency table, described by the triplet $\langle N_X, N_A, N_{XA} \rangle$, has probability $P(N_{XA}, N_X - N_{XA}, N_A - N_{XA}, n - N_X - N_A + N_{XA} \mid n, p_X, p_A)$, defined by Equation (6). The p-value is obtained when we sum over all possible triplets where $\Gamma \geq \gamma(X, A)$:

$$p_{mul} = \sum_{N_X=0}^{n} \; \sum_{N_{XA}=\mathit{fr}(XA)}^{n} \; \sum_{N_A=N_{XA}}^{Q_1} P(N_{XA}, N_X - N_{XA}, N_A - N_{XA}, n - N_X - N_A + N_{XA} \mid n, p_X, p_A),$$

where $Q_1 = \frac{n N_{XA}}{\gamma(X,A) N_X}$. (We note that the terms are zero if $N_X < N_{XA}$.)

In the double binomial model, $N_X = \mathit{fr}(X)$ is also fixed. Each contingency table, described by the pair $\langle N_A, N_{XA} \rangle$, has probability $P(N_{XA}, N_A - N_{XA} \mid n, \mathit{fr}(X), p_A)$ by Equation (7). Now we should sum over all possible pairs where $\Gamma \geq \gamma(X, A)$:

$$p_{double} = \sum_{N_{XA}=\mathit{fr}(XA)}^{\mathit{fr}(X)} \; \sum_{N_A=N_{XA}}^{Q_2} P(N_{XA}, N_A - N_{XA} \mid n, \mathit{fr}(X), p_A),$$

where $Q_2 = \frac{n N_{XA}}{\gamma(X,A)\,\mathit{fr}(X)}$.

In the hypergeometric model, $N_A = \mathit{fr}(A)$ is also fixed. As noted before, the extremeness relation is now the same as in the variable-based case, and the p-value is defined by Equation (8). This is an important observation, because it means that Fisher's exact test also tests significance in the value-based interpretation. The same is not true for the first two models, where rule X → A can get a different p-value in the variable-based and value-based interpretations.

Alternative binomial model
When $N_X$ and/or $N_A$ are unfixed, the p-values are quite heavy to compute. Therefore, we will now derive a simple binomial model, where it is enough to sum over just one variable. The binomial probability can be further estimated by an equivalent z-score, or the z-score can be used as an asymptotic test measure as such. Contrary to the traditional binomial model, we will test the rule in the whole data set, which means that the p-values of different rules are comparable.
Let us suppose that we have a large basket containing N apples. The probability of red apples is $p_X$, of sweet apples $p_A$, and of sweet red apples $p_{XA}$. According to the independence assumption, $p_{XA} = p_X p_A$. Therefore, the basket contains $N p_{XA} = N p_X p_A$ sweet red apples. A sample of n apples is taken randomly from the basket. The exact probability of obtaining $N_{XA}$ sweet red apples is defined by the hypergeometric distribution:

$$P(N_{XA} \mid N, n, p_{XA}) = \frac{\binom{N p_{XA}}{N_{XA}} \binom{N - N p_{XA}}{n - N_{XA}}}{\binom{N}{n}}.$$

When N tends to infinity, the distribution approaches the binomial distribution, and the probability of $N_{XA}$ given n and $p_{XA}$ becomes

$$P(N_{XA} \mid n, p_{XA}) = \binom{n}{N_{XA}} p_{XA}^{N_{XA}} (1 - p_{XA})^{n - N_{XA}}.$$

When the unknown parameter $p_{XA} = p_X p_A$ is estimated from the data, the probability becomes

$$P(N_{XA} \mid n, P(X)P(A)) = \binom{n}{N_{XA}} (P(X)P(A))^{N_{XA}} (1 - P(X)P(A))^{n - N_{XA}}.$$
Since $N_{XA}$ is the only variable which occurs in the probability, the extremeness relation is defined simply by $\tau_i \succeq \tau_o \Leftrightarrow N_{XA} \geq \mathit{fr}(XA)$. When the unknown parameter $p_X p_A$ is estimated from the data, the p-value of rule X → A is

$$p_{bin}(X \to A) = \sum_{i=\mathit{fr}(XA)}^{n} \binom{n}{i} (P(X)P(A))^{i} (1 - P(X)P(A))^{n-i}.$$

This is in fact the classical binomial test in the whole data set, while the traditionally used binomial (Equation (10)) is a binomial test in the part of data where X is true. Lemma 1 (Appendix A) shows that $p_{bin}$ also gives an upper bound for the p-value in the multinomial model, i.e.,

$$p_{mul}(X \to A) \leq p_{bin}(X \to A).$$

Since $N_{XA}$ is a binomial variable with expected value $\mu = nP(X)P(A)$ and standard deviation $\sigma = \sqrt{nP(X)P(A)(1 - P(X)P(A))}$, the corresponding z-score is

$$z(X \to A) = \frac{\mathit{fr}(XA) - nP(X)P(A)}{\sqrt{nP(X)P(A)(1 - P(X)P(A))}} = \frac{\sqrt{n}\,\delta(X, A)}{\sqrt{P(X)P(A)(1 - P(X)P(A))}}. \tag{15}$$

The same z-score was also given by Lallich et al (2005) for the case where $N_X$ and $N_A$ were unfixed. Because the discrete binomial distribution is approximated by the continuous normal distribution, the continuity correction can be useful, as with the $\chi^2$-measure.
We note that this binomial probability and the corresponding z-score are exchangeable, which is intuitively a desired property. However, the statistical significance of positive dependence between X and A is generally not the same as the significance of negative dependence between X and ¬A. For example, the z-score for negative (or, equally, positive) dependence between X and ¬A is

$$z(X \to \neg A) = \frac{\mathit{fr}(X\neg A) - nP(X)P(\neg A)}{\sqrt{nP(X)P(\neg A)(1 - P(X)P(\neg A))}} = \frac{-\sqrt{n}\,\delta(X, A)}{\sqrt{P(X)P(\neg A)(1 - P(X)P(\neg A))}}.$$
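The whole-data binomial test and its z-score can be sketched as follows. This is a minimal illustration under the assumption that P(X) and P(A) are estimated by relative frequencies; the function names and the counts in the second example (n = 100, fr(X) = 30, fr(A) = 40, fr(XA) = 20) are hypothetical.

```python
from math import comb, sqrt

def p_bin_whole(n, fr_x, fr_a, fr_xa):
    """Upper-tail binomial p-value P(N_XA >= fr_xa), N_XA ~ Bin(n, P(X)P(A)),
    i.e. the rule is tested in the whole data set."""
    p = (fr_x / n) * (fr_a / n)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(fr_xa, n + 1))

def z_whole(n, fr_x, fr_a, fr_xa):
    """z-score of the whole-data binomial variable."""
    p = (fr_x / n) * (fr_a / n)
    return (fr_xa - n * p) / sqrt(n * p * (1 - p))

# Exchangeability: swapping the roles of X and A leaves the values unchanged.
p_xa = p_bin_whole(100, 30, 50, 30)
p_ax = p_bin_whole(100, 50, 30, 30)

# Positive dependence X -> A versus negative dependence X -> not A
# (fr(not A) = 60 and fr(X not A) = 10 follow from the counts above).
z_pos = z_whole(100, 30, 40, 20)
z_neg = z_whole(100, 30, 60, 10)
print(p_xa, p_ax, z_pos, z_neg)
```

The two z-scores have opposite signs but different magnitudes, because the variance term uses P(X)P(A) in one case and P(X)P(¬A) in the other.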

Comparison of models
The main decision in the value-based interpretation is whether the significance of dependency rule X → A is evaluated in the whole data or only in the part of data where X holds. The latter models have two inherent problems: First, the p-values of discovered rules are not comparable because each of them has been tested in a different part of data. Second, the measures are not exchangeable, which means that X → A can get a totally different ranking than A → X, even if they express the same dependency between events.

When the classical statistical models are used, the only difference to the variable-based interpretation is that now the discrepancy measure is lift. Computationally, the only practical choices are the hypergeometric model and asymptotic measures. The hypergeometric model produces reliable results, but it tends to favour large leverage instead of lift, which might be more interesting in the value-based interpretation. In addition, one should check for each rule X → A that the dependency is due to strong γ(X, A) and not due to γ(¬X, ¬A). With this checking, the $\chi^2$-measure can also be used. According to our experiment (Hämäläinen, 2010a, Ch.5), the $\chi^2$-measure and the z-score (Equation (15)) tend to find rules with the strongest lift (among all compared measures), but at the same time the results are also the most unreliable. Robustness of the $\chi^2$-measure can be improved with the continuity correction, but with the z-score, it has only a marginal effect. One solution is to use the z-score only for preliminary pruning, and select the rules with the corresponding binomial probability (Hämäläinen, 2010b).

[Figure not recovered. Surviving legend entries: double bin, multinom., z1-based p, z2-based p. Surviving caption fragment: "[...] (10)) models are asymmetric with respect to X and A, the values are slightly different, when fr(X) and fr(A) are reversed. The hypergeometric model is the only one which gives the same p-values as in the variable-based case (Figure 4)."]
Generally, the new binomial model, bin1, gives the most cautious p-values and the multinomial and double binomial models give the smallest p-values. The p-values approach each other when the dependence becomes stronger, but still the differences are quite remarkable ($p_F = 1.6 \cdot 10^{-12}$ vs. $p_{bin1} = 1.1 \cdot 10^{-4}$).
In this figure, the single binomial model (bin2) looks fine. However, there are pathological cases where it and the related z-score and the J-measure give rankings opposite to those of the other significance measures. This is demonstrated in the following example.

Example 3 Let us compare two rules, X → A and Y → A, in the value-based interpretation. The frequencies are n = 100, fr(A) = 50, fr(X) = fr(XA) = 30, fr(Y) = 60, and fr(YA) = 50, i.e., P(A|X) = 1 and P(Y|A) = 1. The p-values, z-scores, and J-values are given in Table 5. All of the traditional association rule measures ($p_{bin2}$, its z-score $z_2$, and the J-measure) favour rule X → A, while all the other measures (the new binomial $p_{bin1}$ and its z-score $z_1$, multinomial $p_{mul}$, double binomial $p_{double}$, and Fisher's $p_F$) rank rule Y → A better. In the three classical models, the difference between the rules is quite remarkable.
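The opposite rankings in Example 3 can be checked with a short sketch that recomputes the single-variable measures from the stated frequencies. The helper names are hypothetical, the J-measure is computed with base-2 logarithms (an assumption), and the multinomial and double binomial p-values are omitted because they require the multi-variable sums.

```python
from math import comb, log2, sqrt

def binom_tail(n, k, p):
    """P(N >= k) for N ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def j_measure(n, fr_x, fr_a, fr_xa):
    """J-measure of X -> A; terms with zero joint frequency contribute 0."""
    total = 0.0
    for joint, marginal in ((fr_xa, fr_a), (fr_x - fr_xa, n - fr_a)):
        if joint > 0:
            total += joint / n * log2(joint * n / (fr_x * marginal))
    return total

def fisher_p(n, fr_x, fr_a, fr_xa):
    """One-sided Fisher's exact p-value for positive dependence."""
    j_max = min(fr_x - fr_xa, fr_a - fr_xa)
    num = sum(comb(fr_x, fr_xa + j) * comb(n - fr_x, fr_a - fr_xa - j)
              for j in range(j_max + 1))
    return num / comb(n, fr_a)

def measures(n, fr_x, fr_a, fr_xa):
    p_a = fr_a / n
    p_ind = (fr_x / n) * p_a
    return {
        "p_bin2": binom_tail(fr_x, fr_xa, p_a),           # test where X holds
        "z2": (fr_xa - fr_x * p_a) / sqrt(fr_x * p_a * (1 - p_a)),
        "J": j_measure(n, fr_x, fr_a, fr_xa),
        "p_bin1": binom_tail(n, fr_xa, p_ind),            # test in whole data
        "z1": (fr_xa - n * p_ind) / sqrt(n * p_ind * (1 - p_ind)),
        "p_F": fisher_p(n, fr_x, fr_a, fr_xa),
    }

rule_x = measures(100, 30, 50, 30)   # X -> A
rule_y = measures(100, 60, 50, 50)   # Y -> A
for name in rule_x:
    print(name, rule_x[name], rule_y[name])
```

With these inputs, the measures tested only where the antecedent holds prefer X → A, while the whole-data binomial, its z-score, and Fisher's p-value prefer Y → A.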

Redundancy and significance of improvement
An important task in dependency rule discovery is to identify redundant rules, which add little or no additional information on statistical dependencies to other rules. According to a classical definition (Bastide et al, 2000) "An association rule is redundant if it conveys the same information or less general information than the information conveyed by another rule of the same usefulness and the same relevance." However, what is considered useful or relevant depends on the modelling purpose and numerous definitions of redundant or uninformative rules have been proposed.
In traditional association rule research, the goal has been to find all sufficiently frequent and 'confident' (high precision) rules. Thus, if the sufficient frequency or precision of a rule can be derived from other rules, the rule can be considered redundant (e.g., (Aggarwal and Yu, 2001; Goethals et al, 2005; Cheng et al, 2008; Balcazar, 2010; Li and Hamilton, 2004); see also a good overview by Balcazar (2010)). On the other hand, when the goal is to find statistical dependency rules, then rules that are merely side-products of other dependencies can be considered uninformative. An important type of such dependencies are superfluous specializations (X → A) of more general dependency rules (Y → A, $Y \subsetneq X$). This concept of superfluous rules covers earlier notions of non-optimal or superfluous classification rules (Li, 2006), (statistically) redundant rules (Hu and Rao, 2007; Hämäläinen, 2012) and unproductive rules (Webb, 2007).

Superfluous rules are a common problem, because rules 'inherit' dependencies from their ancestor rules unless their extra factors reverse the dependency. This is regrettable, because undetected superfluous rules may lead to quite serious misconceptions. For example, if disease D is caused by gene group Y (i.e., Y → D), we are likely to find a large number of other dependency rules YQ → D, where Q contains coincidental genes. Now one could conclude that the combination $YQ_1$ (with some arbitrary $Q_1$) predisposes to disease D and begin preventive care only with these patients.
Intuitively, the idea of superfluousness is clear. A superfluous rule X → A contains extraneous variables $Q \subsetneq X$, which have no effect or only weaken the original dependency X \ Q → A. It is also possible that Q apparently improves the dependency but the improvement is spurious (due to chance). In this case, the apparent improvement occurs only in the sample, and it may be detected with appropriate statistical significance tests. We recall that significance tests do not necessarily detect all superfluous rules, but we can always adjust the significance level to prune more or fewer potentially superfluous rules. Formalizing the idea of superfluousness is more difficult, because it depends on the measure used, the assumed statistical model, the required significance level, and, most of all, whether we are using the value-based or variable-based interpretation. Therefore, we give here only a tentative, generic definition of superfluousness and significant improvement.
Definition 7 (Superfluous dependency rules) Let T be a goodness measure which is increasing (vs. decreasing) by goodness. Let M be a statistical model which is used for determining the statistical significance and α the selected significance level. Let us denote the improvement of rule X → A=i over rule Y → A=i, i ∈ {0, 1}, by

$$\mathit{Imp}(X \to A{=}i \mid Y \to A{=}i) = T(X \to A{=}i) - T(Y \to A{=}i)$$

(vs. $T(Y \to A{=}i) - T(X \to A{=}i)$ for a decreasing T). Rule X → A=i is superfluous (given T, M, and α) if there exists rule Y → A=i, $Y \subsetneq X$, such that

(i) $\mathit{Imp}(X \to A{=}i \mid Y \to A{=}i) \leq 0$, or
(ii) the improvement of rule X → A=i over rule Y → A=i is not significant at level α (value-based interpretation), or
(iii) the improvement of rule X → A=i over rule Y → A=i is less significant than the improvement of rule ¬Y → A≠i over rule ¬X → A≠i (variable-based interpretation).
In the following, we will consider only rule X → A, to simplify notations. In addition, we will notate X=YQ so that we can compare rule YQ → A to a simpler rule Y → A.
Let us first consider the problem of superfluousness in the value-based interpretation, where the significance tests are somewhat simpler. In traditional association rule research, the goodness measure T is precision (or, equivalently, lift, because the consequent is fixed). Rule X → A is called productive if P(A|X) > P(A|Y) for all $Y \subsetneq X$ (e.g., (Bayardo et al, 2000; Webb, 2007)).
The significance of productivity is tested separately for all Y → A, $Y \subsetneq X$, and all p-values should be below some fixed threshold α. In each test, the null hypothesis is that there is no improvement in the precision: P(A|YQ) = P(A|Y). The condition means that Q and A are conditionally independent given Y. The significance is estimated by calculating p(Q → A|Y), i.e., the p-value of rule Q → A in the set where Y holds. Now it is quite natural to assume fr(Y), fr(YQ), and fr(YA) fixed, which leads to the hypergeometric model. The corresponding test is Fisher's exact test for conditional independence, and the significance of productivity of YQ → A over Y → A is (Webb, 2007)

$$p(Q \to A \mid Y) = \sum_{j=0}^{J_1} \frac{\binom{\mathit{fr}(YQ)}{\mathit{fr}(YQA)+j} \binom{\mathit{fr}(Y\neg Q)}{\mathit{fr}(Y\neg QA)-j}}{\binom{\mathit{fr}(Y)}{\mathit{fr}(YA)}}, \tag{16}$$

where $J_1 = \min\{\mathit{fr}(YQ\neg A), \mathit{fr}(Y\neg QA)\}$. When the $\chi^2$-measure is used to estimate the significance of productivity, the equation is (Liu et al, 1999)

$$\chi^2(Q \to A \mid Y) = \frac{\mathit{fr}(Y)\,(P(Y)P(YQA) - P(YQ)P(YA))^2}{P(YQ)P(Y\neg Q)P(YA)P(Y\neg A)}.$$
In principle, measure T can be any goodness measure for statistical dependence between values, including the previously introduced binomial probabilities and corresponding z-scores. However, different measures can disagree on their ranking of rules and on which rules are considered superfluous. For example, leverage has a strong bias in favour of general rules when compared to lift or precision. This is clearly seen from the expression δ(Y, A) = P(Y)(P(A|Y) − P(A)) = P(Y)P(A)(γ(Y, A) − 1). On the other hand, asymptotic measures like the z-score and the $\chi^2$-measure tend to overestimate the significance when the frequencies are small. Therefore, it is possible that a rule is not superfluous when evaluated with an asymptotic measure, but superfluous when the exact p-values are calculated.
In the variable-based interpretation, superfluousness of dependency rules is more [...]

Example 4 Let us reconsider the rules X → A (= YQ → A) and Y → A in Example 3. Rule X → A is clearly productive with respect to Y → A (P(A|X) = 1.0 vs. P(A|Y) = 0.83). Similarly, rule ¬Y → ¬A is productive with respect to ¬X → ¬A (P(¬A|¬Y) = 1.0 vs. P(¬A|¬X) = 0.71).
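Let us now calculate the significance of productivity using Fisher's exact test. In the value-based interpretation, we evaluate only the first improvement. With fr(Y) = 60, fr(YQ) = fr(YQA) = 30 and fr(YA) = 50, the sum has a single term ($J_1 = 0$):

$$p(Q \to A \mid Y) = \frac{\binom{30}{30}\binom{30}{20}}{\binom{60}{50}} \approx 3.99 \cdot 10^{-4}.$$

The value is so small that we can assume that the productivity is significant and X → A is not superfluous. In the variable-based interpretation, we also evaluate the second improvement. With fr(¬X) = 70, fr(¬Y) = fr(¬Y¬A) = 40 and fr(¬X¬A) = 50:

$$p(\neg Y \to \neg A \mid \neg X) = \frac{\binom{40}{40}\binom{30}{10}}{\binom{70}{50}} \approx 1.86 \cdot 10^{-10}.$$

This value is much smaller than the previous one, which means that the improvement of ¬Y → ¬A over ¬X → ¬A is more significant than the improvement of X → A over Y → A. Thus, we would consider rule X → A superfluous. We would have reached the same conclusion if we had simply compared the $p_F$-values of both rules: $p_F(Y \to A) = 7.47 \cdot 10^{-19} < 1.60 \cdot 10^{-12} = p_F(X \to A)$.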
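The two productivity tests of Example 4 can be recomputed with a short sketch of Fisher's exact test in the subset where the more general antecedent holds. The function and argument names are hypothetical; the frequencies are derived from Example 3 under the stated assumption that X = YQ.

```python
from math import comb

def productivity_p(fr_y, fr_yq, fr_ya, fr_yqa):
    """One-sided Fisher's exact p-value of Q -> A within the rows where Y holds,
    with fr(Y), fr(YQ) and fr(YA) treated as fixed margins."""
    fr_yq_not_a = fr_yq - fr_yqa          # fr(Y Q not-A)
    fr_y_not_q_a = fr_ya - fr_yqa         # fr(Y not-Q A)
    j1 = min(fr_yq_not_a, fr_y_not_q_a)
    num = sum(comb(fr_yq, fr_yqa + j) * comb(fr_y - fr_yq, fr_y_not_q_a - j)
              for j in range(j1 + 1))
    return num / comb(fr_y, fr_ya)

# Improvement of X -> A over Y -> A, tested in the 60 rows where Y holds.
p_first = productivity_p(fr_y=60, fr_yq=30, fr_ya=50, fr_yqa=30)

# Improvement of not-Y -> not-A over not-X -> not-A, tested in the 70 rows
# where not-X holds (not-Y plays the role of "YQ", not-A the role of "A").
p_second = productivity_p(fr_y=70, fr_yq=40, fr_ya=50, fr_yqa=40)

print(p_first, p_second)
```

The second p-value is several orders of magnitude smaller than the first, which is the comparison used in the variable-based interpretation.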
Finally, we note that superfluousness is closely related to the concept of speciousness (Yule, 1903;Hämäläinen and Webb, 2017), where an observed unconditional dependency vanishes or changes its sign when conditioned on other variables, called confounding factors. The latter phenomenon, reversal of the direction of the dependency, is also known as Yule-Simpson's paradox. In the context of dependency rules, rule X → A is considered specious if there is another rule Y → A or Y → ¬A such that X and A are either independent or negatively dependent in the population when conditioned on Y and ¬Y. In the sample, either of the conditional dependencies may also appear as weakly positive and one has to test their significance with a suitable test, like Birch's exact test (Birch, 1964), conditional mutual information (Hämäläinen and Webb, 2017) or various χ 2 -based tests.
It is noteworthy that the confounding factor Y does not necessarily share any attributes with X. However, in the special case when $Y \subsetneq X$, Birch's exact test for speciousness of X → A with respect to Y → A reduces to Equation (16) (significance of productivity). On the other hand, Birch's exact test for speciousness of Y → A with respect to X → A is equivalent to Equation (17). So, testing superfluousness of X → A with respect to Y → A in the variable-based interpretation can be considered a special case of testing whether X → A is specious by Y → A or vice versa.

Dependency sets
Dependency sets are a general name for set-type patterns whose main requirement is that the attributes are mutually dependent. The classical notion of mutual dependence (Definition 3) is very inclusive because it suffices that X contains at least one subset $Y \subseteq X$ where $P(Y) \neq \prod_{A_i \in Y} P(A_i)$. This means that the property is monotonic, i.e., all supersets of a mutually dependent set are also mutually dependent. To avoid an excessive number of patterns, dependency sets usually represent only some of all mutually dependent sets, like minimal mutually dependent sets (Brin et al, 1997), sets that present new dependencies in comparison to their subsets (for some $A \in X$, $\delta(X \setminus \{A\}, A) \neq 0$) (Meo, 2000), or sets all of whose bipartitions express statistical dependence (for all $Y \subsetneq X$, $\delta(Y, X \setminus Y) \neq 0$) (Webb, 2010).
Compared to dependency rules, dependency sets offer a more compact presentation of dependencies, and in some contexts the reduction in the number of patterns can be quite drastic. This is evident when we recall that any set X can give rise to up to |X| rules of the form $X \setminus \{A_i\} \to A_i$ and up to $2^{|X|} - 2$ rules of the form $X \setminus Y \to Y$. In many cases, these permutation rules reflect the same statistical dependency. This is always true when |X| = 2 (A → B and B → A present the same dependency), but the same phenomenon can also occur with more complex sets, as the following observation demonstrates.
Observation 1 Let X be a set of binary attributes such that for all $Y \subsetneq X$, $P(Y) = \prod_{A_i \in Y} P(A_i)$ (i.e., attributes are mutually independent). Then for all non-empty $Z \subsetneq X$,

$$\delta(X \setminus Z, Z) = P(X) - P(X \setminus Z)P(Z) = P(X) - \prod_{A_i \in X} P(A_i).$$

This means that when all proper subsets of X express only mutual independence, then all permutation rules X \ Z → Z have the same leverage, frequency and expected frequency, and many goodness measures would rank them as equally good. In real world data, the condition seldom holds precisely, but the same phenomenon tends to occur also when all subsets express at most weak dependence. In this case, it is intuitive to report only set X instead of listing all of its permutation rules.

In principle, all dependency rules could be presented with dependency sets without losing any other information than the division into an antecedent and a consequent. The reason is that for any dependency rule X \ Y → Y, set X is also mutually dependent. This follows immediately from the fact that mutual independence of X (Definition 3) implies bipartition independence between X \ Y and Y (Definition 4). However, some set dependency patterns have extra requirements that may potentially miss interesting dependency rules.
The approaches for finding statistically significant dependency sets can be roughly divided into two categories: searching for frequent sets and testing their statistical significance afterwards and searching directly for dependency sets with statistical goodness measures and significance tests.
Frequent itemsets (Agrawal et al, 1996) are undoubtedly the most popular type of set patterns in knowledge discovery. A frequent itemset is a set of true-valued binary attributes (called items, according to the original market-basket setting) whose frequency exceeds some user-specified minimum frequency threshold. However, being frequent does not ensure that the elements in an itemset are associated. For example, consider two elements A and B that each occur in all but one example. Then itemset {A, B} will occur in all but at most two examples, making it extremely frequent. Yet if A and B are missing from different examples, it represents a negative association rather than the positive association that association discovery typically seeks.
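The frequent-but-negatively-associated case can be verified with exact arithmetic. The sketch below assumes a hypothetical data set of n = 1000 examples in which A and B are each missing from exactly one example and the two missing examples are different.

```python
from fractions import Fraction

n = 1000                      # hypothetical number of examples
fr_a = n - 1                  # A is missing from one example
fr_b = n - 1                  # B is missing from a different example
fr_ab = n - 2                 # so {A, B} is missing from two examples

p_ab = Fraction(fr_ab, n)
p_a, p_b = Fraction(fr_a, n), Fraction(fr_b, n)

# Leverage: observed joint probability minus its expectation under independence.
leverage = p_ab - p_a * p_b
print(float(p_ab), leverage)
```

Algebraically the leverage equals −1/n² in this setting, so the itemset is almost maximally frequent and still (very slightly) negatively associated.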
Frequent itemsets can still be used as a starting point for finding statistically significant frequent dependency sets. If the original minimum frequency threshold was sufficiently small, it is possible to find all dependency sets in this way. The basic idea is to assume mutual independence of all attributes and then evaluate for each itemset the probability that its frequency is at least as large as observed. In principle, any significance testing approach could be used, but often this is done with randomization testing that approximates exact p-values. In swap randomization (e.g., (Gionis et al, 2007;Cobb and Chen, 2003)), both column marginals (attribute frequencies) and row marginals (numbers of items on each row) are kept fixed. The latter requirement allows suppression of associations that are due to co-occurrence of items only due to their appearing solely in rows that contain many items. A variant is iterative randomization (Hanhijärvi et al, 2009). This approach begins with fixed row and column margins, but on each iteration, it adds the most significant frequent itemset as a new constraint. The randomization problem is computationally very hard, and thus it is sufficient that the frequencies of itemsets hold only approximately. The process is repeated until no more significant itemsets can be found.
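The idea of swap randomization can be sketched in a few lines. This is a simplified illustration, not the procedure of any cited paper: the function names, the number of swap attempts, and the add-one estimate of the empirical p-value are all assumptions made for the example.

```python
import random
from collections import Counter

def margins(rows):
    """Row sizes and item frequencies of a list of transactions."""
    return [len(r) for r in rows], Counter(item for r in rows for item in r)

def swap_randomize(rows, n_attempts, seed=0):
    """Return a randomized copy of the data with unchanged row and column margins."""
    rng = random.Random(seed)
    rows = [set(r) for r in rows]
    cells = [(r, item) for r, items in enumerate(rows) for item in sorted(items)]
    for _ in range(n_attempts):
        i, j = rng.randrange(len(cells)), rng.randrange(len(cells))
        (r1, a), (r2, b) = cells[i], cells[j]
        # A swap is valid only if neither row already contains the incoming item.
        if b in rows[r1] or a in rows[r2]:
            continue
        rows[r1].remove(a); rows[r1].add(b)
        rows[r2].remove(b); rows[r2].add(a)
        cells[i], cells[j] = (r1, b), (r2, a)
    return rows

def empirical_p(rows, itemset, n_samples=200, n_attempts=None, seed=0):
    """Share of randomized data sets where the itemset is at least as frequent."""
    itemset = set(itemset)
    freq = lambda data: sum(itemset <= r for r in data)
    observed = freq(rows)
    n_attempts = n_attempts or 10 * sum(len(r) for r in rows)
    hits = sum(freq(swap_randomize(rows, n_attempts, seed + k)) >= observed
               for k in range(n_samples))
    return (hits + 1) / (n_samples + 1)

data = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"b", "d"}, {"a", "d"}]
randomized = swap_randomize(data, 100, seed=1)
print(margins(data) == margins(randomized), empirical_p(data, {"a", "b"}, 50))
```

Each accepted swap exchanges two items between two rows, so both the number of items per row and the frequency of every item stay fixed while co-occurrences are shuffled.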
An alternative approach is to search directly for the top-K or all sufficiently good dependency sets with statistical goodness measures. Sometimes, these set patterns are still called 'rules' or presented by the best rule that can be derived from the set. Examples are correlation rules by Brin et al (1997), dependency sets by Meo (2000), strictly non-redundant association rules by Hämäläinen (2010b, 2011), and, the most rigorous of all, self-sufficient itemsets by Webb (2010) and Webb and Vreeken (2014). The first three of these define relatively simple patterns, where it is required that dependency set X expresses mutual dependence, it adds new dependencies to its subsets $Y \subsetneq X$, and the dependency is significant with the selected measure. In (Brin et al, 1997), the deviation of P(X) from $\prod_{A_i \in X} P(A_i)$ is evaluated with the $\chi^2$-measure, and only the minimal significant dependency sets are presented. In (Meo, 2000), P(X) is compared to its maximal independence estimate that maximizes entropy, given frequencies of all $Y \subsetneq X$, |Y| = |X| − 1. Strictly non-redundant association rules are an intermediate form between set type and rule type patterns, where each set is presented by its best rule, evaluated with the z-score and binomial probability (Hämäläinen, 2010b) or the $\chi^2$-measure (Hämäläinen, 2011).
Self-sufficient itemsets are a more sophisticated pattern type with much stronger requirements. The core idea is that an itemset should only be considered interesting if its frequency cannot be explained by assuming independence between any partition of the items. That is, there should be no partition $Q \subsetneq X$, $X \setminus Q$ such that $P(X) \approx P(Q)P(X \setminus Q)$. [...] interesting. Nonetheless, under the self-sufficient itemset approach {M, P, G} can be discarded because it is not more frequent than would be expected by assuming independence between {G} and {M, P}.
In self-sufficient itemsets, this requirement is formalized as a test for productivity. It is required that there is a significant positive association between every partition of the itemset, when evaluated with Fisher's exact test. In addition, self-sufficient itemsets have two additional criteria: they have to be non-redundant and independently productive.
In the context of self-sufficient itemsets, set X is considered redundant if

$$\exists\, Y \subsetneq X,\ Z \subsetneq Y \text{ such that } \mathit{fr}(Y) = \mathit{fr}(Z).$$

The motivation is that if A is a necessary consequent of another set of items Z, then Y = {A} ∪ Z should be associated with everything with which Z is associated. For example, Z = {pregnant} entails A = female (Y = {female, pregnant}) and pregnant is associated with oedema. In consequence, X = {female, pregnant, oedema} is not likely to be interesting if {pregnant, oedema} is known.
A final form of test that can be employed is whether the frequency of an itemset X can be explained by the frequency of its productive and nonredundant supersets $Y \supsetneq X$. For example, if A, B and C are jointly necessary and sufficient for D, then all subsets of {A, B, C, D} that include D should be productive and nonredundant. However, they may be misleading, as they fail to capture the full conditions necessary for D. Webb (2008) proposes that if $Y \supsetneq X$ is productive and nonredundant, X should only be considered potentially interesting if it is independently productive, meaning that it passes tests for productivity when data covered by Y \ X are not considered.

Multiple testing problem
In this section, we discuss the problem of multiple hypothesis testing. We describe the main principles and popular correction methods and then introduce some special techniques for increasing power in the pattern discovery context.

Overview
The goal of pattern discovery is to find all sufficiently good patterns among exponentially many possible candidates. This leads inexorably to the problem of multiple hypothesis testing. The core of this problem is that the more patterns are tested, the more likely it is that spurious patterns pass their tests, causing type I error. This is easiest to demonstrate in the classical Neyman-Pearsonian hypothesis testing. Let us suppose we are testing m true null hypotheses (spurious patterns) and in each test the probability of type I error is exactly the selected significance level α. (In general, the probability is at most α, but with increasing power it approaches α.) In this case the expectation is that in m · α tests a type I error is committed and a spurious pattern passes the test. With normal significance levels this can be quite a considerable number. For example if α = 0.05 and 100000 spurious patterns are tested, we can expect 5000 of them to pass the test.
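The arithmetic behind this example can be sketched as follows. The expected count m · α follows directly from the text; the familywise error rate formula 1 − (1 − α)^m additionally assumes that the m tests are independent and that each has type I error probability exactly α, which is a simplifying assumption made only for this illustration.

```python
def expected_false_discoveries(m, alpha):
    """Expected number of true null hypotheses rejected among m tests at level alpha."""
    return m * alpha

def fwer_if_independent(m, alpha):
    """P(at least one type I error) among m independent tests at level alpha
    (independence is an assumption of this sketch)."""
    return 1 - (1 - alpha) ** m

print(expected_false_discoveries(100_000, 0.05))   # the example from the text
print(fwer_if_independent(100, 0.05))              # already close to 1 for m = 100
```

Even with only a hundred independent tests, at least one false discovery is almost certain at α = 0.05, which motivates the corrections discussed next.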
Solutions to the multiple testing problem try to control type I errors among all tests. In practice, there are two main approaches. The traditional approach is to control the familywise error rate, which is the probability of accepting at least one false discovery (rejecting a true null hypothesis). Using the notation of Table 6, the familywise error rate is FWER = P(V ≥ 1). Another, less stringent approach is to control the false discovery rate, which is the expected proportion of false discoveries,

$$\mathit{FDR} = E\left[\frac{V}{\max\{R, 1\}}\right],$$

where R is the total number of rejected null hypotheses. Since FDR ≤ FWER, control of FWER subsumes control of FDR. In the special case where all null hypotheses are true ($m = m_0$), FWER = FDR. The latter means that an FDR controlling method controls FWER in a weak sense, when the probability of type I errors is evaluated under the global null hypothesis $H_0^C = \cap_{i=1}^{m} H_i$ (all m null hypotheses are true). However, usually it is required that FWER should be controlled in a strong sense, under any set of true null hypotheses. (For details, see e.g., (Ge et al, 2003).) In general, FWER control is preferred when false discoveries are intolerable (e.g., accepting a new medical treatment) or when it is expected that most null hypotheses would be true, while FDR control is often preferred in exploratory research, where the number of potential patterns is large and false discoveries are less serious (e.g., (Goeman and Solari, 2011)).

Methods for multiple hypothesis testing
The general idea of multiple hypothesis testing methods is to make rejection of individual hypotheses more difficult by adjusting the significance level α (or the corresponding critical value of some test statistic) or, equivalently, adjusting individual p-values, $p_1, \ldots, p_m$, corresponding to hypotheses $H_1, \ldots, H_m$. When the goal is to control FWER at level α, the procedure determines for each hypothesis $H_i$ an adjusted significance level $\hat{\alpha}_i$ (possibly $\hat{\alpha}_1 = \ldots = \hat{\alpha}_m$) such that $H_i$ is rejected if and only if $p_i \leq \hat{\alpha}_i$. Alternatively, the p-value of $H_i$ can be adjusted and the adjusted p-value $\hat{p}_i$ is compared to the original significance level α. Now FWER can be expressed as

$$\mathit{FWER} = P\Big(\bigcup_{i \in K} \{P_i \leq \hat{\alpha}_i\}\Big) = P\Big(\bigcup_{i \in K} \{\hat{P}_i \leq \alpha\}\Big), \tag{19}$$

where K is the set of indices of true null hypotheses and $P_i$ and $\hat{P}_i$ denote random variables for the original and adjusted p-values.
The correction procedures are designed such that FWER ≤ α holds, at least asymptotically, when the underlying assumptions are met. In addition, it is usually required that the adjusted p-values have the same order as the original p-values, i.e., $p_i \leq p_j \Leftrightarrow \hat{p}_i \leq \hat{p}_j$. This 'monotonicity of p-values' is by no means necessary for FWER control, but it is in line with the statistical intuition, according to which a pattern should not be declared significant if a more significant pattern (with smaller p) is declared spurious (Westfall and Young, 1993, p. 65). Reporting adjusted p-values (together with original unadjusted p-values) is often recommended, since they are more informative than binary rejection decisions. However, individual p-values cannot be interpreted separately because $\hat{p}_i$ is the smallest α-level of the entire test procedure that rejects $H_i$ given the p-values or test statistics of all hypotheses (Ge et al, 2003).
Famous multiple testing procedures and their assumptions are listed in Table 7. Bonferroni and Šídák corrections as well as the single-step minP method are examples of single-step methods where the same adjusted significance level is applied to all hypotheses. All the other methods in the table are step-wise methods that determine individual significance levels for each hypothesis, depending on the order of p-values and rejection of other hypotheses.
Step-wise methods can be further divided into step-down methods (the Holm-Bonferroni method and the step-down minP method by Westfall and Young (1993)) that process hypotheses in ascending order by their p-values and step-up methods (the Hochberg, Benjamini-Hochberg and Benjamini-Hochberg-Yekutieli methods) that proceed in the opposite order. In general, single-step methods are least powerful and step-up methods most powerful, with the exception of the powerful minP methods.
The least powerful method for controlling FWER is the popular Bonferroni correction. The lack of power is due to two pessimistic assumptions: $m_0$ is estimated by its upper bound m and the probability of type I error by the upper bound $P((P_1 \leq \frac{\alpha}{m}) \vee \ldots \vee (P_m \leq \frac{\alpha}{m})) \leq \sum_{i=1}^{m} P(P_i \leq \frac{\alpha}{m})$ (Boole's or Bonferroni's inequality). Therefore, the Bonferroni correction is least powerful when many null hypotheses are false or the hypotheses are positively associated.

The Šídák correction (Šídák, 1967) is slightly more powerful, because it assumes independence of true null hypotheses and can thus use a lower upper bound for the probability of type I error. However, the method gives exact control of FWER only under the independence assumption. The control is not guaranteed if the true hypotheses are negatively dependent, and the method may be overly conservative if they are positively dependent.

The Holm-Bonferroni method (Holm, 1979) is a sequential variant of the Bonferroni method. It proceeds in a stepwise manner, comparing the smallest p-value to $\frac{\alpha}{m}$, like the Bonferroni method, but the largest to α. Therefore, it always rejects at least as many null hypotheses as the Bonferroni method, and the gain in power is greatest when most null hypotheses are false. The Hochberg method (Hochberg, 1988) can be considered a step-up variant of the Holm-Bonferroni method. It is more powerful but it has extra requirements for the dependency structure among hypotheses. Sufficient conditions for the Hochberg method (and the underlying Simes inequality) are independence and certain types of positive dependence (e.g., positive regression dependence on a subset (Benjamini and Yekutieli, 2001)) between true hypotheses. Šídák's method can also be implemented in a similar step-wise manner by using the criterion $p_i > 1 - (1 - \alpha)^{1/(m-i+1)}$ in the Holm-Bonferroni method. However, the resulting Holm-Šídák method also assumes independence of hypotheses.
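The FWER-controlling procedures described above can be sketched compactly. The functions below are illustrative implementations of the textbook rejection rules (not code from the cited works); each takes a list of raw p-values and returns a list of rejection decisions, and the example p-values are hypothetical.

```python
def bonferroni(p, alpha):
    """Single-step: reject H_i iff p_i <= alpha / m."""
    m = len(p)
    return [pi <= alpha / m for pi in p]

def sidak(p, alpha):
    """Single-step: reject H_i iff p_i <= 1 - (1 - alpha)^(1/m)."""
    m = len(p)
    threshold = 1 - (1 - alpha) ** (1 / m)
    return [pi <= threshold for pi in p]

def holm(p, alpha):
    """Step-down: go through p-values in ascending order and stop at the
    first one exceeding alpha / (m - rank), rank = 0, 1, ..."""
    m = len(p)
    order = sorted(range(m), key=p.__getitem__)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject

def hochberg(p, alpha):
    """Step-up: find the largest p-value satisfying its Holm threshold and
    reject it together with all smaller p-values."""
    m = len(p)
    order = sorted(range(m), key=p.__getitem__)
    reject = [False] * m
    for rank in range(m - 1, -1, -1):
        if p[order[rank]] <= alpha / (m - rank):
            for i in order[:rank + 1]:
                reject[i] = True
            break
    return reject

p1 = [0.01, 0.02, 0.03, 0.04]    # hypothetical p-values
p2 = [0.001, 0.013, 0.02, 0.5]
for method in (bonferroni, sidak, holm, hochberg):
    print(method.__name__, method(p1, 0.05), method(p2, 0.05))
```

On the first list only the step-up Hochberg rule rejects everything, while on the second list the Holm rule already rejects more than the Bonferroni rule; the Hochberg result is valid only under the dependency conditions noted in the text.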
The Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and the Benjamini-Hochberg-Yekutieli method (Benjamini and Yekutieli, 2001) are also step-up methods, but they control FDR instead of FWER. The Benjamini-Hochberg method is always at least as powerful as the Hochberg method, and the difference is most pronounced when there are many false null hypotheses. The Benjamini-Hochberg method is also based on the Simes inequality and has the same requirements for the dependency structure between true hypotheses (independence or certain types of positive dependence). The Benjamini-Hochberg-Yekutieli method also allows negative dependencies, but it is less powerful, and may sometimes be even more conservative than the Holm-Bonferroni method (Goeman and Solari, 2014).
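The step-wise procedures described above differ only in their sequence of critical values and in the direction in which the sorted p-values are scanned. The following Python sketch illustrates this under simplifying assumptions: the function names and the example p-values are hypothetical, the critical values are the textbook ones ($\alpha/(m-k+1)$ for Holm-Bonferroni and Hochberg, $k\alpha/m$ for Benjamini-Hochberg, with an extra harmonic-sum divisor for Benjamini-Hochberg-Yekutieli), and no dependency assumptions are checked.

```python
from typing import List, Sequence


def _step_down(pvals: Sequence[float], thresholds: Sequence[float]) -> List[bool]:
    """Scan p-values in ascending order and stop at the first one that fails."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] > thresholds[rank]:
            break
        reject[i] = True
    return reject


def _step_up(pvals: Sequence[float], thresholds: Sequence[float]) -> List[bool]:
    """Scan p-values in descending order; reject everything from the first pass down."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank in range(len(pvals) - 1, -1, -1):
        if pvals[order[rank]] <= thresholds[rank]:
            for i in order[: rank + 1]:
                reject[i] = True
            break
    return reject


def bonferroni(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    m = len(pvals)
    return _step_down(pvals, [alpha / (m - k) for k in range(m)])


def hochberg(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    m = len(pvals)
    return _step_up(pvals, [alpha / (m - k) for k in range(m)])


def benjamini_hochberg(pvals: Sequence[float], alpha: float = 0.05,
                       yekutieli: bool = False) -> List[bool]:
    m = len(pvals)
    # The Yekutieli variant divides the critical values by 1 + 1/2 + ... + 1/m.
    c = sum(1.0 / k for k in range(1, m + 1)) if yekutieli else 1.0
    return _step_up(pvals, [(k + 1) * alpha / (m * c) for k in range(m)])


# Hypothetical raw p-values, chosen only for illustration.
pvals = [0.001, 0.012, 0.02, 0.03, 0.2]
rejections = {
    "Bonferroni": sum(bonferroni(pvals)),
    "Holm-Bonferroni": sum(holm(pvals)),
    "Hochberg": sum(hochberg(pvals)),
    "Benjamini-Hochberg": sum(benjamini_hochberg(pvals)),
    "Benjamini-Hochberg-Yekutieli": sum(benjamini_hochberg(pvals, yekutieli=True)),
}
print(rejections)
```

On these illustrative p-values the Benjamini-Hochberg-Yekutieli procedure rejects fewer hypotheses than Holm-Bonferroni, which is the behaviour noted by Goeman and Solari (2014).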
The minP methods (Westfall and Young, 1993) present a different approach to multiple hypothesis testing. These methods are usually implemented with permutation testing or other resampling methods and thus they adapt to any dependency structure between null hypotheses. This makes them powerful methods and they have been shown to be asymptotically optimal for a broad class of testing problems (Meinshausen et al, 2011).
The minP methods are based on an alternative expression of FWER (Equation (19)): $FWER = P(\cup_{i \in K} \{P_i \le \hat{\alpha}\} \mid H_K) = P(\min_{i \in K} P_i \le \hat{\alpha} \mid H_K)$, where $H_K$ is the intersection of all true null hypotheses and $\hat{\alpha}$ is an adjusted significance level. Therefore, an optimal $\hat{\alpha}$ can be determined as an $\alpha$-quantile of the distribution of the minimum p-value among the true null hypotheses. In principle, any technique for estimating the $\alpha$-quantile can be used, but analytical methods are seldom available. However, the evaluation can also be done empirically, with resampling methods.
For strong control of FWER, the probability should be evaluated under $H_K$, which is unknown. Therefore, the estimation is done under the complete null hypothesis $H_0^C$. Strong control (at least partial strong control, (Rempala and Yang, 2013)) can still be obtained under certain extra conditions. One such condition is subset pivotality (Westfall and Young, 1993, p. 42), which requires the raw p-values (or other test statistics) of true null hypotheses to have the same joint distribution under $H_0^C$ and any other set of hypotheses. Since the true null hypotheses are unknown, the minimum p-value is determined among all null hypotheses (the single-step method) or among all unrejected null hypotheses (the step-down method). The resulting single-step adjustment is $\tilde{p}_i = P(\min_{1 \le j \le m} P_j \le p_i \mid H_0^C)$. A similar adjustment can be done with other test statistics $T$, if subset pivotality or other required conditions hold. Assuming that high $T$-values are more significant, the adjusted p-value is $\tilde{p}_i = P(\max_{1 \le j \le m} T_j \ge t_i \mid H_0^C)$, where $T_i$ is a random variable for the test statistic of $H_i$ and $t_i$ is its observed value. The probability under $H_0^C$ can be estimated with permutation testing, by permuting the data under $H_0^C$ and calculating the proportion of permuted data sets where the min P or max T value is at least as extreme as the observed $p_i$ or $t_i$. In pattern discovery, complete permutation testing is often infeasible, but there are more efficient approaches combining the minP correction with approximate permutation testing (e.g., (Hanhijärvi, 2011; Minato et al, 2014; Llinares López et al, 2015)). However, the time and space requirements may still be too large for many practical pattern mining purposes.
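The single-step max T adjustment can be sketched with a plain permutation test. The Python example below is only an illustration under stated assumptions: the data layout (binary class labels plus a list of attribute columns), the test statistic (absolute difference of group means) and the function names are hypothetical choices, and the estimate counts the observed data as one extra permutation, $(c_i + 1)/(B + 1)$, which is a common convention rather than something prescribed above. Permuting only the labels simulates $H_0^C$ while preserving the dependency structure among the columns.

```python
import random
from typing import List, Sequence, Tuple


def mean_diff_stats(columns: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> List[float]:
    """Return |mean(group 1) - mean(group 0)| for every column."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    stats = []
    for col in columns:
        s1 = sum(x for x, y in zip(col, labels) if y == 1)
        s0 = sum(col) - s1
        stats.append(abs(s1 / n1 - s0 / n0))
    return stats


def maxt_adjusted_pvalues(columns: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          n_perm: int = 999,
                          seed: int = 0) -> Tuple[List[float], List[float]]:
    """Single-step max T adjusted p-values estimated by label permutation."""
    rng = random.Random(seed)
    observed = mean_diff_stats(columns, labels)
    exceed = [0] * len(columns)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # labels are exchangeable under the complete null
        max_t = max(mean_diff_stats(columns, shuffled))
        for i, t in enumerate(observed):
            if max_t >= t:  # permuted maximum at least as extreme as observed t_i
                exceed[i] += 1
    adjusted = [(c + 1) / (n_perm + 1) for c in exceed]
    return observed, adjusted


# Hypothetical toy data: column 0 copies the labels, the others are noise.
data_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
columns = [list(labels)] + [[data_rng.randint(0, 1) for _ in labels] for _ in range(4)]
observed, adjusted = maxt_adjusted_pvalues(columns, labels)
for t, p in zip(observed, adjusted):
    print(f"t = {t:.3f}  adjusted p = {p:.3f}")
```

Every adjusted p-value is computed from the same permuted maxima, so a larger observed statistic never receives a larger adjusted p-value.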

Increasing power in pattern discovery
In pattern discovery, the main problem of multiple hypothesis testing is the huge number of possible hypotheses. This number is the same as the number of all possible patterns or the size of the search space that is usually exponential. If correction is done with respect to all possible patterns, the adjusted critical values may become so small that few patterns can be declared as significant. This means that one should always use as powerful correction methods as possible or control FDR instead of FWER when applicable, but this may still be insufficient. A complementary strategy is to reduce the number of hypotheses or otherwise target more power to those hypotheses that are likely to be interesting or significant. In the following, we describe three general techniques designed for this purpose: hold-out evaluation, filtering hypotheses and weighted hypothesis testing.
The idea of hold-out evaluation (Webb, 2007) (also known as two-stage testing, e.g., (Miller et al, 2001)) is to use only a part of the data for pattern discovery and save the rest for testing significance of patterns. The method consists of three steps: (i) Divide the data into an exploratory set D E and a hold-out set D H . (ii) Search for patterns in D E and select K patterns for testing. Note that the selection process at this step can use any principle suited to the application and need not involve hypothesis testing. (iii) Test the significance of the K patterns in D H using any multiple hypothesis testing procedure.
Now the number of hypotheses is only K which is typically much less than the size of the search space. This makes the method powerful, even if the p-values in the hold-out set are likely larger than they would have been in the whole data set. The power can be further enhanced by selecting powerful multiple testing methods, including methods that control FDR. A potential problem of hold-out evaluation is that the results may depend on how the data is partitioned. In a pathological case, a pattern may occur only in the exploratory set or only in the hold-out set, and thus remain undiscovered (Liu et al, 2011).
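The three steps of hold-out evaluation can be sketched as follows. This is a minimal illustration, not the procedure of Webb (2007) in full: the pattern type (pairs of items), the selection criterion (leverage in the exploratory set), the one-sided Fisher's exact test and the Bonferroni correction over the $K$ selected patterns are all assumptions made for the example, and the function names are hypothetical.

```python
import random
from itertools import combinations
from math import comb
from typing import Dict, List, Set, Tuple


def counts(data: List[Set[str]], x: str, y: str) -> Tuple[int, int, int, int]:
    """Return n, fr(X), fr(Y) and fr(XY) in the given data."""
    fr_x = sum(1 for t in data if x in t)
    fr_y = sum(1 for t in data if y in t)
    fr_xy = sum(1 for t in data if x in t and y in t)
    return len(data), fr_x, fr_y, fr_xy


def fisher_one_sided(n: int, fr_x: int, fr_y: int, fr_xy: int) -> float:
    """One-sided Fisher's exact test for positive dependence between X and Y."""
    tail = sum(comb(fr_x, i) * comb(n - fr_x, fr_y - i)
               for i in range(fr_xy, min(fr_x, fr_y) + 1))
    return tail / comb(n, fr_y)


def leverage(data: List[Set[str]], x: str, y: str) -> float:
    n, fr_x, fr_y, fr_xy = counts(data, x, y)
    return fr_xy / n - (fr_x / n) * (fr_y / n)


def hold_out_evaluation(exploratory: List[Set[str]], holdout: List[Set[str]],
                        k: int, alpha: float = 0.05
                        ) -> Dict[Tuple[str, str], Tuple[float, bool]]:
    items = sorted(set().union(*exploratory))
    # Step (ii): select K candidates from the exploratory set by any criterion.
    candidates = sorted(combinations(items, 2),
                        key=lambda pair: leverage(exploratory, *pair),
                        reverse=True)[:k]
    # Step (iii): test only those K patterns in the hold-out set.
    results = {}
    for x, y in candidates:
        p = fisher_one_sided(*counts(holdout, x, y))
        results[(x, y)] = (p, p <= alpha / len(candidates))
    return results


# Hypothetical toy data: items a and b always co-occur, c and d are unrelated.
data = []
for i in range(200):
    t = set()
    if i % 2 == 0:
        t.update({"a", "b"})
    if i % 3 == 0:
        t.add("c")
    if i % 5 == 0:
        t.add("d")
    data.append(t)

# Step (i): split the data into an exploratory and a hold-out set.
random.Random(0).shuffle(data)
exploratory, holdout = data[:100], data[100:]
for pattern, (p, significant) in hold_out_evaluation(exploratory, holdout, k=3).items():
    print(pattern, f"p = {p:.3g}", "significant" if significant else "not significant")
```

The correction divides $\alpha$ by $K = 3$ rather than by the number of all possible pairs, which is where the gain in power comes from.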
Another approach is to use filtering methods (see e.g., (Bourgon et al, 2010)) to select only promising hypotheses for significance testing. Ideally, the filter should prune out only true null hypotheses without compromising control of false discoveries. In practice, the true nulls are unknown and the filter uses some data characteristics to detect low power hypotheses that are unlikely to become rejected. Unfortunately, some filtering methods affect also the distribution of the test statistic and can violate strong control of FWER. As a solution, it has been suggested that the filtering statistic and the actual test statistic should be independent given a true null hypothesis (Bourgon et al, 2010).
In pattern discovery, one useful filtering method is to prune out so-called untestable hypotheses (Terada et al, 2013; Mantel, 1980) that cannot cause type I errors. This approach can be used when hypothesis testing is done conditionally on some data characteristics, like marginal frequencies. The idea is to determine beforehand, using only the given conditions, whether a hypothesis can ever achieve a sufficiently small p-value to be rejected at level $\alpha$. If this is not possible, then the hypothesis cannot contribute to FWER either, and it can be safely ignored when determining corrections for multiple hypothesis testing. For example, if attributes $A$ and $B$ have frequencies $fr(A) = 10$ and $fr(B) = 2$ and $n = 20$, then the lowest possible p-value for rule $A \to B$ with Fisher's exact test is $p_F = 0.24$. If $\alpha = 0.05$, then this pattern can never pass the test and the hypothesis is considered untestable. In practice, selecting testable hypotheses can improve the power of the method considerably.
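The numerical example above can be checked directly. With fixed margins, the smallest p-value of the one-sided Fisher's exact test is attained by the most extreme table, where $fr(AB) = \min(fr(A), fr(B))$, so it reduces to a single hypergeometric term. The following Python sketch assumes the one-sided test for positive dependence; the function names are hypothetical, and the iterative choice of the correction factor used by Terada et al (2013) is not reproduced here.

```python
from math import comb


def min_fisher_p(n: int, fr_a: int, fr_b: int) -> float:
    """Smallest attainable one-sided Fisher p-value for A -> B given the margins."""
    k = min(fr_a, fr_b)  # most extreme co-occurrence count allowed by the margins
    return comb(fr_a, k) * comb(n - fr_a, fr_b - k) / comb(n, fr_b)


def is_testable(n: int, fr_a: int, fr_b: int, threshold: float) -> bool:
    """A hypothesis is testable only if its minimum p-value can reach the threshold."""
    return min_fisher_p(n, fr_a, fr_b) <= threshold


p_min = min_fisher_p(n=20, fr_a=10, fr_b=2)
print(round(p_min, 2))                  # 0.24, i.e. 45/190
print(is_testable(20, 10, 2, 0.05))     # False: the rule can never be rejected
print(is_testable(20, 10, 10, 0.05))    # True: minimum p-value is 1/184756
```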
A third approach is to use a weighted multiple testing procedure (e.g., (Finos and Salmaso, 2007; Holm, 1979)) that gives more power to those hypotheses that are likely to be most interesting. Usually, the weights are given a priori according to the assumed importance of hypotheses, but it is also possible to determine optimal weights from the data to maximize the power of the test (see e.g., (Roeder and Wasserman, 2009)). The simplest approach is an allocated Bonferroni procedure (Rosenthal and Rubin, 1983) that allocates the total $\alpha$ among all $m$ hypotheses according to their importance. Each hypothesis $H_i$ is assigned an individual significance level $\hat{\alpha}_i$ such that $\sum_{i=1}^{m} \hat{\alpha}_i \le \alpha$. This is equivalent to a weighted Bonferroni procedure, where one determines weights $w_i$ such that $\sum_{i=1}^{m} w_i = m$ and rejects $H_i$ if $p_i \le w_i \alpha / m$. There are also weighted variants of other multiple correction procedures, like the weighted Holm-Bonferroni procedure (Holm, 1979) and the weighted Benjamini-Hochberg procedure (Benjamini and Hochberg, 1997). Usually, these procedures do not respect the monotonicity of p-values, which means that the most significant patterns may be missed if they were deemed uninteresting.
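A weighted Bonferroni procedure takes only a few lines. The Python sketch below uses a hypothetical function name and hypothetical weights and p-values; it also illustrates the loss of monotonicity mentioned above, since the hypothesis with the smallest p-value is not rejected when it receives a small weight.

```python
from math import isclose
from typing import List, Sequence


def weighted_bonferroni(pvals: Sequence[float], weights: Sequence[float],
                        alpha: float = 0.05) -> List[bool]:
    """Reject H_i if p_i <= w_i * alpha / m, where the weights sum to m."""
    m = len(pvals)
    if len(weights) != m or not isclose(sum(weights), m):
        raise ValueError("weights must have one entry per hypothesis and sum to m")
    return [p <= w * alpha / m for p, w in zip(pvals, weights)]


# Hypothetical example: the second hypothesis is considered most interesting.
pvals = [0.01, 0.02, 0.03]
print(weighted_bonferroni(pvals, [1.0, 1.0, 1.0]))  # [True, False, False]
print(weighted_bonferroni(pvals, [0.5, 2.0, 0.5]))  # [False, True, False]
```

With equal weights the procedure reduces to the ordinary Bonferroni correction.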
In the pattern discovery context, one natural principle is to base the weighting on the complexity of patterns and favour simple patterns. This approach is used in the method of layered critical values (Webb, 2008), where simpler patterns are tested with looser thresholds and the strictest thresholds are reserved for the most complex patterns. The motivation is that simpler patterns tend to have higher proportions of significant patterns and can be expected to be more interesting. In addition, this weighting strategy supports efficient search, because it helps to prune deeper levels of the search space.
When dependency patterns are searched, the complexity can be characterized by the number of attributes in the pattern, which is the same as the level of the search space. Webb (2008) has suggested an allocation strategy where all patterns at level $L$ are tested with threshold $\hat{\alpha}_L$ such that $\sum_{L=1}^{L_{max}} \hat{\alpha}_L \cdot S_L \le \alpha$, where $L_{max}$ is the maximum level and $S_L$ is the number of all possible patterns at level $L$. One such allocation is to set $\hat{\alpha}_L = \alpha / (L_{max} \cdot S_L)$. The method was originally proposed for the breadth-first search of classification rules, but it can be applied to other pattern types and depth-first search as well. The only critical requirement is that the bias towards simple patterns fits the research problem. In a pathological case, the method may miss the most significant patterns if they are too complex. However, the same patterns might also remain undetected with a weaker but more balanced testing procedure.
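As an illustration of layered critical values, the sketch below splits $\alpha$ equally across levels, so that each level $L$ receives the threshold $\alpha / (L_{max} \cdot S_L)$, and compares the result with a plain Bonferroni correction over the whole search space. The function name is hypothetical, and the level sizes $S_L = \binom{k}{L}$ assume that patterns are attribute sets over $k$ attributes; other pattern types have different $S_L$.

```python
from math import comb, isclose
from typing import Dict, Sequence


def layered_critical_values(alpha: float, level_sizes: Sequence[int]) -> Dict[int, float]:
    """Equal share of alpha per level, divided among the S_L patterns at that level."""
    l_max = len(level_sizes)
    return {level: alpha / (l_max * size)
            for level, size in enumerate(level_sizes, start=1)}


# Assumed search space: sets of up to 4 attributes out of k = 20.
k, l_max, alpha = 20, 4, 0.05
sizes = [comb(k, level) for level in range(1, l_max + 1)]
thresholds = layered_critical_values(alpha, sizes)
bonferroni = alpha / sum(sizes)

for level, size in enumerate(sizes, start=1):
    print(f"level {level}: S_L = {size:5d}, threshold = {thresholds[level]:.3g}")
print(f"plain Bonferroni threshold = {bonferroni:.3g}")

# The allocation spends exactly alpha in total.
assert isclose(sum(thresholds[level] * size
                   for level, size in enumerate(sizes, start=1)), alpha)
```

In this setting the level-1 threshold is considerably looser than the plain Bonferroni threshold, while the level-4 threshold is stricter, which reflects the intended bias towards simple patterns.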

Conclusions
Pattern discovery is a fundamental form of exploratory data analysis. In this tutorial, we have covered the key theory and techniques for finding statistically significant dependency patterns that are likely to represent true dependencies in the underlying population. We have concentrated on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. Techniques for finding true statistical dependencies are based on statistical tests of different types of independence. The general idea is to evaluate how likely it is that the observed or a stronger dependency pattern would have occurred in the given sample data if the independence assumption had been true. If this probability is too large, the pattern can be discarded as having a high risk of being spurious. In this tutorial we have presented the core relevant statistical theory and specific statistical tests for different notions of dependence under various assumptions on the underlying sampling model. However, in many applications it is desirable to apply stronger filters to the discovered patterns than a simple test for dependence. Statistically significant dependency rules and sets can be generated by adding unrelated or even negatively associated elements to existing patterns. Unless further tests are also satisfied, such as tests for productivity and significant improvement, the discovered rules and sets are likely to be dominated by many superfluous or redundant patterns. Fortunately, statistical significance testing can also be employed to control the risk of 'discovering' these and other forms of superfluous patterns. We have also surveyed the key techniques of this kind.
The final major issue that we have covered is that of multiple testing. Each statistical hypothesis test controls the risk that its null hypothesis would be rejected if that hypothesis were true. However, typical pattern discovery tasks explore exceptionally large numbers of potential hypotheses, and even if the risks for each of the individual hypotheses are extremely small, they can accumulate until the cumulative risk of false discoveries approaches near certainty. We have also surveyed multiple testing methods that can control this cumulative risk.
It is important to remember that statistical significance testing controls only the risk of false discoveries (type I error). It does not control the risk of type II error, that is, of failing to discover a pattern. When sample sizes are reasonably large, it is reasonable to expect that statistically sound pattern discovery techniques will find all real strong patterns in the data and will not find spurious weak patterns. However, it is important to recognize that in some circumstances it will be more appropriate to explore alternative techniques that trade off the risks of type I and type II error.
Statistically sound pattern discovery has brought the field of pattern mining to a new level of maturity, providing powerful and robust methods for finding useful sets of key patterns from sample data. We hope that this tutorial will help bring the power of these techniques to a wider group of users.

Appendix A: Auxiliary results
Lemma 1 $p_{mul}(X \to A \mid n, p_X p_A) \le \sum_{i=fr(XA)}^{n} \binom{n}{i} (p_X p_A)^i (1 - p_X p_A)^{n-i}$.
Proof In the multinomial model, the exact p-value of rule $X \to A$ is [...], where $Q_1 = \frac{n N_{XA}}{\gamma N_X}$ (guaranteeing that $\Gamma \ge \gamma = \gamma(X, A)$). Each term can be expressed in the form $\binom{n}{N_{XA}} (p_X p_A)^{N_{XA}} B(N_{XA}, N_X, N_A)$, where [...]. For the exact p-value, we should sum $B(N_{XA}, N_X, N_A)$ over all values $N_X = N_{XA}, \ldots, n$ and $N_A = N_{XA}, \ldots, Q_1$. We get an upper bound for the p-value if we instead sum $N_A$ over the whole range as well (i.e., ignoring the $\Gamma \ge \gamma$ condition): $N_A = N_{XA}, \ldots, n$. Let $C(N_X, N_A \mid N_{XA})$ be the resulting coefficient: [...]. Using the multinomial theorem [...]
i) The assumption was that $P(\neg A \mid \neg(XQ)) = \frac{P(\neg X \neg A) + P(X \neg Q \neg A)}{P(\neg X) + P(X \neg Q)} > \frac{P(\neg X \neg A)}{P(\neg X)} = P(\neg A \mid \neg X)$.