Discovering dependencies with reliable mutual information

We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a way that is model-agnostic and interpretable, as well as reliable against sample-size and dimensionality biases? How can we efficiently discover the exact or α-approximate top-k dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties, that the bounding functions lead to effective exact and greedy search algorithms, and, combining the two, qualitative experiments show that the framework indeed discovers highly informative dependencies.


Introduction
Given data D from a joint distribution p(I, Y) over input variables I = {X_1, …, X_d} and a target of interest Y, it is a fundamental problem in knowledge discovery to find subsets X ⊆ I that jointly influence or (approximately) determine Y. These dependencies are essential for a variety of applications. In scientific domains, for example, the analysis often involves identifying compact sets of descriptors that capture the underlying process of the various phenomena under investigation [1,2]. The task of functional dependency discovery can be formulated as finding the top-k attribute subsets X*_1, …, X*_k maximizing Q(X; Y) (Eq. 1), where Q is some function quantifying the functional dependency of Y on X.
For an effective knowledge discovery procedure, Q should be able to identify any type of dependency, e.g., nonlinear or multivariate, without a priori assumptions on the underlying data generating process p [3]. Moreover, solutions to Eq. (1), besides being efficient, should be exact or come with approximation guarantees. These guarantees can, in particular, verify the absence of meaningful dependencies for Y and prompt the acquisition of new and potentially more relevant features [2]. These two requirements differentiate our task from similar and well-known applications such as key discovery for data management [4], feature selection in machine learning [5], and Markov blanket discovery in Bayesian networks [6]. Key discovery operates under a closed-world assumption, and hence the discovered keys will not generalize to unseen data drawn from the same distribution p. Unlike feature selection, where the end user is a machine learning algorithm, we are interested in providing the analyst with sparse but exact solutions where all interactions are accounted for, rather than high-dimensional, greedy solutions for pairwise associations (see [7] for a survey on scores Q designed for feature selection). Lastly, Markov blanket algorithms often operate under the assumption that p is faithfully represented by a DAG, implying a unique Markov blanket. Additionally, high-order dependencies (e.g., Y = X ⊕ Z) are neglected due to the greedy search employed. The variant we consider imposes neither a DAG structure for p nor faithfulness, and is therefore better suited for exploratory analysis and high-order dependencies.
Given categorical data D, the ideal choice for Q is the information-theoretic measure fraction of information [9][10][11], defined as F(X; Y) = (H(Y) − H(Y | X)) / H(Y), where H(Y) = −Σ_{y∈Y} p(y) log p(y) denotes the Shannon entropy and H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x) the conditional Shannon entropy. The numerator is the mutual information I(X; Y) = H(Y) − H(Y | X). The entropy measures the uncertainty about Y, while the conditional entropy measures the uncertainty about Y after observing X. The fraction of information then represents the proportional reduction in uncertainty about Y by knowing X. Moreover, the extreme values F(X; Y) = 1 and F(X; Y) = 0 correspond to functional dependency and statistical independence, respectively.
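For illustration (using small hypothetical joint distributions, not data from the paper), the fraction of information can be computed directly from a joint pmf:

```python
import math

def fraction_of_information(joint):
    """F(X;Y) = (H(Y) - H(Y|X)) / H(Y) for a joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    h_y = -sum(p * math.log2(p) for p in py.values() if p > 0)
    # H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
    h_y_given_x = -sum(p * math.log2(p / px[x])
                       for (x, y), p in joint.items() if p > 0)
    return (h_y - h_y_given_x) / h_y

# Y a function of X: F = 1; X and Y independent: F = 0
functional = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

The two example distributions hit the extreme values mentioned above: the functional one yields F = 1 and the independent one F = 0.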
Estimating the mutual information I(X; Y) naively with empirical probabilities, however, can lead to an overestimation of the true dependency between X and Y, a behavior known as dependency-by-chance [12]. While asymptotically efficient [13], the empirical estimator Î(X; Y) has a bias that increases with the domain sizes of the variables [14], and hence is unsuited for dependency discovery, where we have to soundly compare variable sets of varying dimensionality and consequently of widely varying domain sizes. It is even possible that a dependency is indicated when X and Y are actually independent in p (see Fig. 1 for an illustration); as a consequence, the complete set of input variables I is a trivial and uninformative maximizer for Eq. (1). To the best of our knowledge, there are no exact algorithms that efficiently solve Eq. (1) incorporating more refined mutual information estimators.
To obtain a statistically reliable estimator for high-dimensional mutual information, we propose a correction to the plug-in Î by subtracting its expected value E_0[Î(X; Y)] under a suitable null hypothesis model. We choose the nonparametric permutation model [15, p. 214], under which the expected value is computed as the average Î(X; Y_σ) over all sample permutations σ. The resulting estimator Î_0(X; Y) = Î(X; Y) − E_0[Î(X; Y)], which we term the reliable mutual information, not only accounts for dependency-by-chance but is also efficiently computable. For the discovery part, we show that maximizing Eq. (1) with Î_0 is NP-hard. To enable efficient exact, approximate, and heuristic algorithms, we derive two bounding functions for Î_0 that can be used with branch-and-bound and greedy search to heavily prune the search space.
In this article we build upon and extend our recent work published as Mandros et al. [16,17]. In the first paper, we introduced the general problem, a corrected estimator, and a bounding function that allows branch-and-bound search for the strongest dependencies. In the second paper, we focused on the discovery aspect, proved NP-hardness, proposed a tighter bounding function, and investigated better algorithms for both exact and heuristic search. In this paper, we present these results in a unified way that allows for more detail and insight into both the problem and the proposed solutions. In particular, we provide a more comprehensive derivation of our estimator and make the link to nonparametric permutation tests explicit. We show that in our setting mutual information is not submodular, which excludes approximation guarantees for greedy optimization of this function. Last, we extend the evaluation with comparisons against a corrected estimator using a parametric null model, and with bias, variance, and precision/recall experiments.
Our overall contributions are the following. We derive a consistent and reliable estimator for mutual information that corrects the positive bias, and accompany the estimator with a set of useful properties that can be used for optimization (Sect. 2). We then study the algorithmic aspects, show that maximizing the reliable estimator is NP-hard (Sect. 3), and derive two effective bounding functions that can be used by algorithms to prune the search space (Sect. 4). The first function is applicable to any estimator having a monotonically increasing (w.r.t. the superset relation) correction term, while the second is specifically tailored to our proposed reliable estimator and has unbounded pruning potential over the first. We propose an admissible branch-and-bound algorithm to discover the α-approximate top dependencies for a desired approximation guarantee α ∈ (0, 1], and in addition, a fast greedy algorithm (Sect. 5). Last, we perform an extensive evaluation of the estimator, the pruning functions, and the resulting discovery framework (Sect. 6). The experiments demonstrate excellent performance for the greedy algorithm combined with Î_0 (a non-monotone, non-submodular set function), with optimal or nearly optimal results in all 35 real-world datasets under investigation.

Reliable mutual information and properties
In this section we derive our estimator for mutual information, as well as properties to be used for optimization. We start with preliminaries and notation.
Let us denote by [n] the set of positive integers up to n. The symbols log and ln refer to the logarithms of base 2 and e, respectively. We assume that a set of discrete random variables I = {X_1, …, X_d} and a target Y are given, along with an empirical sample D_n = {d_1, …, d_n} of their joint distribution p. For a variable X we denote its domain, whose elements are called categories (or distinct values), by V(X), but we also write x ∈ X instead of x ∈ V(X) whenever clear from the context. We identify a random variable X with the labeling X : [n] → V(X) it induces on the data sample, i.e., X(i) = d_i(X). Moreover, for a set S = {S_1, …, S_l} of labelings over [n], we define the corresponding vector-valued labeling by S(i) = (S_1(i), …, S_l(i)). With X_Q for a subset Q ⊆ [n], we denote the map X restricted to domain Q.
We define c_X : V(X) → Z_+ to be the empirical counts of X, i.e., c_X(x) = |{i ∈ [n] : X(i) = x}|. These counts give rise to the empirical probabilities p̂_X(x) = c_X(x)/n and p̂_{Z|X}(z | x) = c_{X∪Z}(x, z)/c_X(x) for z ∈ Z. However, we use p̂(x) and p̂(z | x), respectively, whenever clear from the context. These empirical probabilities in turn give rise to the empirical entropy Ĥ(Y), the empirical conditional entropy Ĥ(Y | X), the empirical mutual information Î(X; Y) = Ĥ(Y) − Ĥ(Y | X), and the empirical fraction of information F̂(X; Y) = Î(X; Y)/Ĥ(Y). These estimators are also known as plug-in estimators, because they arise from simply "plugging in" the empirical distribution p̂ instead of p.
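As a minimal sketch (the function names are ours, not the authors' implementation), the plug-in quantities can be computed directly from empirical counts:

```python
import math
from collections import Counter

def plugin_entropy(labels):
    """Empirical Shannon entropy H_hat in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def plugin_mi(xs, ys):
    """Empirical mutual information I_hat(X;Y) = H_hat(Y) - H_hat(Y|X)."""
    n = len(xs)
    cx, cxy = Counter(xs), Counter(zip(xs, ys))
    # H_hat(Y|X) = -sum_{x,y} (c_xy / n) log2(c_xy / c_x)
    h_y_given_x = -sum(c / n * math.log2(c / cx[x]) for (x, y), c in cxy.items())
    return plugin_entropy(ys) - h_y_given_x

def plugin_f(xs, ys):
    """Empirical fraction of information F_hat(X;Y)."""
    return plugin_mi(xs, ys) / plugin_entropy(ys)
```

A set of variables is handled via its vector-valued labeling, e.g., `plugin_f(list(zip(x1, x2)), y)` joins two columns into one labeling of tuples.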

Reliable mutual information
Intuitively, the reason why Î is unreliable as an estimator for I is that it does not take into account the confidence in the empirical estimates Ĥ(Y | X = x) for subsets X ⊆ I. This is particularly pronounced in the extreme case where the empirical count c_X(x) is equal to 1. In this situation c_{X∪Y}(x, y) = 1 for exactly one value y ∈ V(Y), and hence Ĥ(Y | X = x) is trivially equal to 0, independent of the true distribution p. This case is likely to occur for many of the sampled values of X if the data size n is small compared to the observed domain of X, even when I(X; Y) = 0, which coincides with the highest error, because then the true conditional entropy H(Y | X = x) equals H(Y). The tendency of the plug-in estimator Î to overestimate is more formally explained by the bias result of Roulston [14], E[Î(X; Y)] − I(X; Y) ≈ (|V(X ∪ Y)| − |V(X)| − |V(Y)| + 1)/(2n ln 2). We see that the bias is independent of the actual distribution p; it depends solely on the domain sizes and the number of samples n. The bias is high when the X and Y samples jointly produce a large domain compared to their marginal domains and the sample size n, and it is at its highest when X and Y are independent in the underlying distribution p, i.e., when the joint domain size approaches |V(X)| · |V(Y)|. These observations suggest correcting the empirical Î(X; Y) by subtracting its bias under the assumption that X and Y are independent. A nonparametric choice for the null model is the permutation model [15, p. 214], arriving at the correction E_0[Î(X; Y)] = (1/n!) Σ_{σ∈S_n} Î(X; Y_σ), (2) where S_n denotes the symmetric group of [n], i.e., the set of bijections from [n] to [n], and Y_σ denotes the composition of the map Y with the permutation σ ∈ S_n, i.e., Y_σ(·) = Y(σ(·)). Essentially, Eq. (2) is the average empirical mutual information over all possible sample permutations with fixed marginal counts. With this, the reliable mutual information is defined as Î_0(X; Y) = Î(X; Y) − E_0[Î(X; Y)], and the reliable fraction of information as F̂_0(X; Y) = Î_0(X; Y)/Ĥ(Y). The reliable estimator Î_0 controls the number of false positives by being unbiased under the null hypothesis with fixed marginal counts.
In relation to statistical hypothesis testing and permutation tests, here we subtract the expected value of the null distribution instead of computing the exact probability of the tail. Our approach is more flexible, as it does not require a fixed confidence level, but instead adapts to the data and the different dimensionalities encountered during search. Moreover, we will see below that computing the mean is much more efficient than enumerating all possible permuted datasets to obtain the exact tail probability (which is only feasible for small data). Intuitively, Î_0 works in the following way: when it appears in a sample that Î(X; Y) is high for an X ⊆ I with a large domain, many of the permutations will also show high dependency, and hence the correction is large as well. From here on we use Î_0 and F̂_0 interchangeably, since Ĥ(Y) is just a constant normalization, and we abbreviate the correction term E_0[Î(X; Y)] as m_0(X, Y, n) and its normalized version m_0(X, Y, n)/Ĥ(Y) as b_0(X, Y, n). Regarding the evaluation of Eq. (2), a naive approach over the n! possible permutations is computationally infeasible. However, Vinh et al. [18] show that the complexity is dramatically reduced by reformulating it as a function of contingency table cell values and exploiting symmetries. Let the observed domains of X and Y be V(X) = {x_1, …, x_R} and V(Y) = {y_1, …, y_C}, respectively. We define shortcuts for the observed marginal counts a_i = c(X = x_i) and b_j = c(Y = y_j), as well as for the joint counts c_{i,j} = c(X = x_i, Y = y_j). The contingency table c for X and Y is then the complete joint count configuration c = {c_{i,j} : i ∈ [R], j ∈ [C]}. The empirical mutual information for X and Y can then be computed as Î(X; Y) = Σ_{i=1}^R Σ_{j=1}^C (c_{i,j}/n) log( (c_{i,j} · n)/(a_i · b_j) ). Each σ ∈ S_n results in a contingency table c_σ. We denote with T = {c_σ : σ ∈ S_n} the set of all such contingency tables. Crucially, all these tables have the same marginal counts, so that Eq. (2) can be written as m_0(X, Y, n) = Σ_{c∈T} p̂_0(c) Î(c), where p̂_0(c) is the probability of contingency table c ∈ T.
This allows us to reorder the terms to obtain a per-cell contribution to m_0, rather than one per contingency table c ∈ T, i.e., m_0(X, Y, n) = Σ_{i=1}^R Σ_{j=1}^C Σ_k p̂_0(c_{i,j} = k) · (k/n) log( (k · n)/(a_i · b_j) ). Under the permutation model, the empirical counts c^σ_{i,j} are distributed hypergeometrically, i.e., p̂_0(c_{i,j} = k) = C(a_i, k) · C(n − a_i, b_j − k)/C(n, b_j), where C(·, ·) denotes the binomial coefficient. These probabilities can be computed efficiently in an incremental manner using the support of the hypergeometric distribution, i.e., p̂_0(c_{i,j} = k) is nonzero only for k ∈ [max(0, a_i + b_j − n), min(a_i, b_j)], together with the hypergeometric recurrence p̂_0(c_{i,j} = k + 1) = p̂_0(c_{i,j} = k) · (a_i − k)(b_j − k)/((k + 1)(n − a_i − b_j + k + 1)). The complexity for m_0 is then O(n · max{|V(X)|, |V(Y)|}) [19]. Moreover, the computation can be done in parallel for each individual cell. In addition to being computationally efficient, the resulting reliable dependency score F̂_0(X; Y) = F̂(X; Y) − b_0(X, Y, n) satisfies several other properties. First of all, it is indeed a consistent estimator of F. In particular, Vinh et al. [20] show that lim_{n→∞} m_0(X, Y, n) = 0, and together with the consistency of the plug-in F̂ [13], we have that lim_{n→∞} F̂_0(X; Y) = F(X; Y). Moreover, F̂_0(X; Y) remains upper-bounded by 1, although this value is only attainable in the limit n → ∞ (for true functional dependencies). Most importantly, contrary to the naive estimator, F̂_0 approaches zero as the empirical domain V(X) increases relative to the data size n. We show this by proving the monotonicity of m_0 with respect to the subset relation. Theorem 1 Given two sets of variables X, X′ with X ⊆ X′ ⊆ I, it holds that m_0(X, Y, n) ≤ m_0(X′, Y, n), i.e., the expected value under the permutation model is monotonically increasing with respect to the subset relation.
Proof Using the chain rule of mutual information and the fact that mutual information is nonnegative [21, Chapter 2], we have that Î(X; Y) ≤ Î(X′; Y). Then for each σ ∈ S_n it holds that Î(X; Y_σ) ≤ Î(X′; Y_σ), and hence Σ_{σ∈S_n} Î(X; Y_σ) ≤ Σ_{σ∈S_n} Î(X′; Y_σ), which concludes the proof.
Theorem 1 states that m_0(X, Y, n) can indeed penalize spurious dependencies that appear with high-dimensional X ⊆ I, justifying the adjective reliable for the two estimators. In the following section, we couple the above information-theoretic quantities with relations between empirical attributes.
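The cell-wise evaluation of m_0 described above fits in a few lines. The following is a minimal sketch (function names are ours, not the authors' implementation); as a sanity check, for a tiny sample the same value can be obtained by brute force, averaging the plug-in mutual information over all n! permutations:

```python
import math
from collections import Counter
from itertools import permutations

def plugin_mi(xs, ys):
    # empirical (plug-in) mutual information in bits
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def m0(xs, ys):
    """E_0[I_hat(X; Y_sigma)]: summed cell by cell over the hypergeometric
    support instead of averaging over all n! permutations."""
    n = len(xs)
    total = 0.0
    for a in Counter(xs).values():
        for b in Counter(ys).values():
            # k = 0 contributes nothing, so the loop starts at max(1, ...)
            for k in range(max(1, a + b - n), min(a, b) + 1):
                p = (math.comb(a, k) * math.comb(n - a, b - k)
                     / math.comb(n, b))
                total += p * (k / n) * math.log2(k * n / (a * b))
    return total

def reliable_mi(xs, ys):
    return plugin_mi(xs, ys) - m0(xs, ys)

# brute-force check on a sample of size 6 (6! = 720 permutations)
xs, ys = [0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 1, 1]
brute = sum(plugin_mi(xs, [ys[i] for i in s])
            for s in permutations(range(6))) / math.factorial(6)
```

The brute-force average and the cell-wise formula agree, and the correction is strictly positive here, so `reliable_mi` is smaller than the plug-in estimate.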

Specializations and labeling homomorphisms
Since we identify sets of random variables with their corresponding sample-index-to-value map, they are subject to the following general relations of maps with common domains.

Definition 1 Let A and B be maps defined on a common domain N. We say that B is a specialization of A, denoted A ⪯ B, if for all i, j ∈ N, B(i) = B(j) implies A(i) = A(j). We call A and B equivalent, denoted A ≡ B, if A ⪯ B and B ⪯ A.
A special case of specialization is given by the subset relation of variable sets: if X ⊆ X′ ⊆ I, then X ⪯ X′. The specialization relation implies some important properties for empirical probabilities and information-theoretic quantities.

Proposition 1
Given variables X, Z, Y with X ⪯ Z, the following statements hold: (a) the empirical distribution p̂_{X∪Y} is determined by p̂_{Z∪Y}, since every value of Z fixes a unique value of X; (b) Ĥ(X) ≤ Ĥ(Z); (c) Ĥ(Y | Z) ≤ Ĥ(Y | X); (d) Î(X; Y) ≤ Î(Z; Y).

Proof Let us denote with p and q the p̂_{X∪Y} and p̂_{Z∪Y} distributions, respectively. Statement (a) follows from the definition. (b) The statement follows from the definition of Ĥ: writing x_z for the value of X determined by z, we have Ĥ(Z) = −Σ_z q(z) log q(z) ≥ −Σ_z q(z) log p(x_z) = −Σ_x p(x) log p(x) = Ĥ(X), where the inequality follows from the monotonicity of the log function (and the fact that q(z) ≤ p(x_z) for all z ∈ Z). (c) Let us first recall the log-sum inequality [21, p. 31]: for nonnegative numbers a_1, …, a_n and b_1, …, b_n, it holds that Σ_i a_i log(a_i/b_i) ≥ (Σ_i a_i) log(Σ_i a_i / Σ_i b_i), with equality if and only if a_i/b_i is constant. Applying it, per value of Y, to the aggregation of q into p yields Ĥ(Y | Z) ≤ Ĥ(Y | X). Statement (d) then follows from (c) together with Î(·; Y) = Ĥ(Y) − Ĥ(Y | ·). To analyze the monotonicity properties of the permutation model, the following additional definition will be useful.

Definition 2
We call a labeling X homomorphic to a labeling Z (w.r.t. the target variable Y), denoted X ⪯_Y Z, if there exists a permutation σ ∈ S_n such that Y ≡ Y_σ and X ⪯ Z_σ. See Table 1 for examples of both introduced relations. Importantly, the inequality of mutual information for specializations (Proposition 1d) carries over to homomorphic variables and in turn to their correction terms.

Proposition 2
Given variables X, Z, Y with X ⪯_Y Z, the following statements hold: (a) Î(X; Y) ≤ Î(Z; Y); (b) m_0(X, Y, n) ≤ m_0(Z, Y, n).

Proof Let σ* ∈ S_n be a permutation for which Y ≡ Y_σ* and X ⪯ Z_σ*. Property (a) follows from Î(X; Y) = Î(X; Y_σ*) ≤ Î(Z_σ*; Y_σ*) = Î(Z; Y), where the inequality holds by Proposition 1d). For (b), note that for every σ ∈ S_n, it holds from Proposition 1d) that Î(X; Y_σ) ≤ Î(Z_σ*; Y_σ) = Î(Z; Y_{σ∘(σ*)⁻¹}); since σ ↦ σ∘(σ*)⁻¹ is a bijection of S_n, averaging over all σ yields the claim.

Hardness of optimization
In this section, we prove the NP-hardness of maximizing F̂_0 (and hence Î_0) by providing a reduction from the well-known NP-hard minimum set cover problem: given a finite universe U = {u_1, …, u_n} and a collection of subsets B = {B_1, …, B_m} whose union is U, find a set cover of minimum cardinality, i.e., a subcollection C ⊆ B with ∪_{B∈C} B = U that minimizes |C|. The reduction consists of two parts. First, we construct a base transformation τ_1(U, B) = D_l that maps a set cover instance to a dataset D_l such that the plug-in F̂ is monotonically increasing with coverage; in particular, set covers correspond to attribute sets with an empirical fraction of information score F̂ of 1 and correction terms b_0 that are a monotonically increasing function of their cardinality. In a second step, we calibrate the b_0 terms such that all candidate set covers have a higher F̂_0 value than partial set covers. The latter is achieved by copying the dataset D_l a suitable number of times k such that the correction terms are sufficiently small but the overall transformation, denoted τ_k(U, B) = D_kl, is still of polynomial size. Combining these, we arrive at a polynomial-time reduction where maximizing F̂_0 in D_kl corresponds to finding a minimum set cover for the instance (U, B).
The base transformation τ_1(U, B) = D_l is defined as follows. The dataset D_l contains m descriptive attributes I = {X_1, …, X_m} corresponding to the sets of the set cover instance, and a target variable Y. The sample size is l = 2n + m + 1, with a logical partition of the sample into the three regions S_1 = [1, n], S_2 = [n + 1, 2n], and S_3 = [2n + 1, l]. The target attribute Y assigns to each data point one of three values corresponding to these three regions, and each descriptive attribute X_i assigns up to n + 3 distinct values depending on the set of universe elements covered by B_i, i.e., X_i : [l] → {1, 2, …, n, a, b, c}, with a case distinction over the three regions. See Fig. 2 for an illustration. In a nutshell, the base transformation establishes a one-to-one correspondence between subcollections C ⊆ B and variable sets X ⊆ I, which we denote with I(C). We note the following two remarks. Let us use a as a shortcut for (a, …, a), and ∪C for ∪_{B∈C} B. First, the regions S_1 and S_2 couple the number of uncovered elements |U \ ∪C| to the conditional entropy Ĥ(Y | I(C)). In addition, region S_3 links the size of C to the number of distinct values of I(C) on S_3. We now establish three central properties of the base transformation (Lemma 1). To show (b), since F̂(I(C); Y) and thus Î(I(C); Y) are monotone in |∪C|, it is sufficient to consider the case where |U \ ∪C| = 1, i.e., only one element u ∈ U is uncovered; a direct computation of Î(I(C); Y) then gives the required bound. For (c), observe that for a variable set X = I(C) corresponding to a set cover C, the values X(i) for i ∈ S_1 are pairwise distinct. Thus, X_{S_1} ≡ X′_{S_1} for a variable set X′ = I(C′) corresponding to a set cover C′. Moreover, we trivially have X_{S_2} ≡ X′_{S_2}. Finally, let Q, Q′ ⊆ S_3 denote the indices in S_3 where X and X′, respectively, take values different from (c, …, c). Note that all values on these index sets are unique. Furthermore, if |C| ≤ |C′| then |Q| ≤ |Q′| and in turn |Q \ Q′| ≤ |Q′ \ Q|.
This means we can find a permutation σ ∈ S_n witnessing that X is homomorphic to X′ (w.r.t. Y). Now, although set covers C ⊆ B correspond to variable sets I(C) with the maximal empirical fraction of information value of 1, due to the correction term it can happen that F̂_0(I(C′); Y) ≥ F̂_0(I(C); Y) for a variable set I(C′) corresponding to a partial set cover C′. To prevent this, we make use of the following upper bound (Proposition 3) on the expected mutual information under the permutation model.
Proposition 3 implies that we can arbitrarily shrink the correction terms if we increase the sample size but leave the number of distinct values constant. Thus, we define the extended transformation τ_i(U, B) = D_il by simply concatenating i copies of D_l. With this definition, we proceed with the NP-hardness result.
Theorem 2 Given a sample of the joint distribution of variables I and Y , the problem of maximizingF 0 ( · ; Y ) over all possible subsets X ⊆ I is NP-hard.
Proof First, let us assume that there exists a number k ∈ O(l) such that, w.r.t. transformation τ_k, all set covers C ⊆ B and their corresponding variable sets X = I(C) have correction terms m_0(X, Y, kl) < 2/l. Since all properties of Lemma 1 transfer from τ_1 to τ_k, this implies that for all variable sets X′ = I(C′) corresponding to partial set covers C′ ⊆ B, it holds that F̂_0(X; Y) > F̂_0(X′; Y), where the strict inequality follows from Lemma 1(a) and 1(b). Thus, all X corresponding to set covers have larger F̂_0 than partial set covers. Moreover, the maximizer C must be a minimum set cover, as required, because for a smaller set cover C′ we would have I(C′) ⪯_Y I(C) and hence, by Proposition 2, a smaller correction term and a larger F̂_0 value. Now, to find the number k that defines the final transformation τ_k, let D_il = τ_i(U, B) and let C be a set cover of (U, B). Since X = I(C) has at most l distinct values in D_il and Y exactly 3, Proposition 3 and the monotonicity of ln yield a suitable k ∈ O(l) with m_0(X, Y, kl) < 2/l, which is polynomial in the size of the set cover instance.

Admissible bounding functions for effective search algorithms
The NP-hardness established in the previous section excludes the existence of a polynomial time algorithm for maximizing the reliable fraction of information (unless P=NP), leaving therefore exact but exponential search and heuristics as the two options. For both, and particularly the former, reducing the search space can lead to more effective algorithms. For this purpose, we derive in this section bounding functions (also called optimistic estimators) for the reliable fraction of informationF 0 to be used for pruning.
Recall that an admissible bounding function f̂ is an upper bound to the optimization function value f over all supersets of a candidate solution X ⊆ I. The value f̂(X) is called the potential of node X, and it must hold that f̂(X) ≥ f(X′) for all X′ with X ⊆ X′ ⊆ I. With this property, all supersets X′ of X can be pruned if f̂(X) ≤ f(X*), where X* is the best candidate solution found during search. Therefore, for optimal pruning, the bounding function has to be as tight as possible. At the same time, it needs to be efficiently computable. For example, while the ideal bounding function for the reliable fraction of information would be f̂_opt(X) = max{F̂_0(X′; Y) : X ⊆ X′ ⊆ I} (Eq. 4), solving Eq. (4) is equivalent to the original problem and hence NP-hard. A first attempt at an efficient bounding function involves the upper bound of the fraction of information (i.e., F ≤ 1) and the monotonicity of the b_0 term with respect to the subset relation (Theorem 1). In particular, for all X ⊆ X′ ⊆ I, it follows that F̂_0(X′; Y) = F̂(X′; Y) − b_0(X′, Y, n) ≤ 1 − b_0(X, Y, n). Hence, we define f̂_mon(X) = 1 − b_0(X, Y, n) to be the monotonicity-based admissible bounding function. This optimistic estimator is both inexpensive and applicable to any estimator that has a monotonically increasing correction term. However, it is potentially loose, as it assumes that full information about the target can be attained without the "penalty" of an increased b_0 term. An alternative idea, leading to a more principled admissible bounding function, is to relax the maximum over all supersets to the maximum over all specializations of X. We define the specialization-based bounding function f̂_spc(X) through f̂_spc(X) = max{F̂_0(Z; Y) : X ⪯ Z} (Eq. 6). While Eq. (6) constitutes an admissible bounding function, it is unclear how it can be efficiently evaluated. To do so, let us denote by R⁺ the operation of joining a labeling R with the target attribute Y, i.e., R⁺ = {R, Y} (see Table 2 for an example). This definition gives rise to a simple constructive form for computing f̂_spc.

Theorem 3 The function f̂_spc can be efficiently computed as f̂_spc(X) = F̂_0(X⁺; Y).
Proof We start by showing that the (·)⁺ operation causes a positive gain in F̂_0, i.e., for an arbitrary labeling R it holds that F̂_0(R⁺; Y) ≥ F̂_0(R; Y). It is sufficient to show that Î(R⁺; Y) − Î(R; Y) ≥ m_0(R⁺, Y, n) − m_0(R, Y, n). To conclude, let Z be an arbitrary specialization of X. We have, by definition of Z⁺ and the positive gain, F̂_0(Z; Y) ≤ F̂_0(Z⁺; Y) ≤ F̂_0(X⁺; Y), as required. Here, the first inequality is the positive gain of Z⁺ over Z, and the second follows since both X⁺ and Z⁺ fully determine Y and X⁺ ⪯ Z⁺ implies m_0(X⁺, Y, n) ≤ m_0(Z⁺, Y, n).
In a nutshell, the (·)⁺ operation can only increase the F̂_0 value, and X⁺ constitutes the most efficient specialization of X in terms of growth in F̂ and b_0 (which is not necessarily attainable by a subset of the input variables). Note that X⁺ need not be computed explicitly, since it is obtained from the nonzero cell counts of the joint contingency table for X and Y (which has to be computed for F̂_0(X; Y) anyway). The following proposition shows that this idea indeed leads to a superior bound compared to f̂_mon. Proposition 4 Let X ⊆ I and Δ = f̂_spc(X) − f̂_mon(X) ≤ 0, i.e., with Δ denoting the gap f̂_mon(X) − f̂_spc(X), the following statements hold: (a) Δ ≥ 0; (b) Δ can come arbitrarily close to 1. Proof For (a), since F̂(X⁺; Y) = 1, we have Δ = b_0(X⁺, Y, n) − b_0(X, Y, n) ≥ 0, where the inequality holds from Proposition 2(b) and X ⪯ X⁺.
For (b), consider labelings X and Y over a sample of size n = 2l, with X alternating between the two values a and b, and Y(i) = ⌈i/2⌉, respectively (see Table 2). One can show that the minimum of the last step is attained by the permutation σ* ∈ S_n that corresponds to sorting the a and b values of X (see Table 2). For this permutation, the normalized conditional entropy evaluates to 1 − 1/log(2l), as required.
Thus, we have established that f̂_spc is tighter than f̂_mon, and even that the difference can come arbitrarily close to 1. Put differently, their ratio, and thus the potential for additional pruning, is unbounded.
Regarding their applicability to other mutual information estimators, f̂_mon only requires monotonicity of the correction term, while f̂_spc additionally needs a positive gain w.r.t. the (·)⁺ operation. The former is easier to satisfy. Computationally, f̂_spc(X) is more expensive than f̂_mon(X) by a factor of |V(Y)|. In practice, one can combine both optimistic estimators in a chain-like manner: first check the pruning condition w.r.t. f̂_mon, and only compute f̂_spc if that first check fails. That is, whenever f̂_mon(X) is sufficient to prune a candidate X, we can still do so with the same computational complexity. However, the additional evaluation of f̂_spc(X) can be a disadvantage in case it still does not allow pruning. This trade-off is evaluated in Sect. 6.3.
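Both bounds reduce to simple expressions: f̂_mon(X) = 1 − b_0(X, Y, n), and, via the constructive form indicated by Theorem 3, f̂_spc(X) = F̂_0(X⁺; Y) for the joined labeling X⁺ = {X, Y}. The following minimal sketch (all names are ours) spot-checks both tightness and admissibility on toy data:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def plugin_mi(xs, ys):
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def m0(xs, ys):
    # expected plug-in MI under the permutation model (hypergeometric cells)
    n = len(xs)
    total = 0.0
    for a in Counter(xs).values():
        for b in Counter(ys).values():
            for k in range(max(1, a + b - n), min(a, b) + 1):
                p = math.comb(a, k) * math.comb(n - a, b - k) / math.comb(n, b)
                total += p * (k / n) * math.log2(k * n / (a * b))
    return total

def F0(xs, ys):  # reliable fraction of information
    return (plugin_mi(xs, ys) - m0(xs, ys)) / entropy(ys)

def f_mon(xs, ys):  # monotonicity-based bound: 1 - b0(X, Y, n)
    return 1.0 - m0(xs, ys) / entropy(ys)

def f_spc(xs, ys):  # specialization-based bound: F0 of X+ = {X, Y}
    return F0(list(zip(xs, ys)), ys)

# toy data: three binary columns, y determined by the first two
c1 = [0, 0, 0, 0, 1, 1, 1, 1] * 3
c2 = [0, 0, 1, 1, 0, 0, 1, 1] * 3
c3 = [0, 1, 0, 1, 0, 1, 0, 1] * 3
y = [a ^ b for a, b in zip(c1, c2)]

x1 = list(zip(c1))  # candidate X = {C1} as a vector-valued labeling
bound_mon, bound_spc = f_mon(x1, y), f_spc(x1, y)
supersets = [list(zip(c1, c2)), list(zip(c1, c3)), list(zip(c1, c2, c3))]
```

On this data the specialization-based bound is at most the monotonicity-based one, and both dominate the F̂_0 value of every superset of {C1}, as admissibility requires.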

Algorithms
In this section we provide two search algorithms, one exponential and one heuristic, for maximizing the reliable fraction of information. Both make use of the bounding functions proposed. For simplicity, we solve the top-1 problem, but both algorithms can be trivially extended to a top-k formulation.

Exponential search
Branch-and-bound, as the name suggests, consists of two main ingredients: a strategy to explore the search space and a bound for the optimization function at hand (see, e.g., [23, Chap. 12.4]). Besides being very effective in practice for hard problems, this style of optimization also provides the option of relaxing the required result guarantee to that of an α-approximation for an accuracy parameter α ∈ (0, 1]. Hence, using α-values less than 1 allows trading accuracy for computation time in a principled manner. Here, we consider optimized pruning for unordered search (OPUS), an advanced variant of branch-and-bound that effectively propagates pruning information to siblings in the search tree [24]. Algorithm 1 shows the details of this approach.
In addition to keeping track of the best solution X* explored so far, the algorithm maintains a priority queue Q of pairs (X, Z), where X ⊆ I is a candidate solution and Z ⊆ I constitutes the variables that can still be used to refine X, e.g., X′ = X ∪ {Z} for a Z ∈ Z. The top element is the one with the smallest cardinality and the highest f̂ value (a combination of breadth-first and best-first order). Starting with Q = {(∅, I)}, X* = ∅, and a desired approximation guarantee α ∈ (0, 1], in every iteration OPUS creates all refinements of the top element of Q and updates X* accordingly (lines 5-7). Next, the refinements are pruned using f̂ and α (line 8). Following this, the pruned list is sorted in order of decreasing potential (a "trick" to propagate the most refinement elements to the least promising candidates [24]), the possible refinement elements Z are non-redundantly propagated to the refinements of the top element, and finally the priority queue is updated with the new candidates (lines 9-11).
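A much-simplified sketch of such a search (best-first on the bound only, without OPUS's sibling reordering and propagation tricks; all names are ours) using f̂_mon as the bounding function:

```python
import heapq
import math
from collections import Counter
from itertools import combinations

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def plugin_mi(xs, ys):
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def m0(xs, ys):
    # expected plug-in MI under the permutation model (hypergeometric cells)
    n = len(xs)
    total = 0.0
    for a in Counter(xs).values():
        for b in Counter(ys).values():
            for k in range(max(1, a + b - n), min(a, b) + 1):
                p = math.comb(a, k) * math.comb(n - a, b - k) / math.comb(n, b)
                total += p * (k / n) * math.log2(k * n / (a * b))
    return total

def join(columns, idx):
    return list(zip(*(columns[j] for j in idx)))

def F0(xs, ys):  # reliable fraction of information
    return (plugin_mi(xs, ys) - m0(xs, ys)) / entropy(ys)

def f_mon(xs, ys):  # monotonicity-based admissible bound
    return 1.0 - m0(xs, ys) / entropy(ys)

def top1(columns, y, alpha=1.0):
    best_val, best_set = 0.0, ()
    # queue entries: (-bound, candidate index tuple, remaining refinements)
    heap = [(0.0, (), tuple(range(len(columns))))]
    while heap:
        _, base, rem = heapq.heappop(heap)
        for i, z in enumerate(rem):
            cand = base + (z,)
            xs = join(columns, cand)
            val = F0(xs, y)
            if val > best_val:
                best_val, best_set = val, cand
            # expand cand only if its optimistic estimate beats the incumbent
            if rem[i + 1:] and alpha * f_mon(xs, y) > best_val:
                heapq.heappush(heap, (-f_mon(xs, y), cand, rem[i + 1:]))
    return best_set, best_val

# toy data: Y = C1 XOR C2, C3 pure noise, 4 copies of the truth table
rows = [(a, b, c, a ^ b) for _ in range(4) for a in (0, 1)
        for b in (0, 1) for c in (0, 1)]
c1, c2, c3, y = (list(col) for col in zip(*rows))
best_set, best_val = top1([c1, c2, c3], y)
brute = max(F0(join([c1, c2, c3], s), y)
            for r in (1, 2, 3) for s in combinations(range(3), r))
```

With α = 1 the pruned search returns the same maximum as brute-force enumeration over all subsets, here the pair {C1, C2}.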

Heuristic search
A commonly used alternative to exponential search for optimizing dependency measures is the standard greedy algorithm (see [5,7]). This algorithm only refines the best candidate in a given iteration. Moreover, bounding functions can be incorporated as an early termination criterion. For the reliable fraction of information in particular, there is potential to prune many of the higher levels of the search space. The algorithm is presented in Algorithm 2.
The algorithm keeps track of the best solution X* explored, as well as the best candidate for refinement C*. Starting with X* = ∅ and C* = ∅, the algorithm in each iteration (i.e., search space level) checks whether C* can be refined further, i.e., whether I \ C* is nonempty and C* still has potential (the early termination criterion). If not, the algorithm terminates and returns X* (lines 2-3). Otherwise, C* is refined in all possible ways, and the best refinement is selected as a candidate to update X* (lines 5-7).

Table 3 Example data demonstrating non-submodularity of I, Î, and Î_0 in our supervised scenario where the target Y is fixed (Proposition 5)

Concerning the approximation ratio of the greedy algorithm, there exists a large body of research focused on submodular and/or monotone functions, e.g., [25][26][27]. Recall that for a set I = {X_1, …, X_d}, a function f : 2^I → R is called submodular if for every X ⊆ X′ ⊆ I and X_i ∈ I \ X′ it holds that f(X ∪ {X_i}) − f(X) ≥ f(X′ ∪ {X_i}) − f(X′), i.e., it satisfies the diminishing returns property. The following proposition establishes that I, Î, and Î_0 all violate this property.

Proposition 5 Given a fixed target Y, the set functions I(·; Y), Î(·; Y), and Î_0(·; Y) are not submodular.

Proof We prove it via an intuitive counterexample. Consider the data of Table 3: the gain in Î from adding a variable to the larger set exceeds the gain from adding it to the smaller set, i.e., there is a violation of the diminishing returns property, and hence Î is not submodular. By considering p = p̂, it is straightforward to show that I is also not submodular. Regarding Î_0, a similar violation occurs on the same data, and hence Î_0 is not submodular. Also note that while both Î and I are monotone functions with respect to the subset relation, Î_0 is not.
While approximation results for submodular and/or monotone functions are therefore not applicable to Î₀, we empirically evaluate the quality of the greedy solutions in Sect. 6.3.2.
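Since Table 3 itself is not reproduced here, the violation of diminishing returns for the plug-in Î (and, with p = p̂, for I) can also be checked numerically with a different, classic counterexample: the XOR dependency, where neither input alone is informative but both together determine Y:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Y = X1 XOR X2 over all four equally likely inputs
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
x1 = [a for a, b in rows]
x2 = [b for a, b in rows]
y = [a ^ b for a, b in rows]

gain_from_empty = mutual_information(x2, y)        # I(X2; Y) = 0
joint = mutual_information(list(zip(x1, x2)), y)   # I(X1, X2; Y) = 1
gain_given_x1 = joint - mutual_information(x1, y)  # 1 - 0 = 1
# diminishing returns would require gain_given_x1 <= gain_from_empty
assert gain_given_x1 > gain_from_empty
```

Adding X2 to {X1} gains a full bit, while adding it to the empty set gains nothing, which is exactly the opposite of diminishing returns.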

Evaluation
In this section, we investigate the empirical performance of discovering dependencies with the reliable fraction of information F̂₀: the estimated bias and variance of F̂₀ as an estimator, the consistency of correctly retrieving the top minimal dependency on synthetic data, and the performance of the bounding functions for both branch-and-bound and greedy search. We additionally perform qualitative experiments with two case studies.

Empirical bias and standard deviation
Here, we evaluate the estimated bias and variance of F̂₀ for various degrees of dependency. We do so by creating synthetic data from various models for which we know the true F. Let us denote by P the set of all joint probability mass functions over two random variables X and Y with |V(X)| = |V(Y)| = 3, and by P_[a,b] all such probability mass functions for which the score satisfies F_p(X; Y) ∈ [a, b]. We consider four different dependency score regions: "weak" P_[0,0.25), "low" P_[0.25,0.5), "high" P_[0.5,0.75), and "strong" P_[0.75,1].
Let τ(D_n) be the result of estimator τ computed on data D_n. We denote with b_n(p, τ) and std_n(p, τ) the bias and standard deviation of τ when fixing the underlying pmf to p ∈ P, i.e., b_n(p, τ) = E_{D_n∼p}[τ(D_n)] − F_p(X; Y) and std_n(p, τ) = (E_{D_n∼p}[(τ(D_n) − E[τ(D_n)])²])^{1/2}. We sample uniformly 100 pmfs p^(1), ..., p^(100), 25 from each dependency region. For every p^(i) we calculate the true F_{p^(i)} value and compute the expectation terms by sampling per pmf p^(i) a total of 1000 datasets D_n ∼ p^(i) of size n. We average over the P_[a,b] regions and end up with estimates μ_n(τ, P_[a,b]) and σ_n(τ, P_[a,b]) for the average bias and standard deviation of estimator τ at sample size n.
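The Monte Carlo procedure just described can be sketched generically; the estimator argument below is a placeholder for the τ studied in the text, and the pmf is given as a dictionary from (x, y) outcomes to probabilities:

```python
import random
from statistics import mean, pstdev

def bias_and_std(estimator, pmf, true_score, n, repeats=1000, seed=0):
    """Monte Carlo estimates of b_n(p, tau) and std_n(p, tau).

    Samples `repeats` datasets of size n from the pmf, applies the
    estimator to each, and compares the average against the known
    population score F_p(X; Y).
    """
    rng = random.Random(seed)
    outcomes, weights = zip(*pmf.items())
    scores = [estimator(rng.choices(outcomes, weights, k=n))
              for _ in range(repeats)]
    return mean(scores) - true_score, pstdev(scores)
```

For a pmf with a perfect functional dependency and an estimator that simply measures agreement between x and y, the procedure reports zero bias and zero standard deviation, as expected.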
In addition to the plug-in F̂, we consider two additional estimators. The first is based on the same correction principle but with a parametric model and asymptotic values, in particular the χ² distribution, proposed by Vinh et al. [28]. This corrected estimator, which we denote as F̂_χ,α, is defined as

F̂_χ,α(X; Y) = (Î(X; Y) − χ_{α,l(X,Y)} / (2n)) / Ĥ(Y),

where χ_{α,l(X,Y)} is the critical value corresponding to a significance level 1 − α and degrees of freedom l(X, Y) = (|V(X)| − 1)(|V(Y)| − 1). Here, α can be thought of as a parameter regulating the amount of penalty. The second follows an alternative correction resulting from the application of the quantification adjustment framework proposed by Romano et al. [12]. We denote this estimator by F̂_adj, which is defined as

F̂_adj(X; Y) = (Î(X; Y) − Ê₀[Î(X; Y)]) / (Ĥ(Y) − Ê₀[Î(X; Y)]).

For this experiment we consider τ ∈ {F̂₀, F̂_adj, F̂, F̂_χ,95, F̂_χ,99} and n ∈ {5, 10, 20, 30, 40, 50, 60}. We expect the small sample sizes for the small domain size |V(X)| = 3 to behave similarly to larger data sizes combined with the potentially huge domains V(X) for X ⊆ I occurring during search.
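For concreteness, the plug-in F̂ and a permutation-sampled variant of the correction principle can be sketched as follows. Note this is an illustration, not the paper's implementation: the paper computes the null expectation Ê₀[Î] exactly under the permutation model, whereas this sketch merely approximates it by shuffling Y:

```python
import random
from collections import Counter
from math import log2

def entropy(ys):
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def mi(xs, ys):
    """Plug-in mutual information estimate in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def fraction_of_information(xs, ys):
    """Plug-in estimate F-hat = I-hat(X; Y) / H-hat(Y)."""
    return mi(xs, ys) / entropy(ys)

def corrected_foi(xs, ys, permutations=200, seed=0):
    """Corrected score (I-hat - E0[I-hat]) / H-hat(Y), with the null
    expectation E0[I-hat] approximated by random permutations of Y."""
    rng = random.Random(seed)
    null = 0.0
    for _ in range(permutations):
        perm = list(ys)
        rng.shuffle(perm)
        null += mi(xs, perm)
    return (mi(xs, ys) - null / permutations) / entropy(ys)
```

On a tiny functionally dependent sample, the plug-in score is exactly 1, while the corrected score is strictly smaller, reflecting the penalty for what could have been achieved by chance alone.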
We first focus on the general behavior of the bias and standard deviation for each estimator τ, and plot in Fig. 3 the average bias μ_n(τ, P_[0,1]) and standard deviation σ_n(τ, P_[0,1]) across different data sizes n. We observe that the corrected estimator F̂₀ exchanges the positive bias of F̂ for a smaller, negative bias, and has the tendency to underestimate the true dependency for small n, as desired. Additionally, it converges very fast to 0 with respect to n. The F̂_adj has a very small positive bias, while F̂_χ,α has a large negative bias and slow convergence that become more pronounced for increased α.
Regarding the standard deviation, the right plot shows that F̂_adj has by far the largest, which is to be expected as it also has the smallest bias. The plug-in F̂ also has a large standard deviation, which in combination with the relatively high bias shows that F̂ is not a suitable estimator for functional dependency discovery. The F̂_χ,95, F̂_χ,99, and F̂₀ have similar standard deviations, with F̂₀ being slightly higher for n = 5. In general, estimators achieve better bias by trading variance, and from Fig. 3 we see that in comparison to all other estimators, F̂₀ has the best bias-variance trade-off. It is also interesting to consider the bias behavior not on average over P_[0,1], but specifically for weak and strong dependencies, i.e., the cases where F is closest to independence and functional dependency, respectively; we plot in Fig. 4 the average biases μ_n(τ, P_[0,0.25)) (left) and μ_n(τ, P_[0.75,1]) (right). Looking at the left plot, we see that the reliable fraction of information F̂₀ has a very small negative bias, while F̂ has the largest positive bias and very slow convergence. Both F̂_χ,95 and F̂_χ,99 have a large negative bias, particularly F̂_χ,99, while F̂_adj is practically unbiased. Regarding strong dependencies, the right plot shows that both F̂ and F̂_adj have a small positive bias, while the rest have large negative biases for n = 5. For both F̂_χ,95 and F̂_χ,99 the bias is particularly high and does not converge fast to 0, unlike F̂₀, which does after only n = 10 data samples. From a bias perspective, F̂₀ shows the most reliable behavior, with a small and quickly vanishing negative bias across the whole range of dependencies.
With these observations, we can conclude that F̂₀ is a suitable estimator for F, particularly for exploratory tasks, as it requires neither parameters nor parametric assumptions in order to produce results. The F̂_adj, although practically unbiased, has a very large standard deviation. The F̂_χ,α has the ability to regulate the amount of penalty with α, but that requires tuning: smaller α values cause F̂_χ,α to behave more like F̂ and overestimate dependencies. The reliable F̂₀ does this automatically with the data-dependent quantity Ê₀[Î].

Precision, recall, and F1
Next, we evaluate the performance of F̂₀ in correctly retrieving the minimal top dependency on synthetic data. We create a Bayesian network with input variables I = {X_1, X_2, X_3, X_4} of domain size 3 and a binary target variable Y. The only edge is X_4 → Y, with the corresponding probability tables shown in Fig. 5. Variables X ∈ I are uniformly distributed. The minimal top dependency in this network has score F(X_4; Y) = 0.25, with all other subsets of I, excluding the supersets of {X_4}, having a score of 0. We are interested in the problem of retrieving X* = {X_4} from sampled data as the top dependency. That is, we want the solution sets to contain X_4 and at the same time be as small in cardinality as possible. An appropriate metric to quantify this is the F1 score, i.e., the harmonic mean of precision and recall. For example, the top result X* = {X_1, X_4} of an estimator τ has a recall of 1, a precision of 0.5, and hence an F1 score of 0.66.
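The metric of this experiment can be stated in a few lines, reproducing the worked example of the retrieved set {X1, X4} against the ground truth {X4}:

```python
def precision_recall_f1(retrieved, minimal_top):
    """Scores a retrieved attribute set against the minimal top
    dependency; F1 is the harmonic mean of precision and recall."""
    tp = len(retrieved & minimal_top)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(retrieved)
    recall = tp / len(minimal_top)
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f1 = precision_recall_f1({"X1", "X4"}, {"X4"})
# p = 0.5, r = 1.0, f1 = 2/3 (the 0.66 reported in the text)
```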
Like before, we consider τ ∈ {F̂₀, F̂_adj, F̂, F̂_χ,95, F̂_χ,99} and sample sizes n ∈ {5, 10, 20, 30, 40, 50, 60}, and for each n we sample 1000 datasets according to the network. Since the number of attributes is small, we use level-wise exhaustive search. We randomize the order in which candidates are explored in each level to remove any bias introduced by a deterministic order (footnote 6). We plot the average precision, recall, and F1 score over the 1000 datasets for each estimator and n in Fig. 5.
We see that F̂₀, F̂_χ,95, and F̂_χ,99 have much better F1 curves than F̂ and F̂_adj, with those of F̂₀ and F̂_χ,99 being the best. The plug-in estimator F̂ almost always retrieves {X_1, X_2, X_3, X_4} as a solution for n ≥ 20, and hence has very high recall but very low precision. For n = 5, 10, the estimate already reaches 1 before the last level of the search, and hence F̂ returns proper subsets of I, resulting in slightly higher precision. In other words, F̂ returns arbitrary solutions. The adjusted F̂_adj performs much better than F̂, but its large variance does not allow it to compete in terms of precision with the corrected estimators, and hence it has much lower F1 across all n.
We observe again that F̂₀ shows good performance. In fact, it has an F1 curve similar to that of F̂_χ,99, which corresponds to a significance level of 1%. At the same time, Fig. 3 suggests that smaller α values for F̂_χ,α can lead to a better bias-variance trade-off, but that would harm the F1 score, as we can see for F̂_χ,95 at 5%. The reliable F̂₀ achieves both high F1 and a good bias-variance trade-off without the need for any parameter, and hence is much more suitable for exploratory tasks.

Optimization performance
We next investigate the optimization performance of the proposed algorithms and bounding functions on real-world data. Our code is available online (footnote 7). We consider datasets from the KEEL data repository [29]. In particular, we use all classification datasets with d ∈ [10, 90] and no missing values, resulting in 35 datasets with on average 52,000 rows and 30 columns. All metric attributes are discretized in 5 equal-frequency bins. The datasets are summarized in Table 4. The runtimes are averaged over 3 runs.

We use two metrics for evaluation, the relative runtime difference and the relative difference in number of explored nodes. For methods A and B, the relative runtime difference on a particular dataset is computed as

rrd(A, B) = (τ_A − τ_B) / max(τ_A, τ_B),

where τ_A and τ_B are the runtimes for A and B, respectively. The rrd score lies in [−1, 1], where positive (negative) values indicate that B is proportionally faster (slower). For example, rrd(A, B) = 0.5 means that B is twice as fast as A. The relative nodes explored difference rnd is defined similarly. For both scores, we consider (−0.5, 0.5) to be a region of practical equivalence, i.e., a factor of 2 of improvement is required to consider a method "better".

Footnote 6: For example, let us assume that the top score for the first level of the search space, i.e., all singletons, is < 1, and that the first candidate from the second level is {X_1, X_2}. It might be that for an estimator τ both τ({X_1, X_2}; Y) = 1 and τ({X_3, X_4}; Y) = 1, in which case the algorithm will return as a solution {X_1, X_2} rather than {X_3, X_4}, which contains X_4, resulting in precision and recall of 0 instead of 0.5 and 1, respectively. Randomization alleviates this issue.

Footnote 7: https://github.com/pmandros/fodiscovery
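A closed form for rrd consistent with the stated properties (range [−1, 1], sign convention, and the factor-of-2 threshold at ±0.5) is the difference normalized by the larger runtime; since the excerpt omits the exact formula, this form is an assumption:

```python
def rrd(t_a, t_b):
    """Relative runtime difference between methods A and B.

    Positive values mean B is proportionally faster; a value of 0.5
    corresponds to B being twice as fast as A, the boundary of the
    practical-equivalence region (-0.5, 0.5).
    """
    return (t_a - t_b) / max(t_a, t_b)
```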

Branch-and-bound
We first investigate the performance of the exponential algorithm by comparing OPUS_spc and OPUS_mon, i.e., Algorithm 1 with f̂_spc and f̂_mon as bounding functions, respectively. For a fair comparison, we set a common α value for both methods on each dataset by determining the largest α, in increments of 0.05, such that both terminate in less than 90 minutes. The results are in Table 4. In Fig. 6 we present the comparison between OPUS_spc and OPUS_mon. The top plot demonstrates that f̂_spc can lead to a considerable reduction in nodes explored over f̂_mon. In particular, 15 cases have at least a factor of 2 reduction, 7 at least a factor of 4, and there is one with a factor of 760. For 20 cases there is no practical difference. The plot validates that the potential for additional pruning is indeed unbounded (Sect. 4). In terms of runtime efficiency (bottom plot), OPUS_spc is "faster" in 70% of the datasets. In more detail, and considering practical improvements, 12 datasets have at least a factor of 2 speedup, 6 at least a factor of 4, and 1 a factor of 266, while only 2 have a factor of 2 slowdown. Moreover, we observe from the plot (since datasets are sorted in decreasing number of attributes) a clear correlation between number of attributes and efficiency: 6 out of the 10 datasets with a slowdown are also the ones with the fewest attributes. We observe in general that both bounding functions, and particularly f̂_spc, make the branch-and-bound search very effective in practice, requiring a couple of minutes on average for termination with good approximation guarantees.
In Table 4 we also show the maximum depth and solution depth for OPUS_spc, i.e., how far into the search space the algorithm had to go and in which level the solution was found. We see that F̂₀ indeed retrieves solutions of small cardinality, 3.6 on average, which is a reasonable number for the size of the data considered. The f̂_spc, on the other hand, with a maximum depth of 5.9 levels on average, prunes many of the higher levels of the search space, which explains to a large extent its effectiveness.

Greedy
We now proceed with the evaluation of the heuristic search. We present the relative runtime differences of GRD and GRD_spc, i.e., Algorithm 2 without and with f̂_spc, respectively, in Fig. 7 (results in Table 4). While the greedy algorithm is fast, the plot shows that f̂_spc indeed improves the efficiency of the heuristic search: for 12 datasets there is a speedup of at least a factor of 2, and for 8 of at least a factor of 4.
Next, we investigate the quality of the greedy results. Note that this is possible as we have access to the branch-and-bound results. In Fig. 8 we plot the differences between the F̂₀ scores of the results obtained by greedy and branch-and-bound on each dataset. Note that branch-and-bound uses the same α values as in the experiments of Sect. 6.3.1, and that we only plot the nonzero differences in the two plots, left for α = 1, i.e., optimal solutions, and right for α < 1, i.e., approximate solutions with guarantees.
At first glance, we observe that there is no difference in 21 out of the 35 cases considered, 7 where greedy is better (of course only on datasets where α < 1), and 7 where branch-and-bound is better. Out of the 21 cases where the two algorithms have equal F̂₀, 16 have α = 1, i.e., the greedy algorithm is optimal roughly 45% of the time. Moreover, where branch-and-bound is better, it is only by a small margin, 0.03 on average, while greedy "wins" by 0.1 on average. Another observation from the right plot of Fig. 8 is that the largest differences between the two algorithms occur for the 3 datasets where the lowest α values were used, i.e., 0.05, 0.1, and 0.35.
In Fig. 9 we consider the relative runtime difference between greedy and branch-and-bound, i.e., GRD_spc and OPUS_spc. As expected, the greedy algorithm is significantly faster in the majority of cases. There are, however, 4 cases where branch-and-bound terminates much faster, which also happen to coincide with the more aggressive α values for branch-and-bound.

Fig. 8 Evaluating the heuristic algorithm for result quality. Left: difference in F̂₀ between methods GRD_spc and OPUS_spc (i.e., F̂₀(X*_grd; Y) − F̂₀(X*_bnb; Y), where X*_grd and X*_bnb are the solutions of Algorithms 2 and 1, respectively) for α = 1. Since α = 1, the negative values close to 0 indicate that Algorithm 2 retrieves nearly optimal solutions. Data are sorted in increasing quality difference. Right: difference for α < 1. Positive values indicate that Algorithm 2 retrieves better solutions when Algorithm 1 uses guarantees α < 1. Data are sorted in increasing α values.

Fig. 9 Evaluating the heuristic algorithm in terms of running time. Relative runtime difference between methods GRD_spc and OPUS_spc. Positive (negative) numbers indicate that GRD_spc (OPUS_spc) is proportionally "better". Datasets are ordered in decreasing number of attributes.

Case studies
We close this section with examples of concrete dependencies discovered in two different applications: determining the winner of a tic-tac-toe configuration and predicting the preferred crystal structure of octet binary semi-conductors. Both settings are examples of problems where elementary input features are available, but to correctly represent the input/output relation either nonlinear models have to be used or-if interpretable models are soughtcomplex auxiliary features have to be constructed from the given elementary features.
The game of tic-tac-toe [30] is one of the earliest examples of this complex feature construction problem. Tic-tac-toe is a two-player game where each player picks a symbol from {x, o} and, taking turns, marks their symbol in an unoccupied cell of a 3 × 3 game board. A player wins the game by marking 3 consecutive cells in a row, column, or diagonal. A game can end in a draw if the board configuration does not allow for any winning move. Searching for dependencies reveals as the top pattern, with empirical fraction of information F̂ = 0.61 and corrected score F̂₀ = 0.45, the variable set X = {X_1, X_3, X_5, X_7, X_9}, i.e., the four corner cells and the middle one. This is a sensible discovery, as these cells correspond exactly to those involved in the highest number of winning combinations (see Fig. 10). Moreover, removing a variable results in the loss of a considerable amount of information, while adding a variable would provide more information, but also redundancy. That is, the increase in fraction of information would not be higher than the increase of the correction term b̂₀.
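The claim that the discovered cells are exactly those involved in the most winning combinations is easy to verify programmatically (assuming, as in the text, that cells X1..X9 are numbered row-major):

```python
# the 8 winning lines of a 3x3 board, cells X1..X9 numbered row-major
lines = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9},   # rows
         {1, 4, 7}, {2, 5, 8}, {3, 6, 9},   # columns
         {1, 5, 9}, {3, 5, 7}]              # diagonals

# count for every cell how many winning lines it participates in
counts = {cell: sum(cell in line for line in lines) for cell in range(1, 10)}
top_five = set(sorted(counts, key=counts.get, reverse=True)[:5])
assert top_five == {1, 3, 5, 7, 9}  # center (4 lines) and corners (3 each)
```

The center cell lies on 4 winning lines and each corner on 3, while the edge cells lie on only 2, matching the discovered set {X1, X3, X5, X7, X9}.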
Our second example is a classical problem from materials science [31], which has meanwhile become a canonical example for the challenge of automatically discovering interpretable and "physically meaningful" prediction models of material properties [1,2]. The task is to predict the symmetry, or crystal structure, in which a given binary compound semiconductor material will crystallize. That is, each of the 82 materials involved consists of two atom types (A and B), and the output variable Y ∈ {rocksalt, zincblende} describes the crystal structure it prefers energy-wise. The input variables are 14 electro-chemical features of the two atom types considered in isolation: the radii of the three different electron orbital shapes s, p, and d of atom type A, denoted r_s(A), r_p(A), r_d(A), as well as four important energy quantities that determine its chemical properties (electron affinity, ionization potential, HOMO and LUMO energy levels); the same variables are defined for component B. Searching for dependencies reveals as the top pattern the set X = {r_s(A), r_p(A)}, i.e., the atomic s and p radii of component A. Again, this is a sensible finding, since these two variables constitute two out of the three variables contained in the best structure prediction model that can be identified using the nonlinear subgroup discovery approach [1] (see Fig. 11). Also, both features are involved in the best linear LASSO model based on systematically constructed nonlinear combinations of the elementary input variables [2]. The fact that not all variables of those models are identified can likely be explained by the facts that (a) the continuous input variables had to be discretized, and (b) the dataset is extremely small with only 82 entries, which renders the discovery of reliable patterns with more than two variables very challenging.

Discussion and Conclusions
We considered the dual problem of measuring and efficiently discovering functional dependencies from data. For a model-agnostic and interpretable knowledge discovery procedure, we adopted an information-theoretic approach and proposed a consistent and robust estimator for mutual information suitable for optimization in high-dimensional data. We proved the NP-hardness of the problem, and derived two bounding functions for the estimator that can be used to prune the search space. With these, we can effectively discover the optimal, or α-approximate, top-k dependencies with branch-and-bound. The experimental evaluation showed that the estimator has the desired statistical properties, the bounding functions are very effective for both exhaustive and heuristic algorithms, and the greedy algorithm provides solutions that are nearly optimal. Qualitative experiments on two case studies indicate that our proposed framework indeed discovers informative dependencies. While the given reduction from set cover can be extended to show that, unless P=NP, no fully polynomial time approximation scheme exists, the possibility for weaker approximation guarantees remains. In particular, the strong empirical performance of the greedy algorithm hints that F̂₀ could have a certain structure favored by greedy search, e.g., some weaker form of submodularity (recall that F̂₀ is neither submodular nor monotone). For instance, one could explore ideas from Horel and Singer [32], where a monotone function is ε-approximately submodular if it can be bounded by a submodular function within a factor of 1 ± ε. Another idea is that of restricted submodularity for monotone functions [33], where a function is submodular over a subset of the search space. One can also explore the submodularity index for general set functions [34], where a proxy for the degree of non-submodularity is incorporated into the approximation guarantee.
For future work, the proposed bounding functions are likely to be applicable to a larger selection of corrected-for-chance dependency measures. For example, the monotonicity-based bounding function only requires a correction term that is monotonically increasing with the superset relation. Additionally, it is also of interest to discover functional dependencies in continuous data. As entropy has been defined for such data, e.g., differential and cumulative entropy [35], it is possible to instantiate fraction of information scores. The question would be in what way these measures can be corrected for chance, and whether optimistic estimators exist that allow for efficient and exact optimization.