The randomized information coefficient: assessing dependencies in noisy data
Abstract
When differentiating between strong and weak relationships using information-theoretic measures, the variance plays an important role: the higher the variance, the lower the chance of correctly ranking the relationships. We propose the randomized information coefficient (RIC), a mutual-information-based measure with low variance, to quantify the dependency between two sets of numerical variables. We first formally establish the importance of achieving low variance when comparing relationships using mutual information estimated with grids. Second, we experimentally demonstrate the effectiveness of RIC for (i) detecting noisy dependencies and (ii) ranking dependencies in the applications of genetic network inference and feature selection for regression. Across these tasks, RIC is very competitive against 16 other state-of-the-art measures. Other prominent features of RIC include its simplicity and efficiency, making it a promising new method for dependency assessment.
Keywords
Dependency measures · Noisy relationships · Normalized mutual information · Randomized ensembles
1 Introduction
Differences between dependency measures for variables and for sets of variables
Variables  Sets of variables  

Symbol  \(\mathcal {D}(X,Y)\), where X and Y are one-dimensional variables  \(\mathcal {D}({\mathbf {X}},{\mathbf {Y}})\), where \({\mathbf {X}}\) and \({\mathbf {Y}}\) are sets of p and q variables, respectively 
Example  \(\mathcal {D}(\texttt {weight},\texttt {height})\)  \(\mathcal {D}(\{\texttt {weight},\texttt {height}\},\texttt {BMI})\) 
Application  Feature filtering for regression, Genetic network inference  Feature selection for regression 
The intuition behind this measure is that on average a random grid can encapsulate the relationship between \({\mathbf {X}}\) and \({\mathbf {Y}}\). Both random discretization and ensembles of classifiers have been shown to be effective in machine learning, for example, in random forests (Breiman 2001). Substantial randomization has been shown to be even more effective in reducing the variance of predictions (Geurts et al. 2006). Our aim is to exploit this powerful approach to develop an efficient, effective, and easy-to-compute statistic for quantifying the dependency between two sets of variables.
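As a concrete illustration of this idea, the following minimal sketch averages the normalized mutual information \( NI \) over K random grids. It is not the authors' implementation: the function names and the choice of drawing cutoffs directly from the data values (loosely mirroring a "random seeds" style of grid generation) are our own assumptions.

```python
import numpy as np

def nmi_on_grid(x, y, x_cuts, y_cuts):
    """Plug-in normalized mutual information of x, y discretized by the cutoffs."""
    xi = np.digitize(x, x_cuts)
    yi = np.digitize(y, y_cuts)
    n = len(x)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    for a, b in zip(xi, yi):
        joint[a, b] += 1
    joint /= n
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    mi = 0.0
    for a in range(joint.shape[0]):
        for b in range(joint.shape[1]):
            if joint[a, b] > 0:
                mi += joint[a, b] * np.log(joint[a, b] / (px[a] * py[b]))
    denom = max(hx, hy)
    return mi / denom if denom > 0 else 0.0

def ric(x, y, K=20, D_max=None, rng=None):
    """Sketch of RIC: average NI over K random discretization grids."""
    rng = np.random.default_rng(rng)
    n = len(x)
    if D_max is None:
        D_max = int(np.sqrt(n))
    scores = []
    for _ in range(K):
        dx = rng.integers(2, D_max + 1)   # random number of bins per axis
        dy = rng.integers(2, D_max + 1)
        x_cuts = np.sort(rng.choice(x, size=dx - 1, replace=False))
        y_cuts = np.sort(rng.choice(y, size=dy - 1, replace=False))
        scores.append(nmi_on_grid(x, y, x_cuts, y_cuts))
    return float(np.mean(scores))
```

No optimization over grids is performed: every grid contributes to the average, which is what drives the variance reduction discussed below.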

We propose a low-variance statistic (RIC) based on information and ensemble theory, which is efficient and easy to compute;

Via theoretical analysis and extensive experimental evaluation, we link our measure’s strong performance on (i) discrimination between strong and weak noisy relationships and (ii) ranking of relationships to its low-variance estimation of mutual information;

We extensively demonstrate the competitive performance of RIC versus 16 state-of-the-art dependency measures in both simulated and real scenarios.
2 Related work
We first present a brief review of the many available dependency measures and their connections with RIC.
2.1 Correlation and kernel based measures
When the user is only interested in linear dependencies between two variables, the sample Pearson correlation coefficient r is powerful. This was extended in Székely and Rizzo (2009) to handle non-linear dependencies between two sets of variables using distance correlation (dCorr). More recently, random projections have been employed to achieve speed improvements (Lopez-Paz et al. 2013), yielding the randomized dependency coefficient (RDC). RDC may be seen as a randomized way to identify the maximal correlation between sets of variables, and thus also as an extension of the alternative conditional expectation (ACE) algorithm proposed in Breiman and Friedman (1985). In our work, the random discretization grids used in RIC can be seen as random projections. However, we do not use a linear measure of dependency such as r, because this would require optimization across projections to return a meaningful result. Instead, we compute the normalized mutual information, which quantifies non-linear dependencies, for each possible projection (grid). This allows us to take every single grid into account: each grid contributes to the average value of \( NI \) across grids, and no optimization is required.
The correlation between two sets of variables can also be measured by employing the joint distribution of the studied variables under kernel embeddings. The Hilbert–Schmidt independence criterion (HSIC) (Gretton et al. 2005) is an example of such a measure and has been shown to be competitive in feature selection tasks (Song et al. 2007). RIC measures the dependency between two sets of variables employing their distribution without kernel embeddings: the distribution is efficiently estimated using the random grid, and no kernels are needed because the grid-estimated distribution can be plugged directly into the normalized mutual information formula.
2.2 Mutual information
The mutual information (MI) between two sets of random variables \(I({\mathbf {X}},{\mathbf {Y}})\) is a powerful and well established dependency measure (Cover and Thomas 2012). A number of different estimators have been proposed for mutual information (Steuer et al. 2002; Kraskov et al. 2004). The standard approach, however, consists of discretizing the space of possible values that \({\mathbf {X}}\) and \({\mathbf {Y}}\) can take and then estimating the probability mass function using the frequency of occurrence. There are many possible approaches to the discretization of random variables. For example, a single random variable can easily be discretized with equal-width or equal-frequency binning, or according to more complex principles such as the minimum description length (Fayyad and Irani 1993). We note that there is no universally accepted optimal discretization technique. Although a few sensible discretization approaches for sets of variables have been proposed (Dougherty et al. 1995; Garcia et al. 2013), to our knowledge there is no extensive survey on estimating mutual information with multi-variable discretization approaches.
Mutual information estimators based on discretization into equal-width intervals have been discussed in Steuer et al. (2002). Particularly crucial is the choice of the number of bins used to discretize X and Y: values that are too large lead to overestimation of mutual information due to a finite-sample effect. To mitigate this problem, adaptive partitioning of the discretization grid on the joint distribution (X, Y) has been proposed (Fraser and Swinney 1986) and optimized for speed (Cellucci et al. 2005). Other competitive mutual information estimators used in practice are Kraskov’s k-nearest-neighbors estimator (Kraskov et al. 2004) and the kernel density estimator (Moon et al. 1995). An extensive comparison of these estimators can be found in Khan et al. (2007). Mutual information has been successfully employed in a variety of applications, such as feature selection (Nguyen et al. 2014b) and reverse engineering of genetic networks (Villaverde et al. 2013). Given the large number of application scenarios of mutual information and its undeniable efficacy, we choose the discretization-based MI estimator as the main building block of RIC. We further make use of normalization because it helps deflate mutual information on finite samples, bounding the output values in [0, 1] (Romano et al. 2014).
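The finite-sample overestimation mentioned above can be reproduced in a few lines. This is an illustrative sketch (our own, not tied to any implementation from the cited works): with independent Gaussians the true MI is 0, yet the plug-in estimate grows as the equal-width grid becomes denser.

```python
import numpy as np

def plugin_mi(x, y, bins):
    """Plug-in MI estimate from an equal-width grid with `bins` bins per axis."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Independent variables: true MI is 0, but a denser grid inflates the estimate.
rng = np.random.default_rng(0)
x, y = rng.normal(size=200), rng.normal(size=200)
coarse = plugin_mi(x, y, bins=4)    # mild finite-sample bias
fine = plugin_mi(x, y, bins=40)     # strong overestimation
```

This is exactly the effect that adaptive partitioning and careful bin-count choices such as \(D = \lfloor \sqrt{n/5} \rfloor \) try to mitigate.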
2.3 Other information theoretic measures
More recently, new measures based on information theory, such as the maximal information coefficient (MIC) presented in Reshef et al. (2011) and the mutual information dimension (MID) (Sugiyama and Borgwardt 2013), have been proposed. MID is based on discretization and it aims to outperform other measures in white noise scenarios. In particular, it outperforms MIC under white noise. Other prominent features of MID include its efficiency with an average running time \({\mathcal {O}}(n\log {n})\), and the ability to characterize multifunctional relationships with a score of 1. MIC is another successful measure of dependence whose value is interpretable in various settings. Its value is obtained by performing discretization using grids over the joint distribution (X, Y). MIC satisfies a useful property called equitability, which allows it to act as a proxy for the coefficient of determination \(R^2\) of a functional relationship (Reshef et al. 2015b).
Dependency measures available in the literature, compared by their applicability to sets of variables and their best- and worst-case computational complexity
Family  Acr.  Name  References  Sets of vars.  Best compl.  Worst compl. 

Mutual information estimators  \(I_\text {ew}\)  Mutual information (discretization equal width)  Steuer et al. (2002)  ✗  \({\mathcal {O}}(n^{1.5})\)  
\(I_\text {ef}\)  Mutual information (discretization equal frequency)  Steuer et al. (2002)  ✗  \(\mathcal {O}(n^{1.5})\)  
\(I_{\text {A}}\)  Mutual information (adaptive partitioning)  Cellucci et al. (2005)  ✗  \(\mathcal {O}(n^{1.5})\)  
\(I_{\text {mean}}\)  Mutual information (mean nearest neighbours)  Faivishevsky and Goldberger (2009)  ✓  \(\mathcal {O}(n^{2})\)  
\(I_{\text {KDE}}\)  Mutual information (kernel density estimation)  Moon et al. (1995)  ✓  \({\mathcal {O}}(n^{2})\)  
\(I_{k\text {NN}}\)  Mutual information (nearest neighbours)  Kraskov et al. (2004)  ✓  \(\mathcal {O}(n^{1.5})\)  \(\mathcal {O}(n^{2})\)  
Correlation based  \(r^2\)  Squared Pearson’s correlation  –  ✗  \(\mathcal {O}(n)\)  
ACE  Alternative conditional expectation  Breiman and Friedman (1985)  ✗  \(\mathcal {O}(n)\)  
dCorr  Distance correlation  Székely and Rizzo (2009)  ✓  \(\mathcal {O}(n \log {n})\)  \(\mathcal {O}(n^2)\)  
RDC  Randomized dependency coefficient  LopezPaz et al. (2013)  ✓  \(\mathcal {O}(n \log {n})\)  
Kernel based  HSIC  Hilbert–Schmidt independence criterion  Gretton et al. (2005)  ✓  \(\mathcal {O}(n^2)\)  
Information theory based  MIC  Maximal information coefficient  Reshef et al. (2011)  ✗  \(\mathcal {O}(n)\)  \(\mathcal {O}(n^{3.6})\) 
GMIC  Generalized mean information coefficient  Luedtke and Tran (2013)  ✗  \(\mathcal {O}(2^n)\)  
MID  Mutual information dimension  Sugiyama and Borgwardt (2013)  ✗  \(\mathcal {O}(n \log {n})\)  \(\mathcal {O}(n^2)\)  
MIC\(_e\)  Maximal information coefficient  Reshef et al. (2015b)  ✗  \(\mathcal {O}(n)\)  \(\mathcal {O}(n^{2.25})\)  
TIC\(_e\)  Total information coefficient  Reshef et al. (2015b)  ✗  \(\mathcal {O}(n)\)  \(\mathcal {O}(n^{2.25})\)  
RIC  Randomized information coefficient  –  ✓  \(\mathcal {O}(n^{1.5})\) 
In this paper we introduce RIC, a dependency measure for comparing sets of random variables based on normalized mutual information, which is efficient and easy to compute. Table 2 lists the dependency measures currently available in the literature. Not all of them are applicable to sets of variables, and some have high computational complexity with regard to the number of points n. Some complexities can only be obtained with particular parameter choices or clever implementation techniques; we refer to the respective papers for a detailed analysis. Moreover, recent advances in this area have delivered faster computational techniques for the most recently proposed measures of dependence. For example, the approximate estimator for the population value of MIC can be sped up (Tang et al. 2014; Zhang et al. 2014), and the new exact estimator MIC\(_e\) provides very competitive computational complexity. Very recently, a new technique for fast computation of distance correlation has also been proposed (Huo and Szekely 2014).
3 The randomized information coefficient
Theorem 1
 (i)
\(\mathcal {RIC}({\mathbf {X}},{\mathbf {Y}}) = 0\) if and only if \({\mathbf {X}}\) and \({\mathbf {Y}}\) are independent;
 (ii)
\(\mathcal {RIC}({\mathbf {X}},{\mathbf {Y}}) \le 1\).
Proof
 (i)
(\(\mathbf {X}\) and \(\mathbf {Y}\) are independent \(\Rightarrow \) \(\mathcal {RIC}(\mathbf {X},\mathbf {Y}) = 0\)) If the variables in \(\mathbf {X}\) are independent from the variables in \({\mathbf {Y}}\), then for any randomization grid G it holds that \({\mathcal {I}}({\mathbf {X}},{\mathbf {Y}}|G) = 0\). Therefore, \(\mathcal {RIC}({\mathbf {X}},{\mathbf {Y}}) = 0\).
(\(\mathcal {RIC}({\mathbf {X}},{\mathbf {Y}}) = 0\) \(\Rightarrow \) \({\mathbf {X}}\) and \(\mathbf {Y}\) are independent) For any randomization grid G, the mutual information is always nonnegative: \(\mathcal {I}(\mathbf {X},\mathbf {Y}|G) \ge 0\). Thus, the normalized mutual information is also nonnegative: \(\mathcal {NMI}(\mathbf {X},\mathbf {Y}|G) \ge 0\). Since RIC is the expected value of a nonnegative quantity, \(\mathcal {RIC}(\mathbf {X},\mathbf {Y}) = \int _{G} \mathcal {NMI}(\mathbf {X},\mathbf {Y}|G)P(G)\, dG = 0\) implies that \(\mathcal {NMI}(\mathbf {X},\mathbf {Y}|G)\) is equal to 0 for every possible G. If the normalized mutual information is 0, the mutual information is also 0: \(\mathcal {I}(\mathbf {X},\mathbf {Y}|G) = 0\). This implies that \(\mathbf {X}\) and \(\mathbf {Y}\) are independent under the discretization imposed by G (Cover and Thomas 2012). Since this holds for every possible discretization grid G, \(\mathbf {X}\) and \(\mathbf {Y}\) are independent.
 (ii)For any grid G, \(\mathcal {NMI}(\mathbf {X},\mathbf {Y}|G) \le 1\) because \(\mathcal {I}(\mathbf {X},\mathbf {Y}|G) \le \max { \{ \mathcal {H}(\mathbf {X}|G),\mathcal {H}(\mathbf {Y}|G) \}}\). Thus,$$\begin{aligned} \mathcal {RIC}(\mathbf {X},\mathbf {Y})&= \int _{G} \mathcal {NMI}(\mathbf {X},\mathbf {Y}|G)P(G)\, dG \\&\le \int _G P(G)\, dG = \int _{-\infty }^{\infty } \cdots \int _{-\infty }^{\infty } P(\gamma _1,\ldots ,\gamma _{D_{\max }^2})\, d\gamma _1 \cdots d\gamma _{D_{\max }^2} \\&= \int _{-\infty }^{\infty } P(\gamma _1)\, d\gamma _1 \cdots \int _{-\infty }^{\infty } P(\gamma _{D_{\max }^2})\, d\gamma _{D_{\max }^2} = 1. \end{aligned}$$
Note that RIC is equal to 0 when the variables in \({\mathbf {X}}\) are independent from the variables in \({\mathbf {Y}}\), even if the variables within the set \({\mathbf {X}}\) or within the set \({\mathbf {Y}}\) are dependent on each other.
Contingency table \(({\mathbf {X}},{\mathbf {Y}})|G\) on a data set \(\{({\mathbf {X}}_i,{\mathbf {Y}}_i)\}_{i=0,\ldots ,n-1}\) defined by the grid G


\(\mathcal {O} \left( K_r \cdot D_{\max } \cdot n + K_r^2(n + D_{\max }^2) \right) \) if random ferns are used;

\(\mathcal {O} \left( K_r \cdot D_{\max } \cdot n \cdot (p + q) + K_r^2(n + D_{\max }^2) \right) \) if random seeds are used.
4 Variance analysis of grid estimators of mutual information
In this section, we theoretically justify the use of random grids to obtain small variance with the RIC statistic. Then, we prove that a lower variance is beneficial when comparing dependencies and ranking relationships according to the grid estimator of mutual information.
4.1 Ensembles for reducing the variance
The main motivation for our use of random discretization grids is that averaging across independent random grids allows a reduction in variance (Geurts 2002). By using random grids, it is possible to achieve small correlation between the different estimates of \( NI \). The variance of RIC tends to be small if these estimates are uncorrelated.
Theorem 2
Proof
We aim to show that the decrease in variance is due to the random grid G by comparing the variance of RIC with that of \( NI _F\), where F is a fixed grid with equal-width bins for X and Y. The number of bins for each variable is fixed to 9 for both G and F, and cutoffs are generated in the ranges \([-2,2]\) and \([-3,3]\) for X and Y, respectively. The chosen joint distribution (X, Y) is induced on \(n = 100\) points with \(X \sim \mathcal {N}(0,1)\) and \(Y = X + \eta \) with \(\eta \sim \mathcal {N}(0,1)\). The variance of RIC decreases as K increases because the random grids enable us to decorrelate the estimations of \( NI \). In general, if we allow grids of different cardinality (different numbers of cutoffs) and large K, the variance can be decreased even further.
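The effect of K on the variance can be checked empirically with a sketch along the lines of the setup above (\(n = 100\), \(X \sim \mathcal {N}(0,1)\), \(Y = X + \eta\), cutoffs drawn in \([-2,2]\) and \([-3,3]\)). The function names and the sentinel outer edges are our own illustrative choices, not the authors' experimental code.

```python
import numpy as np

def ni_on_grid(x, y, x_edges, y_edges):
    """Plug-in normalized mutual information on the grid given by bin edges."""
    joint, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    d = max(hx, hy)
    return mi / d if d > 0 else 0.0

def edges(cuts):
    # sentinel outer edges so that no point falls outside the grid
    return np.concatenate(([-10.0], np.sort(cuts), [10.0]))

def avg_ni_variance(K, trials=200, n=100, seed=0):
    """Variance, over repeated datasets, of the average NI over K random grids
    with 8 random cutoffs per axis in [-2, 2] for X and [-3, 3] for Y."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        x = rng.normal(size=n)
        y = x + rng.normal(size=n)      # Y = X + eta
        s = [ni_on_grid(x, y,
                        edges(rng.uniform(-2, 2, 8)),
                        edges(rng.uniform(-3, 3, 8)))
             for _ in range(K)]
        scores.append(np.mean(s))
    return float(np.var(scores))
```

Averaging over K grids suppresses the grid-induced component of the variance by roughly a factor of K, while the data-sampling component is unaffected.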
4.2 Importance of variance in comparing relationships using the grid estimator of mutual information
When mutual information is used as a proxy for the strength of the relationship, a small estimation variance is likely to be more useful than a smaller bias when comparing relationships, as implied by some observations in Kraskov et al. (2004), Margolin et al. (2006) and Schaffernicht et al. (2010). The reason is that systematic biases cancel each other out. We formalize these observations as follows:
Theorem 3
Proof
Remark
If there is a systematic bias component, the variance of a dependency measure is also important for determining whether a relationship exists. The probability of making an error in determining whether a relationship exists (independence testing between X and Y with \(\hat{\phi }\)) is just a special case of Theorem 3 where \(\phi _w = 0\).
Regarding the grid estimator of mutual information \(I_F\) on a fixed grid F with \(n_F\) bins, there is always a systematic bias component which is a function of the number of samples n and the number of bins \(n_F\) (Moddemeijer 1989). This systematic bias component cancels out in \(\text{ bias }(I_{F,s}) - \text{ bias }(I_{F,w})\). If the nonsystematic estimation bias is small enough, then the denominator of the upper bound is dominated by the true difference \(\mathcal {I}_s - \mathcal {I}_w\). Therefore, the upper bound decreases because of the numerator, i.e., the sum of the variances. Of course, variance is just part of the picture: decreasing the variance of an estimator is only worthwhile if the estimand has some utility. Moreover, many estimators exhibit a bias-variance trade-off, and deliberately reducing the variance at the expense of bias is not a good idea. Variance can safely be reduced when there is a strong systematic estimation bias component and the effect on the nonsystematic bias is minimal.
Moreover, when the dependency measure with a systematic bias is used for ranking relationships, we can still show that reducing the estimator variance plays an important role.
Corollary 1
Proof
As we empirically demonstrated above for the grid estimator of mutual information, \(\text{ bias }(\hat{\phi }_{i+1}) - \text{ bias }(\hat{\phi }_i) \) tends to be small if there is some systematic bias component, and thus a small variance is the main contributor to the accuracy.
Remark about bootstrapping It is also natural to ask whether bootstrapping improves the discrimination performance of a statistic by decreasing its variance. When bootstrapping, the statistic is actually estimated on around 63% of the samples, and this decreases the discrimination ability of each measure. Similarly, sampling a smaller number of points without replacement and averaging across different estimates of a measure is not expected to perform well. The best way to decrease the variance is thus to inject randomness into the estimator itself. This is the rationale for RIC: we use a strong measure such as mutual information and inject randomness into its estimation in order to decrease the overall variance.
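The ≈63% figure follows from the fact that a bootstrap resample of size n contains on average a fraction \(1 - (1 - 1/n)^n \rightarrow 1 - 1/e \approx 0.632\) of distinct samples. A quick numerical check:

```python
import numpy as np

# Fraction of distinct indices in bootstrap resamples of size n, averaged
# over 200 repetitions; the theoretical value is 1 - (1 - 1/n)^n ~ 0.632.
rng = np.random.default_rng(0)
n = 1000
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(200)]
avg_unique = float(np.mean(fractions))
```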
5 Experiments on dependency between two variables
In this section, we compare RIC with 16 other state-of-the-art statistics that quantify the dependency between two variables X and Y. We focus on three tasks: identification of noisy relationships, inference of networks of variables, and feature filtering for regression. Table 4 shows the list of competitor measures compared in this paper and the parameters used in their analysis. The parameters used are the defaults suggested by the authors of the measures in their respective papers. Indeed, only in the task of feature filtering for regression is it possible to tune parameters with cross-validation on a given data set. The tasks of inference of networks of variables and identification of noisy relationships are unsupervised learning tasks and do not allow parameter tuning when applied to a new data set. Nonetheless, most of the default parameters are not tuned for hypothesis testing. Therefore, we decided to follow the approach used in Reshef et al. (2015a). In that comprehensive empirical study, leading measures of dependence are compared in terms of two important features: equitability and power against independence. Similarly, in this paper we discuss the power against independence under different noise models as well as the equitability of the measures. When testing the power of a measure for a particular noise model, we identify the best parameters for independence testing by maximizing the power averaged over a set of relationships and different noise levels.
Dependency measures compared in this paper and parameters used in the tasks of network inference, feature filtering for regression, and estimation of running times
Family  Acr.  Name  Parameters 

Mutual information estimators  \(I_\text {ew}\)  Mutual information (discretization equal width)  \(D = \lfloor \sqrt{n/5} \rfloor \) 
\(I_\text {ef}\)  Mutual information (discretization equal frequency)  \(D = \lfloor \sqrt{n/5} \rfloor \)  
\(I_{\text {A}}\)  Mutual information (adaptive partitioning)  –  
\(I_{\text {mean}}\)  Mutual information (mean nearest neighbours)  –  
\(I_{\text {KDE}}\)  Mutual information (kernel density estimation)  \(h_0 = n^{-1/6}\)  
\(I_{k\text {NN}}\)  Mutual information (nearest neighbours)  \(k = 6\)  
Correlation based  \(r^2\)  Squared Pearson’s correlation  – 
ACE  Alternative conditional expectation  \(\epsilon = 10^{-12}\)  
dCorr  Distance correlation  –  
RDC  Randomized dependency coefficient  \(k = 20\), \( s = 1/6\)  
Kernel based  HSIC  Hilbert–Schmidt independence criterion  \(\sigma _X,\sigma _Y = \) med. dist. 
Information theory based  MIC  Maximal information coefficient  \(\alpha = 0.6\) 
MIC\(_e\)  Maximal information coefficient  \(\alpha = 0.6\)  
GMIC  Generalized mean information coefficient  \(\alpha = 0.6, p=1\)  
MID  Mutual information dimension  –  
TIC\(_e\)  Total information coefficient  \(\alpha = 0.65\)  
RIC  Randomized information coefficient  \(K_r = 20\), \(D_{\max } = \lfloor \sqrt{n} \rfloor \) 
5.1 Identification of noisy relationships
We consider the task of discriminating between noise and a noisy relationship, i.e., determining whether a dependency exists by testing for independence between X and Y, across a large number of dependency types. In Fig. 5, 12 different relationships between X and Y are induced on \(n = 320\) data points.
We show the performance of RIC with \(D_{\max } = \lfloor \sqrt{n/4} \rfloor \) and \(K_r = 200\) as obtained by parameter tuning. Detailed results for each relationship type are provided in “Appendix A”. Note that because not all the relationships in Fig. 5a are functional, it is not possible to plot power against a normalized x-axis as in Reshef et al. (2015a), where the power on functional relationships is plotted against the \(R^2\) between the true underlying function and its noisy version. In this paper, we follow the approach in Simon and Tibshirani (2011), where the x-axis represents a non-normalized amount of noise added to the relationship between the variables. Therefore, the amount of noise at a particular value on the x-axis for one relationship is not comparable with the amount of noise added to another relationship at the same point on the x-axis. Nonetheless, on our set of relationships all the power plots are monotonically decreasing and do not appear to intersect each other. In particular, if a dependency measure \(\mathcal {D}_1\) shows higher power than a measure \(\mathcal {D}_2\) at a given level of noise, \(\mathcal {D}_1\) will also have higher power than \(\mathcal {D}_2\) at a higher level of noise. Please refer to Fig. 20 in “Appendix A”.
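A generic recipe for power estimates of this kind can be sketched as follows. This is a hedged illustration, not the paper's exact simulation protocol: we use a permutation null to calibrate the critical value, and the absolute Pearson correlation stands in for an arbitrary dependency measure.

```python
import numpy as np

def estimate_power(stat, f, sigma, n=320, trials=100, alpha=0.05, seed=0):
    """Power of `stat` for detecting y = f(x) + sigma * noise at level alpha.
    The null distribution is obtained by permuting y, which breaks the
    dependency while preserving the marginals."""
    rng = np.random.default_rng(seed)
    null_stats, alt_stats = [], []
    for _ in range(trials):
        x = rng.uniform(size=n)
        y = f(x) + sigma * rng.normal(size=n)
        alt_stats.append(stat(x, y))                    # dependent sample
        null_stats.append(stat(x, rng.permutation(y)))  # independent sample
    crit = np.quantile(null_stats, 1 - alpha)
    return float(np.mean(np.array(alt_stats) > crit))
```

With a low-variance statistic, the alternative distribution concentrates further above the critical value, which is exactly the mechanism analyzed in Sect. 4.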
RIC also shows good performance under the white noise model, but it is outperformed by HSIC. Average results are shown in Fig. 8. For many measures, the optimal parameters under white noise differ from those under additive noise, as shown in “Appendix A”. Regarding the grid estimators of mutual information (RIC, TIC\(_e\), MIC, MIC\(_e\), and MID), a denser grid is better suited for the white noise scenario because points are uniformly distributed on the joint domain (X, Y). \(I_{kNN }\) presents competitive performance under white noise when k is small. As in the additive noise model, TIC\(_e\) proved to be a strong competitor to RIC in this scenario. By contrast, dCorr seems uncompetitive under the white noise model. HSIC with a very small kernel width performs best under white noise.
5.2 Equitability
In this section, we assess the equitability of the measures discussed in this paper. A dependence measure is equitable if it provides similar scores to equally noisy relationships of different kinds, relative to some measure of noise (Reshef et al. 2011, 2015a, b). For example, in the case of functional relationships, one natural instantiation of equitability is for an equitable measure of dependence to assign similar scores to relationships with the same coefficient of determination \(R^2\) between the true underlying function and its noisy version. Therefore, for functional relationships, an equitable measure should equal 1 when the dependency between the variables is noiseless.
The two different sets of functional relationships used in the equitability analysis
Simon and Tibshirani (2011)  

1  \( y = x + \sigma \cdot \varepsilon \) 
2  \( y = 4 (x - 0.5)^2 + \sigma \cdot \varepsilon \) 
3  \( y = 128\left( x - \frac{1}{3}\right)^3 - 48\left(x - \frac{1}{3}\right)^2 - 12 \left(x - \frac{1}{3}\right) + \sigma \cdot \varepsilon \) 
4  \( y = \sin (4 \pi x) + \sigma \cdot \varepsilon \) 
Reshef et al. (2011)  

1  \( y = x + \sigma \cdot \varepsilon \) 
2  \( y = 4 (x - 0.5)^2 + \sigma \cdot \varepsilon \) 
3  \( y = 4(2.3 x - 1.3)^3+(2.3 x - 1.3)^2-4(2.3 x - 1.3)+ \sigma \cdot \varepsilon \) 
4  \( y = \sin (8 \pi x ) + \sigma \cdot \varepsilon \) 
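To make the \(R^2\)-based notion of noise concrete, the following sketch (our own illustration, using the linear and sinusoidal rows of the tables above) computes the noise level on the \(R^2\) scale alongside the \(r^2\) score of (x, y); it also shows why a noiseless sinusoid can receive an \(r^2\) near 0.

```python
import numpy as np

def r2_and_score(f, sigma, n=500, seed=0):
    """Noise level on the R^2 scale (squared correlation between y and the
    true f(x)) together with the squared Pearson correlation of (x, y)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    y = f(x) + sigma * rng.normal(size=n)
    r2_level = np.corrcoef(f(x), y)[0, 1] ** 2   # 1 means noiseless
    score = np.corrcoef(x, y)[0, 1] ** 2         # the r^2 dependence score
    return float(r2_level), float(score)

linear = lambda x: x
sine = lambda x: np.sin(8 * np.pi * x)
```

An equitable measure would return similar scores whenever `r2_level` is similar, regardless of which function generated the data; \(r^2\) clearly does not.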
In exploratory data analysis there is often no ground truth. For example, there is no ground truth when the task is to identify the top pair of dependent variables among all possible pairs. In this case, it is not possible to tune the parameters of a particular measure. In this analysis, we therefore relied on the default values provided for the measures in the respective papers, shown in Table 4. With their default parameters, the best measures in terms of equitability are MIC and ACE. More specifically, ACE consistently scores noiseless functional relationships with a value of 1 but seems to fail as the amount of noise increases. MIC and its improved version MIC\(_e\), instead, show good equitability across the board. On the other hand, RIC is not an equitable measure. The scatterplot for RIC is similar to the GMIC and TIC\(_e\) scatterplots. Indeed, all these measures use multiple grids to compute mutual-information-related statistics and aggregate their values. This aggregated grid-based approach seems to be more beneficial when the task is to identify a relationship with high power.
We rank the measures in terms of equitability in Figs. 10 and 11. For each scatterplot we identify the worst case for equitability, i.e., the maximum range of values of \(R^2\) associated with a single value of a measure: that single value corresponds to two completely different levels of noise. For example, the squared Pearson correlation \(r^2\) is equal to 0 both for a completely noiseless sinusoidal relationship and for a completely noisy one. Indeed, \(r^2\) is consistently ranked last in Figs. 10 and 11. MIC\(_e\) proves to be the best overall in this task. Note also that MIC\(_e\) seems to best match the \(R^2\) on these sets of relationships: it is very close to 0 when \(R^2\) is 0 and close to 1 when \(R^2\) is 1. Other work in the literature has proposed to enforce this property using adjustment for chance (Romano et al. 2016; Wang et al. 2017). This is an important property, which enables MIC\(_e\) to be used as a proxy for \(R^2\).
5.3 Application to network inference
We next employ the measures for biological network reverse engineering, a popular and successful application domain for dependency measures (Villaverde et al. 2013). Applications include cellular, metabolic, gene regulatory, and signalling networks. Each of the m variables is associated with a time series of length n. In order to identify the strongest relationships between variables (e.g., genes), a dependency measure \(\mathcal {D}\) is employed. Due to the natural delay of biochemical interactions in biological networks, the strongest dependency might occur only after some time (Xuan et al. 2012). For this reason, we incorporate time delay into the dependency measures as \(\mathcal {D}_{delayed } = \max _{ \tau \in [-\tau _m,+\tau _m]}{ \mathcal {D}\left( X(t-\tau ),Y(t) \right) }\), where \(\mathcal {D}\) is any measure from Table 4 and \(\tau _m\) is the maximum time delay. We collected 10 datasets where the true interactions between the variables are known. A dependency measure is effective on this task if its output is high on real interactions (positive class) and low on non-interacting pairs of variables (negative class). We evaluate the performance of a measure with the average precision (AP), also known as the area under the precision-recall curve. In order to obtain meaningful comparisons and perform statistical hypothesis testing, we performed 50 bootstrap repetitions for each dataset and computed the mean AP (mAP) across the repetitions.
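The delayed dependency \(\mathcal {D}_{delayed }\) can be sketched as follows; this is illustrative code, with the absolute Pearson correlation standing in for an arbitrary measure \(\mathcal {D}\) from Table 4.

```python
import numpy as np

def delayed_dependency(x, y, measure, tau_max):
    """D_delayed = max over tau in [-tau_max, tau_max] of D(X(t - tau), Y(t)),
    aligning the two time series for each candidate lag."""
    best = -np.inf
    for tau in range(-tau_max, tau_max + 1):
        if tau > 0:
            xs, ys = x[:-tau], y[tau:]    # X lags behind Y by tau steps
        elif tau < 0:
            xs, ys = x[-tau:], y[:tau]    # X leads Y by |tau| steps
        else:
            xs, ys = x, y
        best = max(best, measure(xs, ys))
    return best
```

On a pair where Y is a lagged copy of X, the maximum over lags recovers the dependency that a zero-lag measure would miss.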
Summary of the datasets used for network inference (left) and regression (right): n is the number of data points and m is the number of variables
#  Name  n  m 

Network inference datasets  
1  Glycolysis  57  10 
2  Enzymecat  250  8 
3  Smallchain  100  4 
4  Irmaonoff  125  5 
5  Mapk  210  12 
6  Dream410  105  10 
7  Dream4100  210  100 
8  SynT1  100  200 
9  SynT1s  30  200 
10  SynT2  30  40 
Regression datasets  
1  Pyrim  74  27 
2  Bodyfat  252  14 
3  Triazines  186  60 
4  Wisconsin  194  32 
5  Crime  111  144 
6  Pole  1000  48 
7  Qsar  384  482 
8  Qsar2  384  186 
The small amount of data available and the high level of noise in biological time series make this a very challenging task for all statistics. Mutual information estimators have been extensively employed for this task (Villaverde et al. 2013). HSIC has only recently been tested on network inference (Lippert et al. 2009), and even more recently dCorr has been shown to be competitive on this task (Guo et al. 2014). In this task it is important to discriminate powerfully between independent and non-linearly dependent variables. Indeed, measures with high power as discussed in Sect. 5.1 might have an advantage (Guo et al. 2014). Of course, a measure can be even more competitive if it is also equitable. Nonetheless, this is a different task from equitability assessment, and equitability is only part of the picture. This explains the performance of dCorr and HSIC in the literature (Lippert et al. 2009; Guo et al. 2014).
Mean average precision (mAP) on 10 networks: n length of time series; m number of variables
Glycolysis  Enzymecat  Smallchain  Irmaonoff  Mapk  Dream410  Dream4100  SynT1  SynT1s  SynT2  

(n, m)  (57,10)  (250,8)  (100,4)  (125,5)  (210,12)  (105,10)  (210,100)  (100,200)  (30,200)  (30,40) 
RIC  67.5±3.7  91.4±2.0  91.4±1.2  70.5±3.5  57.6±2.2  64.6±6.1  10.3±0.7  7.9±0.4  6.6±0.7  14.1±3.8 
dCorr  67.8±3.0\((=)\)  88.6±2.3\((-)\)  91.7±0.0\((+)\)  68.6±2.6\((-)\)  50.0±1.0\((-)\)  68.8±5.9\((+)\)  12.6±0.6\((+)\)  7.4±0.3\((-)\)  6.7±0.6\((=)\)  16.0±2.5\((+)\) 
\(I_{KDE }\)  67.9±3.7\((=)\)  93.5±2.1\((+)\)  88.1±8.1\((-)\)  71.5±5.6\((=)\)  59.0±3.1\((+)\)  61.1±7.0\((-)\)  9.7±0.7\((-)\)  7.8±0.3\((-)\)  6.4±0.6\((=)\)  10.2±1.8\((-)\) 
\(I_{kNN }\)  65.5±4.6\((-)\)  91.5±3.8\((=)\)  90.0±5.1\((-)\)  72.7±6.6\((+)\)  68.8±2.0\((+)\)  51.4±6.1\((-)\)  8.5±0.8\((-)\)  7.8±0.4\((-)\)  7.0±0.8\((+)\)  10.3±1.8\((-)\) 
\(r^2\)  68.4±2.7\((=)\)  86.0±1.9\((-)\)  91.7±0.0\((+)\)  69.0±2.9\((-)\)  46.9±1.3\((-)\)  56.7±6.1\((-)\)  12.5±0.6\((+)\)  6.7±0.4\((-)\)  6.3±0.6\((-)\)  14.5±2.3\((=)\) 
RDC  61.6±7.6\((-)\)  89.2±4.4\((-)\)  84.4±9.9\((-)\)  68.3±5.3\((-)\)  63.0±2.7\((+)\)  57.8±4.7\((-)\)  11.3±0.9\((+)\)  6.8±0.7\((-)\)  4.2±0.5\((-)\)  10.4±2.7\((-)\) 
\(I_{A }\)  62.4±4.6\((-)\)  89.3±3.5\((-)\)  82.7±10.8\((-)\)  70.9±5.6\((=)\)  57.7±3.5\((=)\)  61.1±7.0\((-)\)  9.4±0.6\((-)\)  7.4±0.4\((-)\)  4.6±0.6\((-)\)  10.3±2.4\((-)\) 
\(I_{ef }\)  62.3±4.3\((-)\)  91.7±3.6\((=)\)  86.4±7.5\((-)\)  73.0±5.1\((+)\)  56.0±3.0\((-)\)  58.4±8.4\((-)\)  9.3±0.6\((-)\)  7.2±0.5\((-)\)  4.5±0.7\((-)\)  10.2±2.5\((-)\) 
\(I_{ew }\)  63.5±4.7\((-)\)  78.2±7.6\((-)\)  90.9±2.1\((=)\)  73.0±6.3\((+)\)  50.6±2.2\((-)\)  56.9±6.1\((-)\)  10.1±0.8\((=)\)  6.8±0.5\((-)\)  4.0±0.4\((-)\)  9.1±1.4\((-)\) 
MIC\(_e\)  66.5±4.5\((=)\)  94.3±4.4\((+)\)  86.1±8.3\((-)\)  69.8±4.3\((=)\)  55.3±6.8\((-)\)  63.7±6.2\((=)\)  11.1±0.8\((+)\)  7.4±0.5\((-)\)  5.5±0.6\((-)\)  11.7±2.5\((-)\) 
MIC  64.4±4.9\((-)\)  75.9±9.6\((-)\)  84.9±10.5\((-)\)  71.2±5.5\((=)\)  45.1±6.8\((-)\)  56.1±8.8\((-)\)  8.8±0.7\((-)\)  6.8±0.5\((-)\)  5.5±0.7\((-)\)  10.1±1.9\((-)\) 
GMIC  66.6±4.2\((=)\)  89.0±3.8\((-)\)  90.3±3.0\((-)\)  68.8±3.7\((-)\)  53.5±6.6\((-)\)  57.2±4.2\((-)\)  10.4±0.7\((=)\)  7.3±0.5\((-)\)  5.7±0.7\((-)\)  12.5±2.8\((-)\) 
MID  35.7±8.6\((-)\)  47.4±11.7\((-)\)  75.2±14.8\((-)\)  79.5±11.2\((+)\)  37.5±4.8\((-)\)  39.6±6.2\((-)\)  3.9±0.4\((-)\)  2.3±0.4\((-)\)  1.6±0.1\((-)\)  8.8±1.5\((-)\) 
ACE  67.4±5.0\((=)\)  88.5±6.2\((-)\)  84.4±10.7\((-)\)  75.6±7.0\((+)\)  62.0±2.0\((+)\)  53.8±7.0\((-)\)  9.9±0.8\((-)\)  7.4±0.4\((-)\)  6.1±0.6\((-)\)  11.0±2.4\((-)\) 
HSIC  64.8±3.7\((-)\)  87.7±3.8\((-)\)  91.5±1.2\((=)\)  68.4±2.4\((-)\)  51.9±2.3\((-)\)  64.5±5.7\((=)\)  9.9±0.9\((-)\)  7.5±0.7\((-)\)  7.1±1.1\((+)\)  11.7±2.4\((-)\) 
\(I_{mean }\)  46.8±2.0\((-)\)  90.2±0.0\((-)\)  91.6±0.7\((=)\)  69.6±3.4\((=)\)  33.0±1.4\((-)\)  65.4±5.6\((=)\)  8.1±0.5\((-)\)  4.6±0.2\((-)\)  2.7±0.2\((-)\)  7.5±0.9\((-)\) 
We use RIC with parameters \(D_{\max } = \lfloor \sqrt{n} \rfloor \) and \(K_r = 20\) because on these tasks it is important to discriminate well between both strong and weak relationships. Figure 12 presents the average rank of the measures across all tested networks. Overall, RIC performs consistently well across all datasets. It clearly outperforms all the discretization based mutual information estimators as well as the other information theoretic measures, including MIC, GMIC, and MID. Among the mutual information estimators, \(I_{KDE }\) and \(I_{kNN }\) show very good results. RIC's main competitor is dCorr, which also performs very well, mainly because of the crucial role of linear relationships between variables in these networks. Its results are highly correlated with those of \(r^2\), which in some cases provides the best result on a single data set thanks to its ability to discriminate linear relationships well. We found RIC particularly competitive on short time series with a large number of variables.
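The grid-averaging idea behind RIC can be sketched in a few lines. This is an illustrative sketch only: the normalization below (mutual information divided by the larger marginal entropy) and the way random cut points are drawn are assumptions for illustration; the paper's Eq. (2) and its algorithms define the actual estimator.

```python
import numpy as np

def normalized_mi(xb, yb):
    """Normalized MI of two discrete label vectors. Dividing by the larger
    marginal entropy is an assumption; Eq. (2) defines RIC's normalization."""
    xs, x = np.unique(xb, return_inverse=True)
    ys, y = np.unique(yb, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x, y), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    h = max(-np.sum(px[px > 0] * np.log(px[px > 0])),
            -np.sum(py[py > 0] * np.log(py[py > 0])))
    return mi / h if h > 0 else 0.0

def ric_sketch(x, y, k_r=20, d_max=None, seed=0):
    """Average normalized MI over K_r grids with random sizes and cut points."""
    rng = np.random.default_rng(seed)
    n = len(x)
    d_max = d_max or max(2, int(np.sqrt(n)))
    total = 0.0
    for _ in range(k_r):
        dx, dy = rng.integers(2, d_max + 1, size=2)       # random grid size
        cx = np.sort(rng.choice(x, dx - 1, replace=False))  # random cut points
        cy = np.sort(rng.choice(y, dy - 1, replace=False))
        total += normalized_mi(np.searchsorted(cx, x), np.searchsorted(cy, y))
    return total / k_r
```

Averaging over many random grids is what drives the variance of the statistic down; increasing `k_r` trades running time for a smoother estimate.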
We want to reiterate that this is an unsupervised task: network inference and identification of noisy relationships do not allow parameter tuning with cross-validation on a given data set. When presented with a new data set, the user can only rely on a measure's default parameters. These could of course be tweaked to surface different top pairs of relationships, but each resulting set of top relationships would have to be inspected individually, because no ground truth is available for validation. Moreover, the real data sets discussed in this paper are only a sample of the possible real data sets that can be analysed. Since our data sets are independent, however, the analysis discussed in the paper can provide a picture of the behavior of the different measures with default parameters.
5.4 Feature filtering for regression
In this section, we evaluate the performance of RIC and the other statistics as feature filtering techniques. A dependency measure \(\mathcal {D}\) can be used to rank the m features \(X_i\) of a regression task according to their ability to predict the target variable Y. Only the top \(m^{\star }\) features according to \(\mathcal {D}\) are used to build a regressor for Y. Table 8 shows the average correlation coefficient between the predicted and the actual target value when the top \(m^{\star } \le 10\) features are used with a kNN regressor (\(k=3\)). Each value is obtained by averaging three random trials of 10-fold cross-validation for each \(m^{\star } \le 10\).
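The filtering protocol described above can be sketched as follows. Here `r2` (squared Pearson correlation) stands in for an arbitrary dependency measure \(\mathcal {D}\) (RIC in the paper), and the kNN regressor is a minimal hand-rolled one, so all names and defaults are illustrative rather than the paper's implementation.

```python
import numpy as np

def r2(a, b):
    """Squared Pearson correlation, a stand-in for a dependency measure D."""
    c = np.corrcoef(a, b)[0, 1]
    return c * c

def filter_features(X, y, m_star, measure=r2):
    """Rank features by D(X_j, y) and keep the indices of the top m_star."""
    scores = np.array([measure(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:m_star]

def knn_cv_corr(X, y, k=3, folds=10, seed=0):
    """Correlation between actual targets and k-fold cross-validated
    predictions of a minimal kNN regressor."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        d = np.linalg.norm(X[test, None, :] - X[None, train, :], axis=2)
        nn = np.argsort(d, axis=1)[:, :k]          # k nearest training points
        pred[test] = y[train][nn].mean(axis=1)     # average their targets
    return np.corrcoef(pred, y)[0, 1]
```

In the paper this score is additionally averaged over three random trials and over every \(m^{\star } \le 10\).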
Correlation coefficient between the predicted and actual target value on 8 datasets using kNN (\(k = 3\))
Pyrim  Bodyfat  Triazines  Wisconsin  Crime  Pole  Qsar  Qsar2  

(n, m)  (74,27)  (252,14)  (186,60)  (194,32)  (111,144)  (1000,48)  (384,482)  (384,186) 
RIC  0.261±0.120  0.642±0.115  0.215±0.120  0.034±0.013  0.892±0.042  0.685±0.156  0.277±0.091  0.479±0.053 
dCorr  0.205±0.046\((-)\)  0.643±0.114\((=)\)  0.118±0.062\((-)\)  0.041±0.012\((=)\)  0.852±0.057\((-)\)  0.686±0.218\((=)\)  0.310±0.025\((+)\)  0.382±0.130\((-)\) 
\(I_{KDE }\)  0.231±0.068\((-)\)  0.635±0.117\((-)\)  0.148±0.095\((-)\)  0.039±0.012\((=)\)  0.614±0.047\((-)\)  0.686±0.221\((=)\)  0.291±0.029\((=)\)  0.424±0.151\((-)\) 
\(I_{kNN }\)  0.216±0.051\((-)\)  0.639±0.116\((-)\)  0.098±0.035\((-)\)  0.038±0.011\((=)\)  0.893±0.051\((=)\)  0.621±0.134\((-)\)  0.300±0.094\((+)\)  0.423±0.025\((-)\) 
\(r^2\)  0.264±0.064\((=)\)  0.644±0.114\((=)\)  0.125±0.050\((-)\)  0.041±0.009\((+)\)  0.870±0.045\((-)\)  0.414±0.311\((-)\)  0.273±0.037\((=)\)  0.375±0.134\((-)\) 
RDC  0.206±0.052\((-)\)  0.642±0.115\((=)\)  0.199±0.079\((=)\)  0.017±0.008\((-)\)  0.891±0.042\((=)\)  0.679±0.197\((=)\)  0.280±0.058\((=)\)  0.430±0.060\((-)\) 
\(I_{A }\)  0.235±0.088\((=)\)  0.640±0.115\((=)\)  0.062±0.047\((-)\)  0.037±0.021\((=)\)  0.891±0.041\((=)\)  0.010±0.007\((-)\)  0.284±0.042\((=)\)  0.418±0.046\((-)\) 
\(I_{ef }\)  0.190±0.068\((-)\)  0.640±0.116\((=)\)  0.171±0.053\((-)\)  0.036±0.015\((=)\)  0.889±0.041\((-)\)  0.693±0.156\((=)\)  0.278±0.104\((=)\)  0.429±0.028\((-)\) 
\(I_{ew }\)  0.249±0.064\((=)\)  0.641±0.115\((=)\)  0.188±0.097\((-)\)  0.033±0.007\((=)\)  0.859±0.059\((-)\)  0.661±0.145\((-)\)  0.264±0.085\((-)\)  0.441±0.046\((-)\) 
MIC  0.186±0.072\((-)\)  0.642±0.114\((=)\)  0.051±0.023\((-)\)  0.010±0.009\((-)\)  0.776±0.040\((-)\)  0.694±0.156\((=)\)  0.293±0.030\((=)\)  0.448±0.039\((-)\) 
MIC\(_e\)  0.187±0.088\((-)\)  0.641±0.115\((=)\)  0.158±0.067\((-)\)  0.028±0.008\((-)\)  0.819±0.059\((-)\)  0.774±0.180\((+)\)  0.301±0.046\((+)\)  0.428±0.056\((-)\) 
GMIC  0.206±0.069\((-)\)  0.634±0.118\((-)\)  0.141±0.056\((-)\)  0.026±0.005\((-)\)  0.803±0.055\((-)\)  0.734±0.179\((+)\)  0.292±0.058\((=)\)  0.468±0.054\((-)\) 
MID  0.241±0.167\((=)\)  0.605±0.137\((-)\)  0.160±0.062\((-)\)  0.047±0.030\((+)\)  0.178±0.047\((-)\)  0.808±0.215\((+)\)  0.194±0.130\((-)\)  0.186±0.074\((-)\) 
ACE  0.221±0.051\((-)\)  0.641±0.115\((=)\)  0.111±0.073\((-)\)  0.011±0.008\((-)\)  0.894±0.042\((=)\)  0.000±0.000\((-)\)  0.270±0.056\((=)\)  0.439±0.023\((-)\) 
HSIC  0.174±0.068\((-)\)  0.638±0.116\((=)\)  0.057±0.063\((-)\)  0.028±0.011\((-)\)  0.853±0.046\((-)\)  0.000±0.000\((-)\)  0.001±0.001\((-)\)  0.000±0.000\((-)\) 
\(I_{mean }\)  0.178±0.073\((-)\)  0.636±0.117\((-)\)  0.073±0.076\((-)\)  0.034±0.011\((=)\)  0.853±0.046\((-)\)  0.000±0.000\((-)\)  0.001±0.001\((-)\)  0.001±0.000\((-)\) 
As for the task of network inference in Sect. 5.3, it is important for a measure to be both equitable and powerful when detecting relationships. Powerful but non-equitable measures have been shown to perform well on this task: e.g., HSIC (Song et al. 2007).
As in Sect. 5.3, we use RIC with \(D_{\max } = \lfloor \sqrt{n} \rfloor \) to avoid the low density grids that are better suited to independence testing tasks. Overall, as can be observed in Fig. 13, RIC performs consistently well on average. RIC is particularly useful when the number of features m is high, and especially when their relationships to the target variable Y are noisy. These are the most challenging scenarios, as reflected by the low correlation coefficient achievable with the selected features, e.g., on the Pyrim and Triazines datasets. We also note the good performance of RIC on datasets containing features that can take only a predefined number of values, e.g., discrete numerical features; Pole, Qsar, and Qsar2 include this type of feature. For such features it is very difficult to optimize a kernel or a grid size, or to find the optimal data transformation yielding the maximal correlation with ACE, which explains the less competitive performance of HSIC, \(I_{KDE }\), \(I_{A }\), \(I_{mean }\), and ACE. RIC is not affected by this problem because there is no optimization: grids are generated at random. Note that RIC's good performance on feature selection is also due to the fact that features with high entropy are penalized by the normalization factor in Eq. (2).
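The penalizing effect of an entropy-based normalization can be illustrated directly. Eq. (2) is not reproduced in this section, so the normalization below (dividing by the larger marginal entropy) is an assumption used only for illustration.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def normalized_mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), divided by max(H(X), H(Y))."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy([f"{a}|{b}" for a, b in zip(x, y)])  # joint labels
    denom = max(hx, hy)
    return (hx + hy - hxy) / denom if denom > 0 else 0.0

# A feature that determines y but has high entropy (all values distinct)
# scores much lower than a low-entropy feature carrying the same information.
y = np.repeat([0, 1], 50)
x_fine = np.arange(100)   # high entropy: H = log(100)
x_coarse = y.copy()       # low entropy: H = log(2)
```

Here `normalized_mi(x_coarse, y)` is 1, while `normalized_mi(x_fine, y)` drops to \(\log 2 / \log 100 \approx 0.15\) even though both features determine y exactly: the normalization penalizes the high-entropy feature.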
5.5 Run time comparison
Figure 14a shows the running time for RIC with default parameters \(K_r = 20\) and \(D_{\max } = \lfloor \sqrt{n} \rfloor \). As for other measures, the running time of RIC depends on its parameter setting. Figure 14b shows the time taken by RIC on \(n=10^3\) records for different values of \(K_r\) and of c, where \(D_{\max } = \lfloor \sqrt{n/c} \rfloor \). Increasing \(K_r\) increases the number of random grids, while increasing c makes the grids coarser. Figure 14b shows plots at varying \(K_r\) for \(c = 4\), \(c=1\), and \(c=0.1\), which respectively yield \(D_{\max } = \lfloor \sqrt{n/4} \rfloor \), \(D_{\max } = \lfloor \sqrt{n} \rfloor \), and \(D_{\max } = \lfloor \sqrt{n \cdot 10} \rfloor \). These are respectively the settings we used for: independence testing under additive noise; network inference and feature filtering; and independence testing under white noise. The latter scenario proved to be the most challenging in terms of RIC running time.
Large \(K_r\) increases the computational time, but is not always required. As discussed in Sect. 5.1, although it is always beneficial to increase \(K_r\) to further decrease the variance of RIC, this is particularly important when n is small. Thus, \(K_r\) can be tuned by the user according to the sample size of the data set analyzed and the available computational budget.
6 Experiments on dependency between two sets of variables
In this section, we compare the performance of measures that quantify the dependency between a set \({\mathbf {X}}\) of p variables and a set \({\mathbf {Y}}\) of q variables. This is different from finding a subset of variables that are significantly correlated; recent advances in that area have yielded interesting measures (Nguyen et al. 2014a; Nguyen and Vreeken 2015). In our paper, we compare the measures discussed in Table 4. The Pearson correlation coefficient, ACE, \(I_{A }\), MIC, GMIC, and MID are not applicable in this scenario, and no straightforward method to extend them to sets of variables is available in the literature.
6.1 Identification of multivariable noisy relationships
We fix the number of variables \(p=3\) for \({\mathbf {X}}\) because some measures require specific tuning with respect to the number of variables considered. For example, the most straightforward way to extend the discretization based estimators of mutual information \(I_{ew }\) and \(I_{ef }\) is to discretize each variable in each set independently. This requires carefully choosing the number of discretization bins for each variable in \({\mathbf {X}}\) and each variable in \({\mathbf {Y}}\). If the same number of bins \(D_X\) is chosen for all the variables in \({\mathbf {X}}\) and the same number of bins \(D_Y\) for all the variables in \({\mathbf {Y}}\), it is possible to end up with as many as \(D_X^p \cdot D_Y^q\) total bins. This makes it practically infeasible to use \(I_{ew }\) and \(I_{ef }\) when p and q are large. Given this limitation of the discretization based estimators of mutual information, we also employed a multivariable discretization of the set of variables \({\mathbf {X}}\), which allows a more sensible choice of the total number of bins. Although methods for multivariable discretization are available in the literature (Garcia et al. 2013; Dougherty et al. 1995), to our knowledge there is no extensive survey of the performance of mutual information estimation with multivariable approaches. Therefore, we chose to discretize \({\mathbf {X}}\) and \({\mathbf {Y}}\) with the k-means clustering algorithm and then compute the mutual information. We name this measure \(I_{kmean }\); it allows us to choose the total number of bins (clusters) to be produced.
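A minimal sketch of \(I_{kmean}\) follows. Both helpers (a plain Lloyd's k-means and an entropy-based plug-in MI estimate) are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def kmeans_labels(Z, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def i_kmean(X, Y, k_x, k_y):
    """Cluster each variable set, then estimate MI between the two label
    vectors as H(X) + H(Y) - H(X,Y)."""
    lx, ly = kmeans_labels(X, k_x), kmeans_labels(Y, k_y)
    return entropy(lx) + entropy(ly) - entropy(lx * k_y + ly)
```

Because the bins are clusters of the full p-dimensional set \({\mathbf {X}}\), the total bin count is simply \(k_x \cdot k_y\) rather than \(D_X^p \cdot D_Y^q\).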
In our case, where \(p = 3\) and \(q = 1\), we compute \(I_{ew }\) and \(I_{ef }\) fixing \(D_Y = 5\) and choose \(D_X\) so as to limit the total number of bins relative to the number n of data points: \(D_X^p \cdot D_Y \le \frac{n}{5} \Rightarrow D_X = \lfloor (n/25)^{1/p} \rfloor \). When \(n = 320\), \(D_X = 2\). We tuned the parameters of every other measure in order to maximize the average power on all relationships. Please refer to “Appendix B” for more details. Regarding RIC, in order to have full control over the number of bins produced, we used the multivariable discretization approach based on random seeds described in Algorithm 4. More specifically, we fixed the number of random seeds to \( \lfloor \sqrt{n/c} \rfloor \), given that choosing the number of random seeds at random might result in configurations with as few as 2 seeds, which strongly deteriorates the discrimination ability of mutual information on multiple variables. The parameter c for RIC that maximizes the average power is \(c=6\), which generates \(\lfloor \sqrt{n/6} \rfloor \) seeds. This setting is very similar to the optimal parameter setting found for testing independence between variables under additive noise in Sect. 5.1. Most measures obtain optimal parameters similar to those obtained when testing for independence between variables; only \(I_{KDE }\) seems to require an even larger kernel width when comparing sets of variables.
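The bin budget above can be checked numerically: with \(p=3\), \(D_Y=5\) and \(n=320\), the constraint \(D_X^p \cdot D_Y \le n/5\) gives \(D_X = 2\).

```python
import math

n, p, d_y = 320, 3, 5
# D_X^p * D_Y <= n/5  =>  D_X <= (n/25)^(1/p)
d_x = math.floor((n / 25) ** (1 / p))
print(d_x)                      # 2
print(d_x ** p * d_y <= n / 5)  # 2^3 * 5 = 40 <= 64: True
```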
6.2 Feature selection for regression
We also tested multivariable measures of dependency on the task of feature selection, using a framework similar to Sect. 5.4. Rather than filtering the features according to their individual importance to the target variable Y, we proceed by forward selection. The optimal set of p features according to a dependency measure is identified by finding the best set of features \({\mathbf {X}} = {\mathbf {X}}^{p-1} \cup \{X_i\}\), where \({\mathbf {X}}^{p-1}\) is the set chosen at the previous iteration of forward selection and \(X_i\) is chosen among the \(m-(p-1)\) remaining features of a dataset. A multivariable dependency measure can be fully employed in this case because we need to compute the dependency between the feature set \({\mathbf {X}}\) and the target variable Y at each step of the iteration.
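The forward-selection loop can be sketched as below. `r2_multi` (squared correlation between y and its least-squares fit on the candidate set) is only a placeholder for a genuine multivariate dependency measure such as RIC or dCorr.

```python
import numpy as np

def r2_multi(X, y):
    """Placeholder multivariate dependency: squared correlation between
    y and its least-squares fit on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    c = np.corrcoef(X1 @ beta, y)[0, 1]
    return c * c

def forward_select(X, y, p, measure=r2_multi):
    """Greedily grow the selected set: at each step add the feature X_i
    that maximizes D(X^{p-1} + {X_i}, y)."""
    selected = []
    for _ in range(p):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        best = max(candidates, key=lambda j: measure(X[:, selected + [j]], y))
        selected.append(best)
    return selected
```

Each step evaluates the measure on a whole candidate set jointly, which is exactly where a multivariable measure is needed and where univariate filters fall short.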
As in Sects. 5.3 and 5.4, we use RIC with \(D_{\max } = \lfloor \sqrt{n} \rfloor \) to avoid the low density grids that are better suited to testing independence under additive noise. We use the random seeds discretization approach of Algorithm 4 with a fixed number of random seeds. We also fix \(D_X = 2\) and \(D_Y = 5\) for the naive discretization based estimators of mutual information. Average results for all the measures are shown in Fig. 17 and a table with detailed comparisons is presented in “Appendix B”. We notice that the performance ranking of the measures changes from the one obtained with the feature filtering approach, although RIC again shows competitive performance against the other approaches. All estimators of mutual information lose positions except for the kernel based estimator \(I_{KDE }\): kernels seem to be more effective on multiple variables than in the univariate scenario. Indeed, HSIC also gains a few positions. RDC's average performance stays the same, and it is still outperformed by dCorr, which performs very well when computed on sets of variables. As previously noted, even in this case RIC outperforms \(I_{kmean }\), a result due to its randomized approach.
7 Conclusion
Footnotes
 1.
RIC implementation is available at https://sites.google.com/site/randinfocoeff/.
 2.
From http://www.iim.csic.es/~gingproc/mider.html (Villaverde et al. 2014).
 3.
From http://code.google.com/p/informationdynamicstoolkit/ (Lizier 2014).
 4.
From http://tinyurl.com/ojlkrla (Margolin et al. 2006).
Acknowledgements
Simone Romano’s work was supported by a Melbourne International Research Scholarship (MIRS). James Bailey’s work was supported by an Australian Research Council Future Fellowship. Experiments were carried out on Amazon cloud supported by AWS in Education Grant Award.
References
 Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
 Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391), 580–598.
 Cellucci, C., Albano, A. M., & Rapp, P. (2005). Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms. Physical Review E, 71(6), 066208.
 Cover, T. M., & Thomas, J. A. (2012). Elements of information theory. New York: Wiley.
 Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
 Dougherty, J., Kohavi, R., Sahami, M., et al. (1995). Supervised and unsupervised discretization of continuous features. Machine learning: Proceedings of the twelfth international conference, 12, 194–202.
 Faivishevsky, L., & Goldberger, J. (2009). ICA based on a smooth estimation of the differential entropy. In Advances in neural information processing systems (pp. 433–440).
 Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In International joint conference on artificial intelligence (IJCAI).
 Fraser, A. M., & Swinney, H. L. (1986). Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2), 1134.
 Garcia, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734–750.
 Geurts, P. (2002). Bias/variance tradeoff and time series classification. PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de Liège.
 Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
 Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.
 Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic learning theory (pp. 63–77). Springer.
 Guo, X., Zhang, Y., Hu, W., Tan, H., & Wang, X. (2014). Inferring nonlinear gene regulatory networks from gene expression data based on distance correlation. PloS ONE, 9(2), e87446.
 Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.
 Huo, X., & Szekely, G. J. (2014). Fast computing for distance covariance. ArXiv preprint arXiv:1410.1503.
 Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson, D. J., III, Protopopescu, V., et al. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2), 026209.
 Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
 Kursa, M. B. (2014). rFerns: An implementation of the random ferns method for general-purpose machine learning. Journal of Statistical Software, 61(10), 1–13.
 Kvalseth, T. O. (1987). Entropy and correlation: Some comments. IEEE Transactions on Systems, Man and Cybernetics, 17(3), 517–519.
 Lippert, C., Stegle, O., Ghahramani, Z., & Borgwardt, K. M. (2009). A kernel method for unsupervised structured network inference. In International conference on artificial intelligence and statistics (pp. 368–375).
 Lizier, J. T. (2014). JIDT: An information-theoretic toolkit for studying the dynamics of complex systems. ArXiv preprint arXiv:1408.3270.
 Lopez-Paz, D., Hennig, P., & Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in neural information processing systems (pp. 1–9).
 Luedtke, A., & Tran, L. (2013). The generalized mean information coefficient. ArXiv preprint arXiv:1308.5712.
 Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., et al. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1), S7.
 Moddemeijer, R. (1989). On estimation of entropy and mutual information of continuous distributions. Signal Processing, 16(3), 233–248.
 Moon, Y.-I., Rajagopalan, B., & Lall, U. (1995). Estimation of mutual information using kernel density estimators. Physical Review E, 52(3), 2318.
 Nguyen, H. V., Müller, E., Vreeken, J., Efros, P., & Böhm, K. (2014a). Multivariate maximal correlation analysis. In Proceedings of the 31st international conference on machine learning (ICML-14) (pp. 775–783).
 Nguyen, H. V., & Vreeken, J. (2015). Universal dependency analysis. ArXiv preprint arXiv:1510.08389.
 Nguyen, X. V., Chan, J., Romano, S., & Bailey, J. (2014b). Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 512–521). ACM.
 Özuysal, M., Fua, P., & Lepetit, V. (2007). Fast keypoint recognition in ten lines of code. In CVPR.
 Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., et al. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518–1524.
 Reshef, D. N., Reshef, Y. A., Sabeti, P. C., & Mitzenmacher, M. M. (2015a). An empirical study of leading measures of dependence. ArXiv preprint arXiv:1505.02214.
 Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., & Mitzenmacher, M. M. (2015b). Measuring dependence powerfully and equitably. ArXiv preprint arXiv:1505.02213.
 Romano, S., Bailey, J., Nguyen, V., & Verspoor, K. (2014). Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In Proceedings of the 31st international conference on machine learning (ICML-14) (pp. 1143–1151).
 Romano, S., Vinh, N. X., Bailey, J., & Verspoor, K. (2016). A framework to adjust dependency measure estimates for chance. In Proceedings of the 2016 SIAM international conference on data mining (pp. 423–431). Society for Industrial and Applied Mathematics.
 Ross, S. (2012). A first course in probability. Upper Saddle River: Pearson.
 Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., & Gross, H.-M. (2010). On estimating mutual information for feature selection. In Artificial neural networks, ICANN 2010 (pp. 362–367). Springer.
 Simon, N., & Tibshirani, R. (2011). Comment on detecting novel associations in large data sets. ArXiv preprint arXiv:1401.7645.
 Song, L., Smola, A., Gretton, A., Borgwardt, K. M., & Bedo, J. (2007). Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on machine learning (pp. 823–830). ACM.
 Steuer, R., Kurths, J., Daub, C. O., Weise, J., & Selbig, J. (2002). The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl 2), S231–S240.
 Sugiyama, M., & Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In Proceedings of the twenty-third international joint conference on artificial intelligence (pp. 1692–1698). AAAI Press.
 Székely, G. J., Rizzo, M. L., et al. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4), 1236–1265.
 Tang, D., Wang, M., Zheng, W., & Wang, H. (2014). RapidMic: Rapid computation of the maximal information coefficient. Evolutionary Bioinformatics Online, 10, 11.
 Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., et al. (2006). SynTReN: A generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics, 7(1), 43.
 Villaverde, A. F., Ross, J., & Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2), 306–329.
 Villaverde, A. F., Ross, J., Morán, F., & Banga, J. R. (2014). MIDER: Network inference with mutual information distance and entropy reduction. PloS ONE, 9(5), e96732.
 Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
 Wang, J., Kumar, S., & Chang, S.-F. (2012). Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2393–2406.
 Wang, Y., Romano, S., Nguyen, V., Bailey, J., Ma, X., & Xia, S.-T. (2017). Unbiased multivariate correlation analysis.
 Xuan, N., Chetty, M., Coppel, R., & Wangikar, P. (2012). Gene regulatory network modeling via global optimization of high-order dynamic Bayesian network. BMC Bioinformatics, 13(1), 131.
 Zhang, Y., Jia, S., Huang, H., Qiu, J., & Zhou, C. (2014). A novel algorithm for the precise calculation of the maximal information coefficient. Scientific Reports, 4, 6662.