Abstract
The identification of relevant features, i.e., the driving variables that determine a process or the properties of a system, is an essential part of the analysis of data sets with a large number of variables. A mathematically rigorous approach to quantifying the relevance of these features is mutual information. Mutual information determines the relevance of features in terms of their joint mutual dependence on the property of interest. However, mutual information requires probability distributions as input, which cannot be reliably estimated from samples of continuously distributed quantities such as lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependences that extends mutual information to random variables of continuous distribution based on cumulative probability distributions. TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of variable sets that are statistically related, possibly nonlinearly, to a property of interest, taking into account the number of data samples as well as the cardinality of the set of variables. We evaluate the performance of our measure with simulated data, compare its performance with similar multivariate-dependence measures, and demonstrate the effectiveness of our feature-selection method on a set of standard data sets and a typical scenario in materials science.
1 Introduction
The past two decades have been marked by an explosion in the availability of scientific data and significant improvements in statistical data analysis. In particular, the physical sciences have seen an unprecedented surge in data exploration aimed at the data-driven discovery of statistical dependencies of physical variables relevant to a property of interest. These observations culminated in the emergence of a new paradigm in science, the so-called “big-data driven” science (Hey et al. 2009).
1.1 Feature selection
The identification of relevant variables, i.e., the properties or driving variables of a process or of a system’s property, has propelled investigations into the underlying processes that generated the data (Guyon and Elisseeff 2003). Such a variable \(X \in \vec {X}\) may be an attribute, parameter, or a combination of properties measured or obtained from experiments or simulations. The fundamental challenge is to find a functional dependency \(f: \vec {X} \mapsto Y\) between a subset of variables \(\vec {X}' \subseteq \vec {X}\) and a certain output Y (target, response function). The objective is to find a set of variables (the so-called features) that maximizes a feature-selection criterion \({\mathcal {Q}}\) with respect to a property of interest Y (Blum and Langley 1997; Kohavi and John 1997),
Feature selection comprises two parts: (i) the choice of a search strategy and (ii) a feature-selection criterion \({\mathcal {Q}}\) for evaluating a feature-subset’s relevance.
Fig. 1 Empirical cumulative entropy \(\hat{{\mathcal {H}}}(Y)\) of a normal distribution for 50 data samples, which are shown as ticks at the bottom of the figure. Insets (a) and (b) show the (ground-truth) probability density (PDF) and cumulative probability (CDF) of the normal distribution, the empirical cumulative distribution \({\hat{P}}(Y \le y)\), and the estimated probability density \({\hat{p}}(y)\). The estimated probability density was obtained by optimizing the bandwidth of a kernel-density estimator through 10-fold cross-validation. Histograms of the PDF and CDF are also drawn to illustrate how continuous distributions can be approximated by discrete, discontinuous functions
1.1.1 Search strategies
There are several search strategies to identify the relevant features of a data set (Narendra and Fukunaga 1977; Siedlecki and Sklansky 1993; Pudil et al. 1994; Eberhart and Kennedy 1995; Michalewicz and Fogel 2004), ranging from optimal solvers (such as exhaustive search or accelerated methods based on the monotonic property of a feature-selection criterion) to sub-optimal solvers (such as greedy, heuristic, or stochastic solvers) (Guyon and Elisseeff 2003; Kohavi and John 1997; Narendra and Fukunaga 1977; Siedlecki and Sklansky 1993; Pudil et al. 1994; Whitney 1971; Pudil et al. 2002; Marill and Green 1963; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016; Forsati et al. 2011; Reunanen 2006). Optimal solvers explore all feature-subset combinations for a global optimum and, as such, are generally impractical for data sets with a large number of features due to cost and time constraints on computer resources. Sub-optimal search strategies (e.g., sequential floating forward selection (Pudil et al. 1994; Whitney 1971), sequential backward elimination (Marill and Green 1963), and the minimal-redundancy-maximal-relevance criterion (Peng et al. 2005)), conversely, balance accuracy and speed, but may not find the optimal set of features with respect to a targeted property. A search strategy that can be used either as an optimal or a sub-optimal solver is branch and bound (Narendra and Fukunaga 1977; Pudil et al. 2002; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016). Branch and bound implicitly performs an exhaustive search, but uses an additional bounding criterion to discard feature subsets whose feature-selection criteria are lower than that of the current best feature subset in the search.
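The bounding step can be sketched as follows. This is a minimal illustration, not the implementation used in this work; `score` and `upper_bound` are hypothetical callables, where `upper_bound` must never underestimate the best score reachable by extending a subset, so pruned branches provably contain no optimum.

```python
def branch_and_bound(features, score, upper_bound):
    """Depth-first search over feature subsets with a bounding criterion.

    `score(subset)` evaluates a feature-selection criterion for a subset;
    `upper_bound(subset)` bounds the best score of the subset and all its
    supersets from above (hypothetical callables for this sketch).
    """
    best_subset, best_score = None, float("-inf")
    stack = [()]  # start from the empty subset
    while stack:
        subset = stack.pop()
        if subset:
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
        # expand: append only features after the last chosen index
        # to enumerate each subset exactly once
        start = features.index(subset[-1]) + 1 if subset else 0
        for f in features[start:]:
            candidate = subset + (f,)
            # bounding step: discard subtrees that cannot beat the incumbent
            if upper_bound(candidate) > best_score:
                stack.append(candidate)
    return best_subset, best_score
```

With a tight bound, large parts of the exponential search tree are never visited, which is what makes branch and bound usable as an optimal solver in practice.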
1.1.2 Feature-selection criterion
The feature-selection criterion \({\mathcal {Q}}\) can be used as a score that allows the identified features to be ranked by relevance prior to subsequent data analyses. The academic community has extensively explored several feature-selection criteria to evaluate a feature’s relevance (Khaire and Dhanalakshmi 2019), including distance measures (Basseville 1989; Almuallim and Dietterich 1994), dependency measures (Modrzejewski 1993), consistency measures (Arauzo-Azofra et al. 2008), and information measures (Vergara and Estévez 2014). Ideally, feature-selection criteria are not restricted to specific types of dependencies, are robust against imprecise values in the data, and are deterministic, i.e., the feature selection is consistent and reproducible for the same set of variables, type of settings, and data. The prevailing method for quantifying multivariate dependences is mutual information, which determines the relevance of variables in terms of their joint mutual dependence on a property of interest (Shannon 1948).
There are several reasons to consider mutual-information-based quantities for feature selection. The two most important reasons are: (i) mutual information quantifies multivariate nonlinear statistical dependencies and (ii) mutual information provides an intuitive quantification of the relevance of a feature subset \(\vec {X}' \subseteq \vec {X}\) relative to an output Y (Vergara and Estévez 2014): it is bounded from below (for variables statistically independent of an output) and can be bounded from above (for functionally dependent variables), increases with the number of data samples, and quantifies the strength of the dependence based on a mathematically rigorous framework from communication theory (Shannon 1948). However, mutual information requires probability distributions, which are problematic for high-dimensional data sets and are difficult to obtain from real-valued data samples of continuous distributions.
1.2 Our approach
We propose total cumulative mutual information (TCMI): a non-parametric, robust, and deterministic measure of the relevance of mutual dependences of continuous distributions between variable sets of different cardinality. TCMI can be applied if the dependence between a set of variables and an output is not yet known and the dependence is nonlinear and multivariate. Like mutual information, TCMI relates the strength of the dependence between a set of variables and an output to the number of data samples. In addition, TCMI relates the strength of the dependence to the cardinality of the subsets. Thus, TCMI allows an unbiased comparison between different sets of variables without depending on externally adjustable parameters.
TCMI is based on cumulative mutual information and inherits many of the properties of mutual-information based feature-selection measures: it is bounded from below and above and monotonically increases the more features are subsequently added to a candidate feature set, but only until all variables related to an output are included. In contrast to other feature-selection measures based on cumulative mutual information, TCMI uses cumulative probability distributions. Cumulative probability distributions can be directly obtained from empirical data of continuous distributions, without the need to quantize the set of variables prior to estimating a feature subset’s dependence to a property of interest.
We combine TCMI with the branch-and-bound algorithm (Narendra and Fukunaga 1977; Pudil et al. 2002; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016), which has proven to be efficient in the discovery of nonlinear functional dependencies (Zheng and Kwoh 2011; Mandros et al. 2017). TCMI therefore identifies a set of variables that are statistically related to an output. As TCMI is model independent, a functional relationship must be constructed (in the following referred to as a model) to relate these features with an output. The model construction is not part of this work, but can be done, for example, through data-analytics techniques such as symbolic regression, both in the genetic-programming (Koza 1994) and in compressed-sensing implementations (Ghiringhelli et al. 2015; Ouyang et al. 2018), or regression tree-based approaches (Breiman et al. 1984).
In brief, our feature-selection procedure can be divided into three steps: In the first step, we quantify the dependence between the set of features and an output as the difference between cumulative marginal and cumulative conditional distributions. In the second step, we estimate the relevance of a feature set by comparing its strength of dependence to the mean dependence of features under the assumption of independent random variables. In the third step, we identify a set of relevant features with the branch-and-bound algorithm to find the set of variables from an initial list that best characterizes an output.
1.3 Outline
The remainder of this work is organized as follows. Section 2 discusses the relationship between TCMI and previous work. Section 3 introduces the theoretical background of cumulative mutual information. Section 4 describes the empirical estimation of cumulative mutual information for continuous distributions from limited sample data. Section 5 explains the steps introduced to adjust the cumulative mutual information with respect to the number of data samples and the cardinality of the feature subset. Section 6 introduces TCMI. Section 7 describes the implementation details of the feature-subset search using the branch-and-bound algorithm in detail. Section 8 reports on the performance evaluation of TCMI on generated data, standard data sets, and on a typical scenario in materials science. In the same section, TCMI is also compared with similar multivariate dependence measures such as cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS, Nguyen et al. (2016), Wang et al. (2017)). We would like to point out that feature selection is a broad area of research and can be achieved using a variety of techniques; this work therefore focuses on feature-selection methods based on mutual information that can be applied prior to subsequent data-analysis tasks. To illustrate this, we provide a real-world example in Sect. 8.2.2 by building a model from the identified features using TCMI, and comparing its performance to in-built feature-selection methods that perform feature selection during model construction. Finally, Sects. 9 and 10 present the discussion and conclusions of this work. Abbreviations, notations, and terminologies are summarized in Tables 1 and 2.
2 Related work
Many dependence measures such as Pearson’s R and Spearman’s rank \(\rho \) correlation coefficients (Pearson 1896; Spearman 1904), distance correlation (DCOR: Székely et al. (2007), Székely and Rizzo (2014)), kernel density estimation (KDE: Scott (1982), Silverman (1986)), or k-nearest neighbor estimation (k-NN: Kozachenko and Leonenko (1987)) are limited to bivariate dependencies (Pearson, Spearman), are restricted to specific types of dependencies (Spearman, DCOR), or require assumptions about the functional form of f (KDE, k-NN). Furthermore, bivariate extensions (Schmid and Schmidt 2007), KDE, and k-NN are not applicable to high-dimensional data sets.
For high-dimensional data sets, several authors proposed subspace-slicing techniques (Fouché and Böhm 2019; Fouché et al. 2021; Keller et al. 2012), which repeatedly apply conditions on each variable and perform statistical hypothesis tests to estimate the degree of dependence to an output. However, these methods require all possible combinations of variables to be enumerated and are therefore computationally intractable for feature-selection tasks. In addition, the strength of their dependences is not related to the cardinality of feature subsets and therefore cannot be used to compare different sets of variables.
Model-dependent methods such as data-analytics techniques (Koza 1994; Ghiringhelli et al. 2015; Ouyang et al. 2018; Breiman et al. 1984) perform feature selection while creating a model. Alternatively, post-hoc analysis tools such as unified dependence measures (Lundberg and Lee 2017) can be used to assign each variable an importance value for a particular estimation. However, these methods add an additional degree of complexity, which makes it difficult to reliably assess the dependence among variables. Another approach is information-theoretic dependence measures. These measures are based on mutual information and ascertain whether or not the values of a set of variables are related to an output. As a result, they provide a model-independent approach to estimating the strength of dependences between variables.
Multivariate extensions to mutual information (e.g., interaction information (McGill 1954) and total correlation (Watanabe 1960)) require knowledge about the underlying probability distributions and are therefore difficult to estimate: estimations either require large amounts of data and need to be specified for each new data set at hand (Belghazi et al. 2018) or are affected by the curse of dimensionality (Bellman 1957), as is the case for the Kozachenko-Leonenko estimator (Kozachenko and Leonenko 1987; Kraskov et al. 2004). Recently, several authors proposed related approaches to extending mutual information: cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)). All three methods estimate the strength of dependence between a set of variables from cumulative probability distributions, and thus can be viewed as alternative measures of uncertainty that extend Shannon entropy (and mutual information) to random variables of multivariate continuous distributions.
CMI quantifies multivariate mutual dependences by considering the cumulative distributions of variables and by heuristically approximating the conditional cumulative entropy via data summarization and clustering. MAC is based on Shannon entropy over discretized data, which MAC obtains by maximizing the normalized total correlation with respect to cumulative entropy. UDS uses optimal discretization to compute the conditional cumulative entropy, where the Shannon entropy defines the number of bins required to estimate the conditional cumulative entropy. Since all three dependence measures expose adjustable parameters to optimize the quantization of continuous distributions, the choice of these parameters has a strong impact on the strength of mutual dependence between a set of variables \(\vec {X} = \{ X_1, \ldots , X_n \}\) and an output Y, and thus on the ranking induced by the relevance function and a feature-selection criterion. This makes these measures impractical for feature selection.
Our approach, TCMI, extends CMI, but does not require quantizing real-valued data samples to estimate the joint cumulative distribution of continuous distributions. TCMI therefore does not require data-summarization techniques to estimate multivariate dependences between continuous distributions, unlike similar approaches such as MAC and UDS. TCMI is non-parametric, as opposed to other estimation methods based on mutual information, e.g., neural networks (Belghazi et al. 2018) or mutual-information-based feature-selection algorithms originally developed for discrete data (Kwak and Choi 2002; Chow and Huang 2005; Estevez et al. 2009; Hu et al. 2011; Reshef et al. 2011; Bennasar et al. 2015). TCMI therefore allows the strength of dependence between different sets of variables to be compared reliably. On top of that, TCMI relates the strength of a dependence to a feature subset’s cardinality and the number of data samples by basing the score on the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010).
3 Theoretical background
Mutual information and all measures presented in the following quantify relevance by means of the similarity between two distributions \(U(\vec {X}, Y)\) and \(V(\vec {X}, Y)\) using Kullback-Leibler divergence, \(D_\text {KL}(U(\vec {X}, Y) \Vert V(\vec {X}, Y))\) (Kullback and Leibler 1951). They do not require any explicit modeling to quantify linear and nonlinear dependencies, monotonically increase with the cardinality of a feature’s subset \(\vec {X}' \subseteq \vec {X}\),
and are invariant under invertible transformations such as translations and reparameterizations that preserve the order of the values of variables \(\vec {X}\) and of an output Y (Kullback 1959; Vergara and Estévez 2014).
For illustration purposes, only the bivariate case of a variable X and an output Y is discussed in the theoretical section. However, a generalization to multiple variables can be derived directly from the independence assumption of random variables, as will be done in later sections.
3.1 Mutual information
Mutual information (Shannon and Weaver 1949; Cover and Thomas 2006) relates the joint probability distribution p(x, y) of two discrete random variables with the product of their marginal distribution p(x) and p(y),
Mutual information is non-negative, is zero if and only if the variables are statistically independent, \(p(x, y) = p(x) p(y)\) (independence assumption of random variables), and increases monotonically with the mutual interdependence of variables otherwise. Further, mutual information indicates the reduction in the uncertainty of Y given X as \(I(Y; X) = H(Y) - H(Y \vert X)\), where H(Y) denotes the Shannon entropy and \(H(Y \vert X)\) the conditional entropy (Shannon 1948). Shannon entropy H(Y) is defined as the expected value of the negative logarithm of the probability density p(y),
and can be interpreted as a measure of the uncertainty on the occurrence of events y whose probability density p(y) is described by the random variable Y.
The conditional entropy \(H(Y \vert X)\) quantifies the amount of uncertainty about the value of Y, provided the value of X is known. It is given by
where \(p(y \vert x) = p(y, x) / p(x)\) is the conditional probability of y given x. Clearly, \(0 \le H(Y \vert X) \le H(Y)\), with \(H(Y \vert X) = 0\) if Y is functionally dependent on X and \(H(Y \vert X) = H(Y)\) if the variables are independent of each other.
Although mutual information is restricted to the closed interval \(0 \le I(Y; X) \le H(Y)\), the upper bound is still dependent on Y. To facilitate comparisons, mutual information is normalized,
Normalized mutual information, hereafter referred to as fraction of information (also known as coefficients of constraint (Coombs et al. 1970), uncertainty coefficient (Press et al. 1988), or proficiency (White et al. 2004)) quantifies the proportional reduction in uncertainty about Y when X is given. It is in the range \(D: {\mathbb {R}} \times {\mathbb {R}} \rightarrow [0, 1]\), where 0 and 1 represent statistical independence and functional dependence, respectively (Reimherr and Nicolae 2013).
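For discrete samples, the fraction of information can be computed directly from observed frequencies. The following sketch is our illustration (not part of the original method); it uses the chain rule \(H(Y \vert X) = H(X, Y) - H(X)\) to obtain the conditional entropy:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H in bits, estimated from a discrete sample."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def fraction_of_information(y, x):
    """D(Y; X) = (H(Y) - H(Y|X)) / H(Y) for paired discrete samples."""
    h_y = entropy(y)
    h_y_given_x = entropy(list(zip(x, y))) - entropy(x)  # chain rule
    return (h_y - h_y_given_x) / h_y
```

For a functional dependence (y a deterministic function of x) the score approaches 1; for independent variables it approaches 0, matching the bounds stated above.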
3.2 Probability and cumulative distributions
Mutual information and fraction of information are only defined for discrete distributions. Although mutual information can be generalized to continuous distributions,
probability densities are not always accessible from sample data and therefore need to be estimated. Common algorithms for probability-density estimations are clustering (Nguyen et al. 2013; Pfitzner et al. 2008; Xu and Tian 2015), discretization (Fayyad and Irani 1993; Dougherty et al. 1995; Nguyen et al. 2014a), and density estimation (Keller et al. 2012; Garcia 2010; Bernacchia and Pigolotti 2011; O’Brien et al. 2014, 2016). However, these methods implicitly introduce adjustable parameters whose choice has a strong impact on the strength of mutual dependence between a set of variables \(\vec {X} = \{ X_1, \ldots , X_n \}\) and an output Y, and thus on the ranking induced by the relevance function and a feature-selection criterion (cf., Fig. 1). In practice, such approaches are extremely dependent on the applied parameter set and therefore are sensitive to the scale of variables (cf., Sect. 8).
An alternative to probability distributions is to use cumulative probability distributions to determine the mutual dependence between variables. The cumulative probability distribution P (and the residual cumulative distribution \(P' \approx 1 - P\)) of a variable X evaluated at x describes the probability that X takes on a value less than or equal to x (or a value greater than or equal to x, respectively),
Where the derivatives exist, cumulative distributions are the anti-derivatives of the probability densities,
Both residual and cumulative distributions are defined for continuous and discrete variables and are based on accumulated statistics. As such, they are more regular and less sensitive to statistical noise than probability distributions (Crescenzo and Longobardi 2009b, a). In particular, they are monotonically increasing and decreasing, respectively, i.e., \(P(x_1) \le P(x_2)\) or \(P'(x_1) \ge P'(x_2)\), \(\forall x_1 \le x_2\), with limits
Similar to probability distributions, cumulative and residual cumulative distributions are invariant under a change of variables. However, they are invariant only for parameterizations that preserve the order of the values of each variable \(X \in \vec {X}\) and Y. Positive monotonic transformations \({\mathcal {T}}: {\mathbb {R}} \rightarrow {\mathbb {R}}\),
such as translations and nonlinear monotonic scalings of variables are among the transformations under which cumulative distributions remain invariant. In contrast, order-reversing and especially non-invertible mappings (Mira 2007) (such as sign inversion, \({\mathcal {T}}(X) = -X\), and non-bijective transformations, e.g., \({\mathcal {T}}(X) = |X |\)) change the order of the values of a variable and with it the cumulative distribution. Consequently, such mappings must be considered as additional variables during feature selection if it is expected that such transformations might be related to an output.
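This invariance is easy to verify numerically: the empirical cumulative distribution evaluated at the sample points is unchanged by any order-preserving transformation, but changes under an order-reversing one. A minimal sketch for illustration:

```python
def ecdf_values(sample):
    """Empirical cumulative distribution P(X <= x) at each sample point."""
    n = len(sample)
    return [sum(xj <= xi for xj in sample) / n for xi in sample]

data = [0.2, 1.5, -0.7, 3.1]
monotone = [2 * x + 1 for x in data]   # order-preserving: ECDF values unchanged
negated = [-x for x in data]           # order-reversing: ECDF values change
```

Any strictly increasing map (translation, exponential, cube, ...) leaves the ECDF values at the sample points untouched, which is the invariance exploited later in the computation.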
3.3 Cumulative mutual information
Cumulative mutual information is an alternative measure of uncertainty that extends Shannon entropy (and mutual information) to random variables of continuous distributions. Cumulative mutual information has the same properties as mutual information to be monotonically increasing with the cardinality of the set of variables (Eq. 2). Analogous to mutual information, cumulative mutual information describes the inherent dependence expressed in the joint cumulative distribution \(P(x, y) = P(X \le x, Y \le y)\) of random variables \(x \in X\) and \(y \in Y\) relative to the product of their marginal cumulative distribution P(x) and P(y),
The independence assumption of random variables, \(P(y, x) = P(y) P(x)\), induces a measure that is again zero only if variables X and Y are statistically independent, and non-negative otherwise. Similarly to mutual information, cumulative mutual information quantifies the degree of dependency as the reduction in the uncertainty of Y given X, i.e., \({\mathcal {I}}(Y; X) = {\mathcal {H}}(Y) - {\mathcal {H}}(Y \vert X)\). It is a function of cumulative entropy \({\mathcal {H}}(Y)\) and conditional cumulative entropy \({\mathcal {H}}(Y \vert X)\),
where \(P(y \vert x) = P(y, x) / P(x)\) is the conditional cumulative distribution of \(Y \le y\) given \(X \le x\) (cf., Table 2). Again, \({\mathcal {H}}(Y \vert X) = 0\) if variables X and Y are functionally dependent and \({\mathcal {H}}(Y \vert X) = {\mathcal {H}}(Y)\) if variables X and Y are independent of each other.
Bounds restrict cumulative mutual information to a closed interval \(0 \le {\mathcal {I}}(Y; X) \le {\mathcal {H}}(Y)\) with an upper bound dependent on Y. For this reason, cumulative mutual information is normalized,
and, likewise to mutual information, is hereafter referred to as fraction of cumulative mutual information.
4 Empirical estimations of cumulative entropy and cumulative mutual information
The closed-form expression of cumulative mutual information (Eq. 13) quantifies the dependence of a set of variables based on the assumption of smooth and differentiable cumulative distributions. Due to the limited availability of data, however, the exact functional shape of the cumulative distribution is not directly accessible and hence must be empirically inferred from a limited set of sample data.
For this reason, let us assume an empirical sample \(\{(y_1, x_1)\), \((y_2, x_2)\), \(\ldots \), \((y_n, x_n)\}\) drawn independently and identically distributed (i.i.d.) according to the joint distribution of X and Y. Such sample data induces empirical (cumulative) probability distributions for all variables \(Z \in \{Y, X\}\), which lead to empirical estimates \(\hat{{\mathcal {E}}}\) of an estimator \({{\mathcal {E}}}\) (cf., Table 2).
Based on the maximum likelihood estimate (Dutta 1966; Rossi 2018), the cumulative probability distribution \({\hat{P}}(Z \le z)\) can be obtained by counting the frequency of occurring values of a variable Z:
where \({\mathbf {1}}_A\) denotes the indicator function that is one if A is true, and zero otherwise. Equation 17 asymptotically converges to \(P(Z \le z)\) as \(n \rightarrow \infty \) for every value of \(z \in Z\) (Glivenko-Cantelli theorem: Glivenko (1933), Cantelli (1933)). Thus, any empirical estimate, \(\hat{{\mathcal {E}}}\), based on empirical cumulative distributions converges pointwise as \(n \rightarrow \infty \) to the actual value of \({\mathcal {E}}\), i.e., \(\hat{{\mathcal {E}}}(Z) \rightarrow {\mathcal {E}}(Z)\) (Rao et al. 2004; Crescenzo and Longobardi 2009a).
4.1 Empirical cumulative entropy
For i.i.d. random samples that may contain repeated values, the maximum likelihood estimate of the cumulative entropy \({\mathcal {H}}\) (Eq. 15) can be obtained by calculating the empirical cumulative distribution \({\hat{P}}\) according to Eq. 17,
where \(y_{(i)}\) denotes the values \(y_{(0)}< y_{(1)}< \cdots < y_{(k)}\) occurring in the data set in sorted order of Y with \(y_{(0)} = -\infty \), multiplicity \(n_i = \left|\{ j \in n: y_{(i - 1)} < y_j \le y_{(i)}\} \right|\), and constraint \(n = \sum _{i = 1}^k n_i\).
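The estimator can be sketched as follows (our illustration, using the natural logarithm). Repeated values contribute nothing because their gaps \(y_{(i)} - y_{(i-1)}\) vanish, and the result scales with the range of Y, as noted later:

```python
from math import log

def empirical_cumulative_entropy(y):
    """H(Y) estimated as -sum over sorted sample of
    (y_(i+1) - y_(i)) * (i/n) * ln(i/n)."""
    ys = sorted(y)
    n = len(ys)
    # gap ys[i] - ys[i-1] is weighted by the cumulative proportion i/n
    return -sum((ys[i] - ys[i - 1]) * (i / n) * log(i / n)
                for i in range(1, n))
```

A single pass over the sorted sample suffices, consistent with the linear (up to sorting) cost of the empirical cumulative entropy mentioned below.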
4.2 Empirical conditional cumulative entropy
Similar to empirical cumulative entropy, conditional cumulative entropy can be estimated by
where \({\hat{P}}(y_i, \vec {x}_j)\) denotes the joint cumulative distribution of \(y_i \in Y\), \(\vec {x}_j \in \vec {X}\), \(\vec {X} = \{ X_1, \ldots , X_d \}\), and \(x_i^{(k)} \in X_k\) is the i-th component of the k-th variable of the data set (\(k = 1, \ldots , d\)). In contrast to the empirical cumulative entropy, which can be calculated from the set of sample data with linear time complexity \({\mathcal {O}}(n)\), the empirical conditional cumulative entropy has time complexity \({\mathcal {O}}(n^d)\), i.e., exponential in the number of variables d. The non-parametric estimation of the joint or conditional cumulative distribution therefore becomes computationally demanding for data sets with a large number d of variables and data samples n.
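The counting underlying the joint cumulative distribution can be sketched as follows (our illustration): evaluating \({\hat{P}}\) only at the n observed sample points already costs \({\mathcal {O}}(n^2 d)\) pairwise comparisons, while evaluating it over the full grid of distinct values, as required by the entropy sum, leads to the \({\mathcal {O}}(n^d)\) scaling noted above.

```python
def joint_cumulative(sample):
    """P(X_1 <= x_1, ..., X_d <= x_d) estimated at every sample point
    by counting points dominated component-wise."""
    n = len(sample)
    return [
        sum(all(a <= b for a, b in zip(other, point)) for other in sample) / n
        for point in sample
    ]
```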
4.3 Empirical cumulative mutual information
By construction, cumulative entropy is sensitive to the range of Y (cf., Eq. 18). The same is true for the conditional cumulative entropy \({\mathcal {H}}(Y \vert X) \) and its empirical estimate \(\hat{{\mathcal {H}}}(Y \vert X)\) (cf., Eq. 19). The fraction of cumulative mutual information \({{\mathcal {D}}}\), i.e., cumulative mutual information normalized by the cumulative entropy, is independent of the scale of X and Y (Eq. 16). Formally, its empirical estimate \(\hat{{\mathcal {D}}}\) is given by
Computationally, we apply the following trick: to eliminate the implicit scale dependence of X, we use the fact that variables X are invariant under rank-order-preserving transformations \({\mathcal {T}}\) (Eq. 12). All variables can thus be scaled to \(x' = {\mathcal {T}}(x)\) such that \(\Delta x_i' = x_{i + 1}' - x_i'\) is constant and the volume element \(dx'\) in the integrals cancels out (cf., Eq. 15).
Such a transformation is always possible and effectively removes the implicit range dependence of variables from the fraction of cumulative mutual information in the computation.
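One such order-preserving transformation is the map from values to their ranks (a sketch for illustration): consecutive distinct values then differ by exactly one, so \(\Delta x_i'\) is constant, and tied values remain tied.

```python
def rank_transform(x):
    """Replace each value by the rank of its distinct value.

    Order-preserving: consecutive distinct ranks differ by exactly 1,
    so the spacing of the transformed variable is constant.
    """
    rank = {v: i for i, v in enumerate(sorted(set(x)))}
    return [rank[v] for v in x]
```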
5 Baseline adjustment
The limited availability of data makes it challenging to estimate dependencies with empirical estimators. Because measures are meant to provide a comparison mechanism, empirical estimators need to assign a value (dependence score) close to zero to statistically independent variables and a score close to one to functionally dependent variables. However, empirical estimators based on mutual information are known to never reach their theoretical maximum (functional dependence) or minimum (statistical independence), respectively, and to assign stronger dependences to larger sets of variables regardless of the underlying relationship (Fouché and Böhm 2019; Fouché et al. 2021; Vinh et al. 2009, 2010). Consequently, measures based on mutual information have a considerable inherent bias and may incorrectly identify variables as relevant that are not related to an output Y. To actually compare dependence measures between subsets and different sizes of variable sets, an adjustment to mutual information is necessary. One solution is to compare the value of an empirical estimator \(\hat{{\mathcal {E}}}\) for a set of variables X and an output Y to its mean \(\hat{{\mathcal {E}}}_0\),
The mean \(\hat{{\mathcal {E}}}_0\) is taken over random permutations of all variables, applied independently for each data sample, i.e.,
where \(M \in {\mathcal {M}}\) is a specific realization of such a permutation. The underlying intuition is that the actual value of an empirical estimator \(\hat{{\mathcal {E}}}\) may be caused by spurious (random) dependences. Therefore, by considering all random permutations of all variables independently for each data sample, the spurious contribution of the empirical estimator can be factored out and an adjusted unbiased empirical estimator obtained. The permutations can be computed by enumeration, which however is impractical. An alternative description is provided by a hypergeometric model of randomness (Vinh et al. 2009; Romano et al. 2014) (also known as permutation model (Lancaster 1969)). Such a model describes the permutation of variables as (cumulative) probability distributions, where the average can be calculated separately for each sample of a data set with quadratic complexity. Under the independence assumption of random variables (Vinh et al. 2009, 2010), we derived the correction term for cumulative mutual information as follows,
where the difference \(\Delta y_i(M)\) between two consecutive values of Y can be described by a binomial distribution,
\(k_\text {max}\) is the upper limit, given by \(k_\text {max} = \min (n - b_j + 1, r - i)\), and \({\mathcal {N}}\) is a normalization constant,
and \({\mathcal {P}}(n_{ij}, a_i, b_j \vert M)\) is the probability of encountering an associative cumulative contingency table subject to fixed marginals between all permutations of two variables X and Y with \(|Y_i |= a_i\), \(i = 1, \ldots , r\) and \(|X_j |= b_j\), \(j = 1, \ldots , c\). \(n_{ij}\) is a specific realization of the joint cumulative distribution \(P(y_i, x_j)\) given row marginal \(a_i\) and column marginal \(b_j\). The details can be found in the appendix and are analogous to the baseline adjustment for mutual information (Vinh et al. 2009).
The empirical estimator \(\hat{{\mathcal {E}}}_0\) in Eq. 22 is required to vanish for a large number of samples, \(\hat{{\mathcal {E}}}_0(Y; X)~\rightarrow ~0\) as \(n~\rightarrow ~\infty \), if there is an exact functional dependence between X and Y (Romano et al. 2016). Further, \(\hat{{\mathcal {E}}}_0\) is required to be zero if the variables are proportional to the output, \(\hat{{\mathcal {E}}}_0(Y; X)~\rightarrow ~0\) as \(X~\rightarrow ~Y\).
In practice, \(\hat{{\mathcal {E}}}_0\) is generally greater than zero when the number of data samples is limited and can become as large as \(\hat{{\mathcal {E}}}\) when the number of data samples is very small. \(\hat{{\mathcal {E}}}_0\) can therefore be interpreted as a correction term for comparing empirical estimates of different sets of variables on a common baseline: In general, if the value of the correction term is large, more data samples are needed to reliably estimate the dependence between X and Y. If the value of the correction term is small, the adjusted empirical estimator indicates either a strong mutual dependence between X and Y (high \(\hat{{\mathcal {E}}}^*\)) or a weak mutual dependence, if the variables of the data set are not related to Y (low \(\hat{{\mathcal {E}}}^*\)).
For cumulative mutual information, we define the empirical estimator as follows
where \(\hat{{\mathcal {I}}}^*(Y; X)\) is the adjusted empirical cumulative mutual information, \(\hat{{\mathcal {D}}}^*(Y; X)\) is the adjusted fraction of empirical cumulative information, and \(\hat{{\mathcal {I}}}_0(Y; X)\) is the expected cumulative mutual information under the independence assumption of random variables.
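One common adjusted-for-chance normalization, in the spirit of Vinh et al. (2009), rescales the raw score so that the permutation baseline maps to zero and exact functional dependence maps to one. This is a sketch of that idea; the exact normalization used for \(\hat{{\mathcal {D}}}^*\) may differ in detail:

```python
def adjusted_score(score, baseline):
    """Adjusted-for-chance normalization: 0 at the permutation
    baseline, 1 at exact functional dependence. Degenerate baselines
    (>= 1, possible for very few samples) yield an adjusted score of 0."""
    if baseline >= 1.0:
        return 0.0
    return (score - baseline) / (1.0 - baseline)
```

With this convention, a raw score at the baseline carries no evidence of dependence, regardless of how large the baseline itself is.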
6 Total cumulative mutual information
Empirical cumulative mutual information provides a non-parametric deterministic measure to estimate the dependence of continuous distributions. Equation 16 estimates cumulative mutual information based on cumulative probability distributions, \(P(X) = P(X \le x)\). Similarly, a measure can be instantiated for residual cumulative probability distributions, \(P'(X) := P(X \ge x) = 1 - P(X \le x)\),
Both measures \({\mathcal {D}}(Y; X)\) and \({\mathcal {D}}'(Y; X)\) estimate the dependence between a set of variables and an output from opposite sides of the distribution; therefore, they set lower and upper bounds on the information they contain. As the sample size increases to infinity, both measures converge to the same value. However, due to the limited number of data samples (cf., Sect. 5), these measures differ in practice and need to be adjusted,
The baseline adjustment turns both measures convex by relating the strength of a dependence among variables with the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010). They can therefore be used to efficiently search for the strongest mutual dependence between a set of variables and an output, e.g., by using the minimum contribution of fraction of empirical cumulative mutual information of the two measures,
Total cumulative mutual information (TCMI) combines \(\hat{{\mathcal {D}}}^*(Y; X)\) and \(\hat{{\mathcal {D}}}^{\prime *}(Y; X)\) into a single measure. TCMI is defined as the average strength of cumulative mutual dependence between a set of variables X and an output Y,
where
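Taking the text at face value, TCMI is the plain average of the two adjusted contributions, while the minimum of the two serves as the conservative score during the subset search. A minimal sketch (function names are ours, not the paper's):

```python
def tcmi(d_star, d_prime_star):
    """TCMI: average of the adjusted estimates from the cumulative
    and the residual cumulative distribution."""
    return 0.5 * (d_star + d_prime_star)

def search_bound(d_star, d_prime_star):
    """Conservative score used during the subset search: the smaller
    of the two adjusted contributions."""
    return min(d_star, d_prime_star)
```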
7 Feature selection
Feature selection (Eq. 1) is an optimization problem that either requires a convex dependence measure or additional criteria to judge the optimality of a feature set (Yu and Príncipe 2019). Measures based on (cumulative) mutual information do not meet either requirement, but an adjusted measure such as TCMI does.
As already mentioned in the introduction, the optimal search strategy (subset selection) of k features from an initial set of variables \(\vec {X} = \{ X_1, \ldots , X_d \}\) is a combinatorial, exhaustive search procedure that is only applicable to low-dimensional problems. An efficient alternative to the exhaustive search is the (depth-first) branch-and-bound algorithm (Land and Doig 1960; Narendra and Fukunaga 1977; Clausen 1999; Morrison et al. 2016). The branch-and-bound algorithm is guaranteed to find an optimal set of feature variables without evaluating all possible subsets. Its performance depends crucially on the variables of a data set and on the maximum strength of the mutual dependence between a set of variables and an output. It may be that an output is only weakly related to the variables in the data set, making it necessary to repeat the feature selection with a different set of variables. It may also be that prior knowledge of potential mutual dependences is available, which speeds up the feature selection (e.g., knowing that only \(m < k\) of the d variables X are related to Y, so that not all combinations need to be implicitly enumerated).
The branch-and-bound algorithm maximizes an objective function \({\mathcal {Q}}^*: \vec {X}' \rightarrow {\mathbb {R}}\) defined on a subset of variables \(\vec {X}' \subseteq \vec {X}\) by making use of the monotonicity condition of a feature-selection criterion, \({\mathcal {Q}}: \vec {X}' \rightarrow {\mathbb {R}}\), and a bounding criterion, \(\bar{{\mathcal {Q}}}: \vec {X}' \rightarrow {\mathbb {R}}\). The monotonicity condition requires that feature subsets \(\vec {X}_1\), \(\vec {X}_2\), \(\ldots \), \(\vec {X}_k\), \(k=1,\ldots ,d\), obtained by sequentially adding k features from the set of variables \(\vec {X}\), satisfy
so that the feature-selection criterion \({\mathcal {Q}}\) and bounding criterion \(\bar{{\mathcal {Q}}}\) are monotonically increasing and decreasing respectively,
The branch-and-bound algorithm builds a search tree of feature subsets \(\vec {X}' \subseteq \vec {X}\) with increasing cardinality (Clausen 1999; Morrison et al. 2016) (Alg. 1 and Fig. 2). Initially the tree contains only the empty subset (the root node). At each iteration, a limited number of (non-redundant) sub-trees are generated by augmenting the current subset with one variable \(X \in \vec {X}\) at a time and adding it to the search tree (branching step). While traversing the tree from the root down to the terminal nodes from left to right, the algorithm keeps track of the currently best subset \(X^* := \vec {X}_k\) and the objective function value it yields (the current maximum). Whenever the current maximum of the objective function \({\mathcal {Q}}^*\) exceeds the bounding criterion \(\bar{{\mathcal {Q}}}\) of a sub-tree (either because of condition Eq. 35 or because the bounding criterion is lower than the current maximum value of the objective function), the sub-tree can be pruned and its computations skipped (bounding step). Once the entire tree has been examined, the search terminates and the optimal set of variables is returned, along with a ranking of sub-optimal variable sets in descending order of their objective function values.
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10618-022-00847-y/MediaObjects/10618_2022_847_Figa_HTML.png)
As the objective function we set
the criterion function to be
and, as a pruning rule, the bounding criterion to be (cf., Eq. 31),
Proofs for the monotonicity conditions for \({\mathcal {Q}}\) and \({\bar{Q}}\) follow similar arguments as for Shannon entropy (Mandros et al. 2017) and are provided in the appendix.
Example of a depth-first tree search strategy of the branch-and-bound algorithm (Land and Doig 1960; Narendra and Fukunaga 1977; Clausen 1999; Morrison et al. 2016) to search for the optimal subset of features. Shown is the tree traversal going from top to bottom and left to right (dashed arrows), the estimated fraction of total cumulative information (objective function, inside the circles), subsets of features (labels at the bottom of the circles), the fraction of cumulative information (criterion function, first number next to the circles), and the expected fraction of cumulative information contribution (bounding function, second number next to the circles). Capital roman numerals indicate applied pruning rules or updates of the current maximum objective function. Whenever the objective function at an internal node exceeds the bounding function of a sub-tree (I), or the bounding function decreases (II), either due to condition Eq. 35 or because it is lower than the current maximum value of the objective function (III), the sub-tree can be pruned and its computations skipped. On termination of the algorithm, the bound contains the optimum objective function value (IV)
7.1 Complexity analysis
The computational complexity of the branch-and-bound algorithm is largely determined by two factors: the branching factor B and the depth D of the tree (Morrison et al. 2016). The branching factor is the maximum number of variable combinations generated at each level l of the tree and can be estimated by the central binomial coefficient \(B \le \max _{l = 1, \ldots , D} \left( {\begin{array}{c}d\\ l\end{array}}\right) \approx \left( {\begin{array}{c}d\\ d/2\end{array}}\right) \), if \(\vec {X}\) has d variables. The depth D of the tree is given by the largest cardinality of a variable set, represented as the longest path in the tree from the root to a terminal node. The ranking of the variable sets involves \({\mathcal {O}}((n \log n)^d)\) sorting operations when all variables are relevant. Thus, any branch-and-bound implementation has worst-case \({\mathcal {O}}(M \cdot B^D)\) computational time complexity, where M is the time needed to evaluate the feature-selection criterion for a combination of variables in the tree.
In the worst case, for n data samples and d variables, cumulative mutual information requires evaluating the integral \({\mathcal {O}}(n^d)\) times and \({\mathcal {O}}(n^2)\) operations to calculate the baseline-adjustment term. Thus, TCMI has time complexity \(M \sim {\mathcal {O}}(n^d)\), and a feature-subset search in the current implementation suffers from the curse of dimensionality (Koller and Sahami 1996).
As a result, the underlying feature-selection problem is NP-hard and, in general, the strategy of examining all possible subsets is not viable. In the vast majority of cases, however, dependences are relatively simple relationships involving only a small number of features. In addition, feature selection can at any time be restricted to subsets of variables up to a predefined dimensionality. The time complexity is then greatly reduced and the feature selection can be solved in polynomial time. Whether these assumptions apply to a given data set must be examined case by case. However, indicators such as the convergence rate of TCMI toward its maximum value or the estimated strength of the relationships are helpful in exploratory data analysis when searching for the relevant features of a data set.
8 Experiments
To demonstrate the performance of TCMI in different settings, we first consider generated data and show that our method can detect both univariate and multivariate dependences. Then, we discuss applications of TCMI on data sets from the KEEL and UCI Machine Learning Repository (Alcalá-Fdez et al. 2009, 2011; Dua and Graff 2017) and a typical scenario from the materials-science community, namely to predict the crystal structure of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017).
8.1 Case study on generated data
In a number of experiments, we test the theoretical properties of TCMI, i.e., its invariance properties and performance statistics. We also study an illustrative feature-selection task to find a bivariate normal distribution embedded in a multi-dimensional space.
8.1.1 Interpretability of TCMI
In the first experiment, we use TCMI, CMI, MAC, UDS, and MCDE to estimate the dependence between a linear data distribution Y of size \(n = 200\) and different distributions X as features (Table 3). Besides linear, exponential, and constant distributions (zero vector), we consider stepwise distributions generated by discretizing a linear distribution, where each value is repeated 2, 4, and 8 times. Furthermore, we consider uniform (random) and sawtooth distributions with 2, 4, and 8 steps per ramp. The results show that (i) the TCMI score increases nonlinearly with the similarity between a variable and an output, (ii) TCMI is zero for a constant distribution, and (iii) TCMI approaches one for an exact dependence (see also Fig. 3). CMI, MAC, UDS, and MCDE perform similarly well, but they seem to be less sensitive than TCMI in assessing the strength of a mutual dependence. In particular, the strength of a dependence measured with CMI, MAC, UDS, and MCDE does not change with the shape of a distribution (i.e., with different cumulative probability distributions such as the step-like distributions). MCDE does not differentiate between a linear and a constant distribution, while UDS seems to be limited and does not reach the maximum score even in the presence of an exact dependence.
Due to the limited availability of data samples, a random distribution has a higher TCMI, MAC, and MCDE score, i.e., a stronger dependence, than a sawtooth distribution, in agreement with Spearman’s rank coefficient of determination \(\rho ^2\) (Spearman 1904). It should be noted that the baseline adjustment \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0} \rangle \) for a random variable is larger than for any other tested dependence in Table 3. A large baseline adjustment results in smaller TCMI values, so that a random variable is unlikely to be part of any selected feature subset. However, if the dependences are of the same strength as the spurious dependences induced by random variables, TCMI may select variables that are not related to an output.
8.1.2 Properties of the baseline correction term
In the second experiment (Fig. 3), we take a closer look at the baseline-adjustment term, which decreases monotonically with the number of data samples. The baseline adjustment is given by the expected empirical cumulative mutual information (Eqs. 24 and 28). In all our test cases, the expected empirical cumulative mutual information follows a clear downward trend with increasing sample size. For linear dependences, for example, we found that the baseline adjustment roughly follows a \(\langle \hat{{\mathcal {D}}}^{(\prime )}_0(Y; X) \rangle \sim n^{-2/3}\) scaling law that vanishes as \(n \rightarrow \infty \) (Fig. 3). In general, however, the exact scaling behavior varies depending on the presence of duplicate values of each variable in a data set.
8.1.3 Invariance properties of TCMI
In the third experiment, we investigate the invariance properties of TCMI as compared to CMI, MAC, UDS, and MCDE. To this end, we generated random distributions X of different sizes (50, 100, 200, and 500) and reparameterized variables by applying positive monotonic transformations (cf., Sect. 3.2). Table 4 summarizes the results of comparing the dependence scores between a linear distribution and reparameterized variables, e.g., between \(\hat{{\mathcal {D}}}(Y; X)\) and \(\hat{{\mathcal {D}}}(Y; {\mathcal {T}}(X))\), where monotonic transformations \({\mathcal {T}}(X) = a X^k + b\) with \(a, b, k \in {\mathbb {R}}\) and compositions \({\mathcal {T}}(X) = {\mathcal {T}}_1(X) \pm \cdots \pm {\mathcal {T}}_m(X)\) were explored.
By construction, TCMI is invariant under positive monotonic transformations (Eq. 12). Our experiments show that TCMI is indeed both scale and permutation invariant. For CMI, MAC, and UDS, the order of the variables plays a crucial role in determining which permutation of the variable achieves either the highest dependence score (CMI, UDS) or the best discretization (MAC). Hence, deterministic dependence measures such as CMI and UDS with which TCMI is most closely related are neither scale nor permutation invariant. MAC is scale invariant, but not permutation invariant. In contrast, the stochastic dependence measure MCDE is scale and permutation invariant, but only within a probabilistic tolerance (i.e., dependence scores vary between different runs of a program within a certain threshold).
8.1.4 Baseline adjustment of TCMI
Fraction of cumulative information scores against increasing dimensionality for \( \{Y, \vec {X}\}\) using 10, 50, 100, and 500 data samples generated from mutually independent and uniform distributions of size \(\vec {X} = \{Y, X_1, \ldots , X_4 \}\). Contributions of average fraction of total cumulative mutual information, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle \) and \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \) are shown on either side of the plot and the resulting score \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle \) as points. Error bars indicate standard deviations from repeating the experiment 50 times. Since X and Y are independent, average total cumulative mutual information should be constant across subsets of features independent of sample size and subset dimensionality. While \(\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle \) is increasing with the cardinality of the variable feature set and \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \) decreasing, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle \) is approximately constant for a wide range of data samples \(10\ldots 500\) and subset dimensionality \(1\ldots 4\). The crosses represent the deviation of the TCMI from the constant baseline. By enlarging the feature subset with a shuffled version of the same variable, TCMI can be corrected. For comparison the dependence scores for the other investigated measures against increasing dimensionality – cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)) – are also shown
In the fourth experiment, we investigate the necessity of a baseline adjustment to estimate mutual dependences (Sect. 5). To this end, we generated mutually independent and uniform distributions \(Z = \{ Y, X_1, \ldots , X_d \}\) of dimensionality d with sample sizes 10, 50, 100, and 500. We compared TCMI, CMI, MAC, UDS, and MCDE across subsets of variables of different subspace dimensionality while repeating the experiment 50 times. Figure 4 summarizes the results.
By definition, the score of the dependence measures for independent random variables must be zero independent of the sample size (cf., Sect. 5). However, none of the investigated dependence-measure scores are zero for all sample sizes. This is due to the fact that sample data are rarely exactly uniform. In practice, due to random sampling, we expect constant scores approaching \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; \vec {X}) \rangle \rightarrow 0.5\) and \(\hat{{\mathcal {D}}}_\text {MCDE}(Y; \vec {X}) \rightarrow 0.5\) as \(n \rightarrow \infty \), independent of the dimensionality of \(\vec {X} = \{ X_1, \ldots , X_d \}\), in case none of the variables is related to Y, and zero scores for CMI, MAC, and UDS.
Dependence scores of TCMI and MCDE are approximately constant for a wide range of data samples \(10\ldots 500\) and subset dimensionality \(1\ldots 4\) and approach \(\langle {\mathcal {D}}_\text {TCMI}^* \rangle \rightarrow 0.5\) or \(\hat{{\mathcal {D}}}_\text {MCDE} \rightarrow 0.5\) as \(n \rightarrow \infty \) as expected. In contrast, CMI, MAC, and UDS show a clear bias towards larger dependence scores at larger subset cardinalities. Furthermore, their scores are nonzero even at larger sample sizes between mutually independent random variables. In addition, their dependence scores decrease with more data samples, indicating that these measures are unreliable in estimating the strength of mutual dependences.
In comparison to MCDE, CMI, MAC, UDS, and TCMI underestimate dependencies in the one-dimensional case when noise is present in the data. By enlarging the subset with a shuffled version of the same variable, thereby simulating a variable with noise, these measures can be corrected (Fig. 4). As a result, both the corrected version of TCMI and MCDE provide a clear comparison mechanism of dependence scores across different subsets of variables, independent of the number of data samples.
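The shuffled-copy correction can be sketched as a small helper (a hypothetical function, not the authors' code): the shuffled copy keeps the marginal distribution of the variable but is independent of the output, so it behaves like a pure-noise companion variable.

```python
import random

def shuffled_copy(x, seed=0):
    """Return a shuffled copy of x: same marginal distribution, but
    independent of any output, i.e. it mimics a pure-noise variable.
    Augmenting a 1-d subset {X} to {X, shuffled_copy(X)} simulates a
    variable with noise and shifts the score onto the common baseline."""
    rng = random.Random(seed)
    x_shuffled = x[:]
    rng.shuffle(x_shuffled)
    return x_shuffled
```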
8.1.5 Bivariate normal distribution
At last, we consider a simple feature-selection task with known ground truth, namely to find a bivariate normal distribution embedded in a high-dimensional space. For this purpose, we generated a bivariate normal distribution of size \(n = 500\) from features x and y, added further variables drawn from normal, exponential, logistic, triangular, uniform, Laplace, Rayleigh, and Weibull distributions, all with zero mean \(\mu = 0\) and unit scale \(\sigma = 1\), and augmented the feature space as described in Sect. 3.2. In terms of Pearson or Spearman’s correlation coefficient, none of the features have coefficients of determination higher than \(1\%\) with respect to the bivariate normal distribution. Thus, without knowing the ground truth, the data set appears to be uncorrelated. However, since the ground truth is known, there are exactly two features, namely x and y, that completely describe the bivariate normal distribution of the data set.
Subspace search
In order to find the two most relevant features in the high-dimensional data set, a subspace search is performed up to subset dimensionality two. Feature selection is performed for four sets of 50, 100, 200, and 500 data samples (Fig. 5). Results are reported in Table 5.
Overall, almost all dependence measures find at least one of the two relevant features x and y, both, or at least similar distributions, such as the normal distribution. However, scores and subset sizes of relevant features decrease with larger sample sizes for MAC and UDS, while CMI identifies exact dependences even between distributions where no dependence exists, e.g., between a Laplacian and the bivariate normal distribution. In contrast, MCDE robustly finds one of the relevant features x or y, but never finds both to be jointly relevant. TCMI also finds one or both of the two relevant features, but scores and relevance depend more strongly on the sample size. With sample sizes greater than 200, TCMI is the only dependence measure that correctly identifies the optimal feature subset to be \(\{x,y\}\). Still, TCMI scores are lower than those of the other dependence measures, even though the score increases for larger sample sizes.
Bivariate normal probability distribution with mean \(\mu = (0, 0)\) and covariance matrix \(\Sigma = [1, 0.5; 0.5, 1]\). Shown is a scatter plot with 50, 100, 200, and 500 data samples, its cumulative probability distributions, \(P(Z \le z)\), \(Z \in \{ X, Y \}\), and contour lines of equal probability densities \(\in \{ 0.01, 0.02, 0.05, 0.08, 0.13 \}\)
Statistical power analysis
To assess the robustness of the dependence measures, we performed a statistical power analysis for CMI, MAC, UDS, MCDE, and TCMI, adding Gaussian noise with increasing standard deviation \(\sigma \) (Nguyen et al. 2014b, 2016; Fouché and Böhm 2019; Fouché et al. 2021). We considered \(5+1\) noise levels, distributed linearly from 0 to 1, inclusive. We computed the score of the bivariate normal distribution for each dependence measure \(\Lambda \in \){CMI, MAC, UDS, MCDE, TCMI}, i.e., \(\langle \Lambda (Y; X) \rangle _\sigma \), with \(n = 500\) data samples and subset \(\{x,y\}\), and compared it with the score of independently drawn random data samples, \(\langle \Lambda (Y; I) \rangle _0\), of the same size (\(n = 500\)) and dimension (\(d = 1+2\)). The power of a dependence measure \(\Lambda \) was then evaluated as the probability P of a dependence score being larger than the \(\gamma \)-th percentile of the score with respect to independence I,
Essentially, the power of a dependence measure quantifies the contrast, i.e., the difference between dependence X and independence I at noise level \(\sigma \) with \(\gamma \%\) confidence. It is a relative statistical measure and depends on the strength of the dependence. Therefore, dependence strengths that are close to independence are likely to be more sensitive to noise than stronger dependences.
For our experiments, we set \(\gamma = 95\%\) and repeated the experiment 500 times. At each iteration, we shuffled the data samples, computed the scores \(\langle \Lambda (Y; X) \rangle _\sigma \) and \(\langle \Lambda (Y; I) \rangle _0^\gamma \) for every dependence measure at noise level \(\sigma \), and recorded the average and standard deviation of the respective dependence measures. The results of the statistical power analysis, the average score of the dependence measures and independence as well as the contrast are summarized in Fig. 6.
With the exception of MAC, the statistical power of all dependence measures tends to be constant or to decrease with increasing noise level. It is remarkable that MCDE is the only dependence measure that has high statistical power, offers high contrast, and assesses a strong dependence. In particular, the contrast of MCDE remains excellent even at noise levels at which the contrast of TCMI has already degraded. Although MAC and CMI also have high statistical power, their contrasts and dependence scores are low.
While a low contrast introduces difficulties in identifying subsets of related variables to an output, a low dependence score needs to be viewed in terms of the dependence score of all other possible subsets of features: If the subset has the highest score, it is still the subset that is most strongly related to an output given a dependence measure.
In our analysis, MCDE has the highest scores, followed by MAC, TCMI, and CMI. UDS completely fails to detect dependences, in line with previous observations (Fouché and Böhm 2019; Fouché et al. 2021). In general, TCMI depends on the number of samples (Eq. 27), and its contrast generally increases with more data samples. However, TCMI seems to be more sensitive and therefore less robust than the other dependence measures. An in-depth analysis shows that this sensitivity is merely due to the moderate strength of the dependence, as the statistical power is much more robust for stronger dependences in other data sets we tested.
Statistical power analysis with \(95\%\) confidence of dependence measures at different noise levels \(\sigma = 0\ldots 1\): total cumulative mutual information (TCMI), cumulative mutual information (CMI), multivariate maximal correlation analysis (MAC), universal dependency analysis (UDS), and Monte Carlo dependency estimation (MCDE). The diagrams also show the trends in the dependence scores of the optimal feature subset \(\{x,y\}\) of the bivariate normal distribution
8.2 Case study on real-world data
Next, we study selected real-world data sets from the KEEL and UCI Machine Learning Repositories (Alcalá-Fdez et al. 2009, 2011; Dua and Graff 2017), and highlight TCMI on one typical, but by no means exclusive, application from the materials-science community, namely the crystal-structure prediction of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017).
8.2.1 KEEL and UCI regression data sets
We investigate how TCMI and similar dependence measures perform on real-world problems designed for multivariate regression tasks. Unfortunately, the relevant features are not known for every data set in practice. We therefore compare our results against data sets whose relevant features have already been analyzed and reported. All in all, we consider one simulated data set from the KEEL database (Alcalá-Fdez et al. 2009, 2011) and two data sets from the UCI Machine Learning Repository (Dua and Graff 2017):
1. Friedman #1 regression (Friedman 1991)
This data set is used for modeling computer outputs. Inputs \(X_1\) to \(X_5\) are independent features that are uniformly distributed over the interval [0, 1]. The output Y is created according to the formula:
$$\begin{aligned} Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \epsilon , \end{aligned}$$ (40)
where \(\epsilon \) is a standard normal deviate N(0, 1). In addition, the data set has five redundant variables \(X_6\ldots X_{10}\) that are i.i.d. random samples. Further, we enlarge the number of features by adding four variables \(X_{11}\ldots X_{14}\), each very strongly correlated with \(X_1\ldots X_4\) and generated by \(f(x) = x + N(0, 0.01)\).
2. Concrete compressive strength (Yeh 1998)
The aim of this data set is to predict the compressive strength of high performance concrete. Compressive strength is the ability of a material or structure to withstand loads that tend to reduce size. It is a highly nonlinear function of age and ingredients. These ingredients include cement, water, blast furnace slag (a by-product of iron and steel production), fly ash (a coal combustion product), superplasticizer (additive to improve the flow characteristics of concrete), coarse aggregate (e.g., crushed stone or gravel), and fine aggregate (e.g., sand).
3. Forest fires (Cortez and Morais 2007)
This data set focuses on wildfires in the Montesinho Natural Park, which is located at the northern border of Portugal. It includes features such as local coordinates x and y where a fire occurred, the time (day, month, and year), temperature (temp), relative humidity (RH), wind, rain, and derived forest-fire features such as fine-fuel moisture code (FFMC), duff moisture code (DMC), drought code (DC), and initial spread index (ISI) to estimate the propagation speed of fire.
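As an illustration of the first data set above, the Friedman #1 inputs and output can be generated with a short sketch. The uniform noise variables and the reading of N(0, 0.01) as a standard deviation of 0.01 are our assumptions; the reference (Friedman 1991) should be consulted for the exact convention:

```python
import math
import random

def friedman1(n, seed=0):
    """Friedman #1 data (sketch): X1..X5 informative, X6..X10 i.i.d.
    noise, X11..X14 near-duplicates of X1..X4 via f(x) = x + N(0, 0.01),
    with N(0, 0.01) read as standard deviation 0.01 here (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(10)]                 # X1..X10
        row += [row[j] + rng.gauss(0, 0.01) for j in range(4)]  # X11..X14
        X.append(row)
        y.append(10 * math.sin(math.pi * row[0] * row[1])
                 + 20 * (row[2] - 0.5) ** 2
                 + 10 * row[3] + 5 * row[4]
                 + rng.gauss(0, 1))                      # epsilon ~ N(0, 1)
    return X, y
```

A feature-selection method applied to this data should recover \(X_1 \ldots X_5\) (or their near-duplicates \(X_{11} \ldots X_{14}\)) and ignore \(X_6 \ldots X_{10}\).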
For each data set, we performed feature selection using all aforementioned dependence measures (TCMI, CMI, MAC, UDS, MCDE) and compared resulting feature subsets with potentially relevant features reported from the original references. Results are summarized in Table 6.
Our results show that even in the simplest example, the Friedman regression data set, two dependence measures show extreme behavior: UDS selects no variables and MAC selects all variables of the data set, so neither performs any feature selection at all. Both dependence measures not only completely fail to identify the actual dependences of the Friedman regression data set, but also fail on the concrete compressive strength and forest fires data sets. It is therefore likely that these dependence measures report incorrect results on other data sets as well and are thus inappropriate for feature selection and dependence-assessment tasks.
CMI and MCDE partially agree with the potentially relevant features from the respective references; they may therefore be useful when low-dimensional feature subsets need to be identified. In contrast, TCMI effectively selects all relevant variables of the Friedman regression data set. However, TCMI is not free from selecting non-relevant variables in sub-optimal feature subsets, as it reports \(X_7\) or \(X_8\) in the fourth- or fifth-best feature subset. Therefore, dependence scores need to be interpreted relative to the baseline-adjustment term: the lower the dependence scores, the more likely non-relevant variables are in the subsets (cf., Sect. 8.1.1).
The feature subsets found with TCMI for the Friedman regression data set as well as for the concrete compressive strength data set have high dependence scores. They agree well with the relevant features reported in the references, even though TCMI misses slag in the concrete compressive strength example: it is likely that variables such as fine and coarse aggregate or superplasticizer serve as a substitute for slag due to the limited number of data samples. However, we cannot test this assumption, as all data samples were used to compute the dependence scores and no curated test sets are available for further tests.
In the forest-fires data set, temperature and relative humidity as well as duff moisture and drought code are reported not only by TCMI, but also by CMI and MCDE. It is therefore likely that these variables are also relevant for forest-fire predictions, although none of them were mentioned in the reference (Cortez and Morais 2007). Apart from weather conditions, TCMI also includes some of the derived forest-fire variables such as the duff moisture code (DMC) and drought code (DC) – these variables are indirectly related to precipitation and are used to estimate the moisture content of shallow and deep soil layers. Admittedly, the TCMI scores are moderate, which indicates difficulties in assessing the mutual dependences between a set of features and the burnt area of forest fires as a whole. A detailed analysis shows that although forest fires are devastating, they are isolated events – too few to reliably identify the precursors of wildfires from the investigated data set.
8.2.2 Octet-binary compound semiconductors
Our last example is dedicated to a typical, well-characterized, and canonical materials-science problem, namely the prediction of the crystal-structure stability of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017). Octet-binary compound semiconductors are materials consisting of two elements from groups I/VII, II/VI, III/V, or IV/IV, leading to a full valence shell. They can crystallize in rock salt (RS) or zinc blende (ZB) structures, i.e., with either ionic or covalent bonding, and were already studied in the 1970s (Van Vechten 1969; Phillips 1970), followed by further studies (Zunger 1980; Pettifor 1984) and recent work using machine learning (Saad et al. 2012; Ghiringhelli et al. 2015, 2017; Ouyang et al. 2018).
The data set consists of 82 materials with two atomic species in the unit cell. The objective is to accurately predict the energy difference \(\Delta E\) between RS and ZB structures based on 8 electro-chemical atomic properties for each atomic species A/B (in total 16) such as atomic ionization potential \(\text {IP}\), electron affinity \(\text {EA}\), the energies of the highest-occupied and lowest-unoccupied Kohn-Sham levels, \(\text {H}\) and \(\text {L}\), and the expectation value of the radial probability densities of the valence s-, p-, and d-orbitals, \(r_s\), \(r_p\), and \(r_d\), respectively (Ghiringhelli et al. 2015). As a reference, we added Mulliken electronegativity \(\text {EN} = -(\text {IP}+\text {EA})/2\) to the data set and also studied the best two features from the publication (Ghiringhelli et al. 2015)
as known dependences, to show the consistency of the method as well as to probe TCMI with linearly dependent features.
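The reference feature \(\text {EN}\) follows directly from the definition above. A minimal sketch of how such a derived feature can be computed per atomic species (the IP/EA values below are purely illustrative and not taken from the data set):

```python
import numpy as np

def mulliken_en(ip, ea):
    """Mulliken electronegativity EN = -(IP + EA)/2, as defined in the text."""
    return -(np.asarray(ip, dtype=float) + np.asarray(ea, dtype=float)) / 2.0

# Illustrative IP/EA values (eV) for two hypothetical atomic species
ip = np.array([11.3, 8.3])
ea = np.array([-1.2, -0.3])
en = mulliken_en(ip, ea)  # one derived EN value per species
```

In the same way, one EN column per species A and B would be appended to the 16 atomic features of the data set.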
To predict the energy difference \(\Delta E\) between RS and ZB structures, we performed a subspace search with TCMI to identify the subset of features that exhibit the strongest dependence on \(\Delta E\). Results are summarized in Table 7. In total, the strongest dependence on \(\Delta E\) was found with six features from both atomic species, A and B, before TCMI decreased again with seven features.
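The subspace search can be sketched as an exhaustive scan over feature combinations that keeps the best-scoring subset per cardinality. The actual TCMI estimate is not reproduced here; `score` is a placeholder for it, and `toy_score` below is a purely illustrative stand-in:

```python
from itertools import combinations

def best_subsets(features, score, max_card):
    """Exhaustively score all feature subsets up to max_card and
    return the best-scoring subset for each cardinality."""
    best = {}
    for d in range(1, max_card + 1):
        subset = max(combinations(features, d), key=score)
        best[d] = (subset, score(subset))
    return best

# Toy stand-in for a TCMI-like dependence score (illustrative only):
# subsets containing 'a' and 'b' score highest.
toy_score = lambda s: ('a' in s) + ('b' in s) + 0.1 * len(s)
result = best_subsets(['a', 'b', 'c'], toy_score, 2)  # best 1- and 2-subsets
```

In practice this scan is pruned, e.g., by branch and bound, so that only a fraction of the search space needs to be evaluated (cf., Sect. 9).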
Results reveal that several feature subsets are found to be optimal among different cardinalities. We note that TCMI never selects Mulliken electronegativity \(\text {EN}\) together with either electron affinity \(\text {EA}\) or ionization potential \(\text {IP}\) for the same atomic species. We also note that \(\text {EN}\) can be replaced by \(\text {IP}\) (see bold feature subsets in Table 7). However, \(\text {EN}\) cannot be replaced by \(\text {EA}\): \(\text {EN}\) is more strongly linearly correlated with \(\text {IP}\) than with \(\text {EA}\), so substituting \(\text {EA}\) results in slightly smaller TCMI values (by at least 0.02 for the optimal subsets; not shown in the table). These results therefore corroborate not only the functional relationship between \(\text {EN}\), \(\text {IP}\), and \(\text {EA}\), but also the consistency of TCMI.
Furthermore, TCMI indicates that features, like the atomic radii \(r_s(B)\) and \(r_p(B)\) or the energies \(\text {EN}(B)\), \(\text {H}(B)\), and \(\text {IP}(B)\) of group IV to VIII elements, can be used interchangeably without reducing the dependence scores. Indeed, by assessing dependences between pairwise feature combinations, TCMI identifies \(r_s(B)\) and \(r_p(B)\) to be strongly dependent, as are \(\text {EN}(B)\), \(\text {H}(B)\), and \(\text {IP}(B)\), consistent with bivariate correlation measures such as Pearson or Spearman. In numbers, the Pearson coefficients of determination (\(r^2\)) between the atomic radii \(r_s\) and \(r_p\) are \(r^2(r_s(A), r_p(A)) = 0.94\) and \(r^2(r_s(B), r_p(B)) = 0.99\), and those between Mulliken electronegativity and the ionization potential or the highest-occupied Kohn-Sham level are \(r^2(\text {EN}(B), \text {IP}(B)) = 0.96\) and \(r^2(\text {EN}(B), \text {H}(B)) = 0.99\), respectively. These findings illustrate that TCMI assigns similar scores to collinear features.
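Such collinearity checks reduce to computing pairwise \(r^2\) values between feature columns. A small sketch with synthetic stand-ins for \(r_s(B)\) and \(r_p(B)\) (the generated values are illustrative, not the actual atomic radii):

```python
import numpy as np

def r_squared(x, y):
    """Pearson coefficient of determination r^2 between two feature columns."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

# Two nearly collinear columns as stand-ins for r_s(B) and r_p(B);
# 82 synthetic samples, matching the size of the data set.
rng = np.random.default_rng(0)
rs = rng.uniform(0.5, 2.0, size=82)
rp = 1.3 * rs + rng.normal(scale=0.01, size=82)  # near-linear relation
r2 = r_squared(rs, rp)  # close to 1 for collinear features
```

Feature pairs with \(r^2\) close to 1 can then be treated as interchangeable, as observed for the subsets in Table 7.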
Features \(D_1\) and \(D_2\) (Eq. 41) from the reference (Ghiringhelli et al. 2015) are combinations of atomic properties that best represent \(\Delta E\) linearly.
As such, they incorporate knowledge that generally leads to higher TCMI scores for the same feature-subset cardinality. While this applies to the first and second subset dimensions, feature subsets with the aforementioned features \(D_1\), \(D_2\) are on par with feature subsets based on atomic properties at higher dimensions. However, \(D_1\) and \(D_2\) are not selected consistently by TCMI because TCMI does not make any assumption about the linearity of the dependency \((D_1, D_2) \mapsto \Delta E\). This suggests that the linear combination of \(D_1\) and \(D_2\) is a good, but not complete, description of the energy difference \(\Delta E\).
Feature spaces of the topmost selected feature subsets for one (left) and two dimensions (right). Shown are the two classes of crystal-lattice structures as diamonds (zinc blende) and squares (rock salt), their distribution, and the trend line/manifold in the prediction of the energy difference \(\Delta E\) between rock salt and zinc blende. The trend line/manifold was computed with the gradient boosting decision tree algorithm (Friedman 2001) and 10-fold cross validation. For reference, some octet-binary compound semiconductors are labeled
A visualization of the relevant subsets also reveals clear monotonic relationships in one and two dimensions (Fig. 7). In addition, we constructed machine-learning models for each feature subset and report model statistics for the prediction of \(\Delta E\) along with statistics of the full feature set (Table 7). The details can be found in the appendix. We partitioned the data set into \(k = 10\) groups (so-called folds) and generated \(k\) machine-learning models, using nine folds to train each model and the remaining fold to test it (10-fold cross validation). To reduce variability, we performed five rounds of cross validation with different partitions and averaged over the rounds to obtain an estimate of the model’s predictive performance. For the machine-learning models we used the gradient boosting decision tree algorithm (GBDT) (Friedman 2001). GBDT is resilient to feature scaling (Eq. 12), just like TCMI, and is one of the best available, award-winning, and most versatile machine-learning algorithms for classification and regression (Natekin and Knoll 2013; Fernández-Delgado et al. 2014; Couronné et al. 2018). Notwithstanding this, traditional methods sensitive to feature scaling may show superior performance for data sets with sample sizes larger than the number of considered features (Lu and Petkova 2014) (compare also the model performances in Table 7 with the references (Ghiringhelli et al. 2015; Ouyang et al. 2018; Ghiringhelli et al. 2017)).
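The validation protocol described above can be sketched with scikit-learn. The exact GBDT implementation and hyperparameters used in this work are not reproduced here, so scikit-learn's `GradientBoostingRegressor` with default settings and synthetic data of the same shape serve as assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data with the same shape as the data set:
# 82 samples, 16 features (values illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(82, 16))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=82)

# Five rounds of 10-fold cross validation with different partitions;
# the averaged score estimates the model's predictive performance.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                         scoring="neg_root_mean_squared_error", cv=cv)
rmse = -scores.mean()  # averaged RMSE over all 50 folds
```

The same loop, applied once per candidate feature subset, yields the per-subset model statistics reported in Table 7.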
Machine-learning models are designed to improve with more data and with a feature subset that best represents the data for the machine-learning algorithm (Friedman 2001; James et al. 2013). Therefore, we expect a general trend of higher model performance with larger feature-subset cardinalities. Furthermore, we do not expect the optimal feature subset of TCMI to perform best for every machine-learning model (“No free lunch” theorem: Wolpert (1996b, 1996a), Wolpert and Macready (1995, 1997)), as an optimal feature subset identified by the feature-selection criterion TCMI may not be the same according to other evaluation criteria such as root-mean-squared error (RMSE), mean absolute error (MAE), maximum absolute error (MaxAE), or the Pearson coefficient of determination (\(r^2\)). This fact is evident in our analysis. The choice of GBDT may not be optimal because its predictive performance generally decreases with the number of variables (compare the model performance with all 16 variables to a subset with two or four variables, Table 7). However, to the best of our knowledge, there is no other machine-learning algorithm that models data without making assumptions about the functional form of dependence, is independent of an intrinsic metric, and can operate on a small number of data samples. Therefore, we focus only on the predictive performance of the found subsets compared to the predictive performance of the identified features with respect to all variables in the data set (Table 7).
Results confirm the general trend of higher model performance with larger feature-subset cardinalities and show that the initial set of 16 variables can be reduced to 6 variables without decreasing model performance. Essentially, feature subsets with three to four variables are already as good as a machine-learning model with all 16 variables, where the large number of variables already starts to degrade the prediction performance of the GBDT model. The overall performance gradually increases with the subset cardinality. However, our analysis identifies significant variability in performance, with a higher standard deviation for feature subsets with smaller dependence scores than for those with larger ones.
An exhaustive search for the best GBDT model yields an optimum of seven features to best predict the energy difference between rock salt and zinc blende crystal structures, with \(D_1\) and \(D_2\) neglected.
In contrast to the optimal feature subsets of TCMI (cf., Table 7), the optimal GBDT feature set is a variation of optimal feature subsets of TCMI with highest-occupied Kohn-Sham level and ionization potential interchanged, \(\text {H}(A) \leftrightarrow \text {IP}(A)\), and lowest-unoccupied Kohn-Sham level, \(\text {L}(B)\), missing. Model performances demonstrate that the optimal feature subsets of TCMI are close to the model’s optimum and corroborate the usefulness of TCMI in finding relevant feature subsets for machine-learning predictions. Slight differences in performances are mainly due to the variances of the cross-validation procedure and the small number of 82 data samples, which effectively limited the reliable identification of larger feature subsets in the case of TCMI (Table 5).
9 Discussion
Although TCMI is a non-parametric, robust, and deterministic measure, its biggest limitation is its computational complexity. For small data sets (\(n < 500\)) and feature subsets (\(d < 5\)), feature selection finishes in minutes to hours on a modern computer. For larger data sets, however, TCMI scales with \({\mathcal {O}}(n^d)\) and quickly exceeds any realizable runtime. Furthermore, the search for the optimal feature subset also needs to be improved. Even though in our analysis only a fraction of less than one percent of the possible search space had to be evaluated, TCMI was evaluated hundreds of thousands of times. Future research towards pairwise evaluations (Peng et al. 2005), Monte Carlo sampling (Fouché and Böhm 2019; Fouché et al. 2021), or gradual evaluation of features based on iterative refinement strategies of sampling will show to what extent the computational costs of TCMI can be reduced.
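The impact of the \({\mathcal {O}}(n^d)\) scaling can be made concrete with a back-of-envelope count of elementary evaluations (the absolute numbers are only illustrative; constants and per-evaluation costs are ignored):

```python
# Back-of-envelope illustration of the O(n^d) scaling quoted above:
# count of elementary evaluations for n samples and subset cardinality d.
def evaluations(n, d):
    return n ** d

small = evaluations(500, 4)    # at the upper edge of the tractable regime
large = evaluations(5000, 4)   # 10x the samples at the same cardinality
growth = large // small        # -> 10**4 times the work
```

A tenfold increase in sample size at fixed cardinality \(d = 4\) thus multiplies the work by four orders of magnitude, which is why the quoted limits on \(n\) and \(d\) matter in practice.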
A further limitation is that non-relevant variables may be selected in the optimal feature subsets when only a limited number of data points is available (cf., Sect. 8.1.5). By construction, the identification of feature subsets depends on the feature-selection search strategy (cf., Sect. 1). The results show that it is critical to use optimal search strategies, because sub-optimal search strategies can report subsets of features that are not related to an output. Even though the exhaustive search for feature subsets is computationally intensive, it can be implemented efficiently, e.g., by using the branch-and-bound algorithm. In our implementation, the branch-and-bound algorithm was used to search for optimal, i.e., minimal non-redundant, feature subsets. However, as our results demonstrate, different feature subsets with few or no common features may lead to similar dependence scores. The main rationale for this outcome is that the features may be correlated with each other and therefore contain redundant information about dependences. Including these redundant features is likely to lead to a higher stability of the method, more consistent results, and better insights into the actual dependence. If a machine-learning algorithm is given, the best option at present is to generate predictive models for each of the found feature subsets and select the one that works best.
10 Conclusions
We constructed a non-parametric and deterministic dependence measure based on cumulative probability distributions (Rao et al. 2004; Rao 2005) to propose fraction of cumulative mutual information \({\mathcal {D}}(Y; \vec {X})\), an information-theoretic divergence measure to quantify dependences of multivariate continuous distributions. Our measure can be directly estimated from sample data using well-defined empirical estimates (Sect. 3). Fraction of cumulative mutual information quantifies dependences without breaking the permutation invariance of feature exchanges, i.e., \({\mathcal {D}}(Y; \vec {X}) = {\mathcal {D}}(Y; \vec {X}')\) for all \(\vec {X}' \in {\text {perm}}(\vec {X})\), while being invariant under positive monotonic transformations. Measures based on mutual information are monotonically increasing with respect to the cardinality of feature subsets and the sample size. To turn fraction of cumulative mutual information into a convex measure, we related the strength of a dependence to the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010). We further constructed a measure based on residual cumulative probability distributions and introduced total cumulative mutual information \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; \vec {X}) \rangle \).
Tests with simulated and real data corroborate that total cumulative mutual information is capable of identifying relevant features of linear and nonlinear dependences. The main application of total cumulative mutual information is to assess dependences, to reduce an initial set of variables before processing scientific data, and to identify relevant subsets of variables that jointly have the largest mutual dependence and minimum redundancy with respect to an output. The computational cost of total cumulative mutual information, however, still grows exponentially and can thus outweigh its potential benefits. In future work, we will address the performance issues of TCMI and the stability of identified feature subsets, and provide a feature-selection framework that is also suitable for discrete, continuous, and mixed data types. We will also apply TCMI to current problems in the physical sciences, with a practical focus on the identification of feature subsets to simplify subsequent data-analysis tasks.
Since total cumulative mutual information identifies dependences with strong mutual contributions, it is applicable to a wide range of problems directly operating on multivariate continuous data distributions. In particular, it does not need to quantize variables by using probability density estimation, clustering, or discretization prior to estimating the mutual dependence between variables. Thus, total cumulative mutual information has the potential to promote an information-theoretic understanding of functional dependences in different research areas and to gain more insights from data.
Supplementary information. We implemented total cumulative mutual information in Python. Our Python-based implementation is part of B.R.’s doctoral thesis and is made publicly available under the Apache License 2.0.
All data and scripts involved in producing the results can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.6577261). An online tutorial to reproduce the main results presented in this work can also be found on GitHub (https://github.com/benjaminregler/tcmi) or in the NOMAD Analytics Toolkit (https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/tcmi.ipynb).
References
Alcalá-Fdez J, Sánchez L, García S et al (2009) Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318. https://doi.org/10.1007/s00500-008-0323-y
Alcalá-Fdez J, Fernandez A, Luengo J et al (2011) Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J Multi-Valued Log Soft Comput 17(2–3):255–287
Almuallim H, Dietterich TG (1994) Learning boolean concepts in the presence of many irrelevant features. Artif Intell 69(1):279–305. https://doi.org/10.1016/0004-3702(94)90084-1
Arauzo-Azofra A, Benitez JM, Castro JL (2008) Consistency measures for feature selection. J Intell Inf Syst 30(3):273–292. https://doi.org/10.1007/s10844-007-0037-0
Basseville M (1989) Distance measures for signal processing and pattern recognition. Signal Process 18(4):349–369. https://doi.org/10.1016/0165-1684(89)90079-0
Belghazi MI, Baratin A, Rajeshwar S et al (2018) Mutual information neural estimation. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 80. PMLR, Stockholm, Sweden, pp 531–540, https://proceedings.mlr.press/v80/belghazi18a.html
Bellman R (1957) Dynamic Programming. Princeton University Press, New Jersey, USA, https://press.princeton.edu/books/paperback/9780691146683/dynamic-programming
Bennasar M, Hicks Y, Setchi R (2015) Feature selection using joint mutual information maximisation. Expert Syst Appl 42(22):8520–8532. https://doi.org/10.1016/j.eswa.2015.07.007
Bernacchia A, Pigolotti S (2011) Self-consistent method for density estimation. J R Stat Soc: Ser B (Statistical Methodology) 73(3):407–422. https://doi.org/10.1111/j.1467-9868.2011.00772.x
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1):245–271. https://doi.org/10.1016/S0004-3702(97)00063-5
Breiman L, Friedman J, Stone CJ et al (1984) Classification and regression trees. Chapman and Hall/CRC, Florida, USA. https://doi.org/10.1201/9781315139470
Cantelli FP (1933) Sulla determinazione empirica delle leggi di probabilita. Giorn Ist Ital Attuari 4:421–424
Chow TWS, Huang D (2005) Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Trans Neural Networks 16(1):213–224. https://doi.org/10.1109/TNN.2004.841414
Clausen J (1999) Branch and bound algorithms – principles and examples. Tech. rep., Department of Computer Science, University of Copenhagen, Universitetsparken 1, DK2100 Copenhagen, Denmark
Coombs C, Dawes R, Tversky A (1970) Mathematical Psychology: An Elementary Introduction. Prentice-Hall, Englewood Cliffs, NJ
Cortez P, Morais A (2007) A data mining approach to predict forest fires using meteorological data. In: Neves J, Santos MF, Machado J (eds) New Trends in Artificial Intelligence. Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, Guimaraes, Portugal, pp 512–523, https://hdl.handle.net/1822/8039
Couronné R, Probst P, Boulesteix AL (2018) Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinform 19(1):270. https://doi.org/10.1186/s12859-018-2264-5
Cover TM, Thomas JA (2006) Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing, Wiley-Interscience, New York, USA, https://doi.org/10.1002/047174882X
Crescenzo AD, Longobardi M (2009) On cumulative entropies. J Stat Plan Inference 139(12):4072–4087. https://doi.org/10.1016/j.jspi.2009.05.038
Crescenzo AD, Longobardi M (2009b) On cumulative entropies and lifetime estimations. In: Mira J, Ferrández JM, Álvarez JR, et al (eds) Methods and Models in Artificial and Natural Computation. A Homage to Professor Mira’s Scientific Legacy: Third International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2009, Santiago de Compostela, Spain, June 22-26, 2009, Proceedings, Part I. Springer, Berlin, Heidelberg, pp 132–141, https://doi.org/10.1007/978-3-642-02264-7_15
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell SJ (eds) Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann, San Francisco, USA, pp 194–202. https://doi.org/10.1016/B978-1-55860-377-6.50032-3
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Dutta M (1966) On maximum (information-theoretic) entropy estimation. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 28(4):319–328. https://www.jstor.org/stable/25049432
Eberhart R, Kennedy J (1995) A new optimizer using particle swarm theory. In: MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp 39–43, https://doi.org/10.1109/MHS.1995.494215
Estevez PA, Tesmer M, Perez CA et al (2009) Normalized mutual information feature selection. IEEE Trans Neural Networks 20(2):189–201. https://doi.org/10.1109/TNN.2008.2005601
Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th Int. Joint Conference on Artificial Intelligence. Morgan Kaufmann, Chambery, France, pp 1022–1027
Fernández-Delgado M, Cernadas E, Barro S et al (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15(1):3133–3181. https://jmlr.org/papers/v15/delgado14a.html
Forsati R, Moayedikia A, Safarkhani B (2011) Heuristic approach to solve feature selection problem. In: Cherifi H, Zain JM, El-Qawasmeh E (eds) Digital Information and Communication Technology and Its Applications. Springer, Berlin, Heidelberg, pp 707–717. https://doi.org/10.1007/978-3-642-22027-2_59
Fouché E, Böhm K (2019) Monte carlo dependency estimation. In: Proceedings of the 31st International Conference on Scientific and Statistical Database Management. ACM, New York, NY, USA, SSDBM ’19, pp 13–24, https://doi.org/10.1145/3335783.3335795
Fouché E, Mazankiewicz A, Kalinke F et al (2021) A framework for dependency estimation in heterogeneous data streams. Distributed and Parallel Databases 39(2):415–444. https://doi.org/10.1007/s10619-020-07295-x
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67. https://doi.org/10.1214/aos/1176347963
Friedman JH (2001) Greedy function approximation: A gradient boosting machine. Ann Stat 29(5):1189–1232. https://www.jstor.org/stable/2699986
Garcia D (2010) Robust smoothing of gridded data in one and higher dimensions with missing values. Comput Stat & Data Analysis 54(4):1167–1178. https://doi.org/10.1016/j.csda.2009.09.020
Ghiringhelli LM, Vybiral J, Levchenko SV et al (2015) Big data of materials science: Critical role of the descriptor. Phys Rev Lett 114(10):105503. https://doi.org/10.1103/PhysRevLett.114.105503
Ghiringhelli LM, Vybiral J, Ahmetcik E et al (2017) Learning physical descriptors for materials science by compressed sensing. New J Phys 19(2):023017. https://doi.org/10.1088/1367-2630/aa57bf
Glivenko V (1933) Sulla determinazione empirica delle leggi di probabilita. Giorn Ist Ital Attuari 4:92–99
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182. https://doi.org/10.5555/944919.944968
Hey T, Tansley S, Tolle K (2009) The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Washington, USA, https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/
Hu Q, Zhang L, Zhang D et al (2011) Measuring relevance between discrete and continuous features based on neighborhood mutual information. Expert Syst Appl 38(9):10737–10750. https://doi.org/10.1016/j.eswa.2011.01.023
James G, Witten D, Hastie T et al (2013) An Introduction to Statistical Learning, Springer Texts in Statistics, vol 103. Springer, New York, https://doi.org/10.1007/978-1-4614-7138-7
Ke G, Meng Q, Finley T et al (2017) Lightgbm: A highly efficient gradient boosting decision tree. In: Guyon I, Luxburg UV, Bengio S, et al (eds) Advances in Neural Information Processing Systems 30. Curran Associates, Inc., New York, USA, p 3146–3154, http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
Keller F, Muller E, Bohm K (2012) Hics: High contrast subspaces for density-based outlier ranking. In: 28th IEEE International Conference on Data Engineering, Washington, USA, pp 1037–1048, https://doi.org/10.1109/ICDE.2012.88
Khaire UM, Dhanalakshmi R (2019) Stability of feature selection algorithm: A review. J King Saud University - Comput Inf Sci 34(4):1060–1073. https://doi.org/10.1016/j.jksuci.2019.06.012
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324. https://doi.org/10.1016/S0004-3702(97)00043-X
Koller D, Sahami M (1996) Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, pp 284–292, http://ilpubs.stanford.edu:8090/208/
Koza JR (1994) Genetic programming as a means for programming computers by natural selection. Stat Comput 4(2):87–112. https://doi.org/10.1007/BF00175355
Kozachenko LF, Leonenko NN (1987) Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23(2):9–16. http://mi.mathnet.ru/eng/ppi/v23/i2/p9
Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69(6):066,138. https://doi.org/10.1103/PhysRevE.69.066138
Kullback S (1959) Information Theory and Statistics. John Wiley and Sons, New York
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86. https://www.jstor.org/stable/2236703
Kwak N, Choi C-H (2002) Input feature selection by mutual information based on parzen window. IEEE Trans Pattern Anal Mach Intell 24(12):1667–1671. https://doi.org/10.1109/TPAMI.2002.1114861
Lancaster HO (1969) The Chi-squared Distribution. Wiley & Sons Inc, New York
Land AH, Doig AG (1960) An automatic method of solving discrete programming problems. Econom 28(3):497–520. https://doi.org/10.2307/1910129
Lu F, Petkova E (2014) A comparative study of variable selection methods in the context of developing psychiatric screening instruments. Stat Med 33(3):401–421. https://doi.org/10.1002/sim.5937
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., New York, USA, NIPS’17, p 4768-4777, https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
Mandros P, Boley M, Vreeken J (2017) Discovering reliable approximate functional dependencies. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD ’17, pp 355–363, https://doi.org/10.1145/3097983.3098062
Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inf Theory 9(1):11–17. https://doi.org/10.1109/TIT.1963.1057810
McGill WJ (1954) Multivariate information transmission. Psychom 19(2):97–116. https://doi.org/10.1007/BF02289159
Michalewicz Z, Fogel DB (2004) How to Solve It: Modern Heuristics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-07807-5
Mira C (2007) Noninvertible maps. Scholarpedia 2(9):2328. https://doi.org/10.4249/scholarpedia.2328
Modrzejewski M (1993) Feature selection using rough sets theory. In: Brazdil PB (ed) Machine Learning: ECML-93. Springer, Berlin, Heidelberg, pp 213–226. https://doi.org/10.1007/3-540-56602-3_138
Morrison DR, Jacobson SH, Sauppe JJ et al (2016) Branch-and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discret Optim 19:79–102. https://doi.org/10.1016/j.disopt.2016.01.005
Narendra PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. IEEE Trans Comput C–26(9):917–922. https://doi.org/10.1109/TC.1977.1674939
Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Front Neurorobot 7:21–21. https://doi.org/10.3389/fnbot.2013.00021
Nguyen HV, Müller E, Vreeken J et al (2013) CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection, Proceedings of the 2013 SIAM International Conference on Data Mining (SDM), Austin, Texas, USA, chap 21, pp 198–206. https://doi.org/10.1137/1.9781611972832.22
Nguyen HV, Müller E, Vreeken J et al (2014) Unsupervised interaction-preserving discretization of multivariate data. Data Min Knowl Disc 28(5):1366–1397. https://doi.org/10.1007/s10618-014-0350-5
Nguyen HV, Müller E, Vreeken J, et al (2014b) Multivariate maximal correlation analysis. In: Jebara T, Xing EP (eds) Proceedings of the 31st International Conference on Machine Learning (ICML-14), vol 32. JMLR Workshop and Conference Proceedings, Beijing, China, pp 775–783, https://proceedings.mlr.press/v32/nguyenc14.html
Nguyen HV, Mandros P, Vreeken J (2016) Universal dependency analysis. In: Proceedings of the 2016 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, Florida, USA, pp 792–800. https://doi.org/10.1137/1.9781611974348.89
O’Brien TA, Collins WD, Rauscher SA et al (2014) Reducing the computational cost of the ECF using a nufft: A fast and objective probability density estimation method. Comput Stat & Data Analysis 79:222–234. https://doi.org/10.1016/j.csda.2014.06.002
O’Brien TA, Kashinath K, Cavanaugh NR et al (2016) A fast and objective multidimensional kernel density estimation method: fastkde. Comput Stat & Data Analysis 101:148–160. https://doi.org/10.1016/j.csda.2016.02.014
Ouyang R, Curtarolo S, Ahmetcik E et al (2018) SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys Rev Materials 2(8):083802. https://doi.org/10.1103/PhysRevMaterials.2.083802
Pearson K (1896) Mathematical contributions to the theory of evolution. iii. regression, heredity, and panmixia. Philos Trans R Soc Lond Ser A 187:253–318. https://doi.org/10.1098/rsta.1896.0007
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238. https://doi.org/10.1109/TPAMI.2005.159
Pettifor D (1984) A chemical scale for crystal-structure maps. Solid State Commun 51(1):31–34. https://doi.org/10.1016/0038-1098(84)90765-8
Pfitzner D, Leibbrandt R, Powers D (2008) Characterization and evaluation of similarity measures for pairs of clusterings. Knowl Inf Syst 19(3):361. https://doi.org/10.1007/s10115-008-0150-6
Phillips JC (1970) Ionicity of the chemical bond in crystals. Rev Mod Phys 42(3):317–356. https://doi.org/10.1103/RevModPhys.42.317
Press WH, Flannery BP, Teukolsky SA et al (1988) Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge. https://doi.org/10.1137/1031025
Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125. https://doi.org/10.1016/0167-8655(94)90127-9
Pudil P, Novovičová J, Somol P (2002) Recent Feature Selection Methods in Statistical Pattern Recognition. Springer, Boston, MA, pp 565–615. https://doi.org/10.1007/978-1-4613-0231-5_23
Rao M (2005) More on a new concept of entropy and information. J Theor Probab 18(4):967–981. https://doi.org/10.1007/s10959-005-7541-3
Rao M, Chen Y, Vemuri BC et al (2004) Cumulative residual entropy: a new measure of information. IEEE Trans Inf Theory 50(6):1220–1228. https://doi.org/10.1109/TIT.2004.828057
Reimherr M, Nicolae DL (2013) On quantifying dependence: A framework for developing interpretable measures. Stat Sci 28(1):116–130. https://doi.org/10.1214/12-STS405
Reshef DN, Reshef YA, Finucane HK et al (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524. https://doi.org/10.1126/science.1205438
Reunanen J (2006) Search Strategies. Springer, Berlin, Heidelberg, pp 119–136. https://doi.org/10.1007/978-3-540-35488-8_5
Romano S, Bailey J, Nguyen V et al (2014) Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In: Jebara T, Xing EP (eds) Proceedings of the 31st International Conference on Machine Learning (ICML-14), vol 32. JMLR Workshop and Conference Proceedings, Beijing, China, pp 1143–1151, https://proceedings.mlr.press/v32/romano14.html
Romano S, Vinh NX, Bailey J et al (2016) A framework to adjust dependency measure estimates for chance. In: Proceedings of the 2016 SIAM International Conference on Data Mining, pp 423–431, https://doi.org/10.1137/1.9781611974348.48
Rossi RJ (2018) Mathematical Statistics: An Introduction to Likelihood Based Inference. Wiley, New Jersey, USA, https://www.wiley.com/en-us/MathematicalStatistics:AnIntroductiontoLikelihoodBasedInference-p-9781118771044
Saad Y, Gao D, Ngo T et al (2012) Data mining for materials: Computational experiments with \(AB\) compounds. Phys Rev B 85(10):104104. https://doi.org/10.1103/PhysRevB.85.104104
Schmid F, Schmidt R (2007) Multivariate extensions of Spearman's rho and related statistics. Stat & Probab Lett 77(4):407–416. https://doi.org/10.1016/j.spl.2006.08.007
Scott DW (1982) Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York. https://doi.org/10.1002/9780470316849
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Shannon CE, Weaver W (1949) The Mathematical Theory of Communication, vol III. Illinois Press, Illinois, USA
Siedlecki W, Sklansky J (1993) On automatic feature selection. World Scientific, Singapore, New Jersey, London, Hong Kong, pp 63–87. https://doi.org/10.1142/9789814343138_0004
Silverman BW (1986) Density Estimation for Statistics and Data Analysis, vol 1. Chapman and Hall/CRC, New York. https://doi.org/10.1201/9781315140919
Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101. https://doi.org/10.2307/1412159
Székely GJ, Rizzo ML (2014) Partial distance correlation with methods for dissimilarities. Ann Stat 42(6):2382–2412. https://doi.org/10.1214/14-AOS1255
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–2794. https://doi.org/10.1214/009053607000000505
Van Vechten JA (1969) Quantum dielectric theory of electronegativity in covalent systems. i. electronic dielectric constant. Phys Rev 182(3):891–905. https://doi.org/10.1103/PhysRev.182.891
Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186. https://doi.org/10.1007/s00521-013-1368-0
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM, New York, NY, USA, ICML ’09, pp 1073–1080, https://doi.org/10.1145/1553374.1553511
Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854. https://doi.org/10.1145/1553374.1553511
Wang F, Vemuri BC, Rao M et al (2003) A New & Robust Information Theoretic Measure and Its Application to Image Alignment. Springer, Berlin, Heidelberg, pp 388–400. https://doi.org/10.1007/978-3-540-45087-0_33
Wang Y, Romano S, Nguyen V et al (2017) Unbiased multivariate correlation analysis. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), https://ojs.aaai.org/index.php/AAAI/article/view/10778
Watanabe S (1960) Information theoretical analysis of multivariate correlation. IBM J Res Dev 4(1):66–82. https://doi.org/10.1147/rd.41.0066
White JV, Steingold S, Fournelle C (2004) Performance metrics for group-detection algorithms. In: Said YH, Marchette DJ, Solka JL (eds) Computing Science and Statistics: Computational Biology and Informatics - Proceedings of the 36th Symposium on the Interface, Baltimore, Maryland, https://www.interfacesymposia.org/I04/I2004Proceedings/WhiteJim/WhiteJim.paper.pdf
Whitney AW (1971) A direct method of nonparametric measurement selection. IEEE Trans Comput C–20(9):1100–1103. https://doi.org/10.1109/T-C.1971.223410
Wolpert DH (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8(7):1391–1420. https://doi.org/10.1162/neco.1996.8.7.1391
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390. https://doi.org/10.1162/neco.1996.8.7.1341
Wolpert DH, Macready WG (1995) No free lunch theorems for search. Technical Report SFI-TR-95-02-010 10, Santa Fe Institute, https://www.santafe.edu/research/results/working-papers/no-free-lunch-theorems-for-search
Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82. https://doi.org/10.1109/4235.585893
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193. https://doi.org/10.1007/s40745-015-0040-1
Yeh IC (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cem Concr Res 28(12):1797–1808. https://doi.org/10.1016/S0008-8846(98)00165-3
Yu B, Yuan B (1993) A more efficient branch and bound algorithm for feature selection. Pattern Recogn 26(6):883–889. https://doi.org/10.1016/0031-3203(93)90054-Z
Yu S, Príncipe JC (2019) Simple stopping criteria for information theoretic feature selection. Entropy 21(1):99. https://doi.org/10.3390/e21010099
Zheng Y, Kwoh CK (2011) A feature subset selection method based on high-dimensional mutual information. Entropy 13(4):860–901. https://doi.org/10.3390/e13040860
Zunger A (1980) Systematization of the stable crystal structure of all \({\rm AB}\)-type binary compounds: A pseudopotential orbital-radii approach. Phys Rev B 22(12):5839–5872. https://doi.org/10.1103/PhysRevB.22.5839
Acknowledgements
This research received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation program (Grant Agreement No. 676580: the NOMAD Laboratory and European Center of Excellence and Grant Agreement No. 740233: TEC1p) and from BiGmax, the Max Planck Society’s Research Network on Big-Data-Driven Materials Science. B.R. acknowledges financial support from the Max Planck Society. L.M.G. acknowledges support from Berlin Big-Data Center (Grant Agreement No. 01IS14013E). The authors thank J. Vreeken, M. Boley and P. Mandros for inspiring discussions and for carefully reading the manuscript. The authors would also like to thank the two reviewers for suggestions to significantly improve the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Responsible editor: Johannes Fürnkranz.
Supplementary Information
Appendices
Appendix A Baseline adjustment
Dependence measures that assign stronger dependence to larger feature subsets, irrespective of the underlying relationship, are considered biased (Vinh et al. 2009). To compare dependence scores across subsets of variables of different cardinality, the measures therefore need to be adjusted. Baseline adjustment eliminates this inherent bias, so that the dependence measure becomes constant under the assumption that the random variables are independent. Baseline adjustment was discussed for mutual information in previous work (Vinh et al. 2009, 2010; Romano et al. 2014; Mandros et al. 2017). Following the notation of Vinh et al. (2009), we derive the baseline adjustment for cumulative mutual information.
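In practice, the baseline of any dependence measure can be approximated by a Monte-Carlo version of this adjustment: permute one variable repeatedly, average the scores of the permuted data, and subtract this average from the observed score. Below is a minimal sketch using scikit-learn's discrete mutual information merely as a stand-in dependence measure; the data and sample sizes are illustrative and not from the paper.

```python
# Monte-Carlo illustration of permutation-based baseline adjustment:
# the baseline of a dependence measure is its average over random
# permutations of one variable, subtracted from the observed value.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=500)             # discretized feature (illustrative)
y = (x + rng.integers(0, 2, size=500)) % 4   # noisy, dependent target

observed = mutual_info_score(y, x)
# Average dependence under the permutation model of randomness.
baseline = np.mean([mutual_info_score(rng.permutation(y), x)
                    for _ in range(200)])

adjusted = observed - baseline               # baseline-adjusted dependence
print(f"observed={observed:.3f}, baseline={baseline:.3f}, adjusted={adjusted:.3f}")
```

The closed-form expectation derived in this appendix replaces the Monte-Carlo average, which makes the adjustment deterministic and avoids the sampling cost.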
A common model of randomness is the hypergeometric model (also called permutation model) (Lancaster 1969; Vinh et al. 2009; Romano et al. 2014). It uniformly and randomly generates m distinct permutations of pairs M with probability \({\mathcal {P}}(Y; X \vert M)\) by permuting all values of each variable in the data set,
The baseline-adjusted cumulative fraction of information can be obtained by subtracting the fraction of the cumulative information (Eq. 13) from the expected fraction of the cumulative information under the assumption of independent and identically distributed random variables,
Specifically, the average cumulative mutual information over all distinct permutations with \(|X_i |= a_i\), \(i = 1, \ldots , r\) and \(|Y_j |= b_j\), \(j = 1, \ldots , c\) has constant marginal sum vectors \(a =[a_i]\) and \(b =[b_j]\). Therefore, the cumulative information overlap between X and Y,
can be summarized in the form of an \(r \times c\) cumulative contingency table, \(M = [n_{ij}]_{j = 1\cdots c}^{i = 1 \cdots r}\) (Fig. 8), with \(n_{ij}\) being a specific realization of the joint cumulative probability given row marginal \(a_i\) and column marginal \(b_j\).
By rearranging the sums in Eq. A4 and rewriting the sum over all permutations of the variable values as a sum over all admissible values of \(n_{ij}\), we get
where \({\mathcal {P}}(n_{ij}, a_i, b_j \vert M)\) is the probability of encountering an associative cumulative contingency table subject to fixed marginals.
An \(r \times c\) cumulative contingency table \({\mathcal {M}}\) relating two clusterings \({\tilde{X}}\) and \({\tilde{Y}}\), with row marginals \(a_i = \sum _{j = 1}^c n_{ij}\) and column marginals \(b_j = \sum _{i =1}^r n_{ij}\). The two marginal sum vectors \(a = [a_i]\) and \(b = [b_j]\) are constant and satisfy the fixed-marginals condition \(\sum _{i = 1}^{r} a_i = \sum _{j = 1}^{c} b_j = N\)
The probability of encountering an associative cumulative contingency table subject to fixed marginals, in which the cell in the i-th row and j-th column equals \(n_{ij}\), is given by the hypergeometric distribution,
The hypergeometric distribution describes the probability of \(b_j - n_{ij}\) successes in \(b_j - 1\) draws without replacement from a finite population of \(r - 1\) elements, of which \(r - i\) are classified as successes. Its support is bounded by \(\max (0, i + b_j - r) \le n_{ij} \le \min (i, b_j)\).
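These support bounds are the generic hypergeometric constraint: the count can be neither negative nor exceed either marginal. As a quick sanity check, the sketch below evaluates the hypergeometric pmf over exactly this support using SciPy's parametrization (population size M, n success states, N draws, which play the roles of the paper's \(r\), \(i\), and \(b_j\)); the numbers are arbitrary illustrations.

```python
# Hypergeometric support bounds: max(0, N + n - M) <= k <= min(n, N),
# the analogue of max(0, i + b_j - r) <= n_ij <= min(i, b_j) in the text.
from scipy.stats import hypergeom

M, n, N = 20, 7, 12                            # population, successes, draws
k_min, k_max = max(0, N + n - M), min(n, N)    # admissible range of counts
pmf = [hypergeom.pmf(k, M, n, N) for k in range(k_min, k_max + 1)]
print(k_min, k_max, round(sum(pmf), 6))        # pmf sums to 1 over the support
```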
Similarly, the distance \(\Delta y_i(M)\) between two consecutive ordered values is described by a binomial distribution,
where the upper limit is given by \(k_\text {max} = \min (n - b_j + 1, r - i)\) and \({\mathcal {N}}\) is the normalization constant:
Summarizing all the single parts of Eq. A4, the final formula for the expected fraction of cumulative information under the assumption of the hypergeometric model of randomness is given by
Appendix B Monotonicity conditions for total cumulative mutual information
In the following, we prove that the expected cumulative mutual information under the independence assumption of random variables, \(\hat{{\mathcal {I}}}_0(Y; X)\), is monotonically increasing with the number of features in the subset, i.e.,
with \(X' = X \cup \{ \chi \}\) for some \(\chi \not \in X\). We closely follow the proof for the baseline-correction term in the discrete case with mutual information (Mandros et al. 2017).
Let the row and column marginals of \(Y, X, X'\) be \(a_i\) for \(i = 1\ldots R\), \(b_j\) for \(j = 1\ldots C\) and \(b_j'\) for \(j = 1\ldots C'\), respectively. We note that \(C' > C\). In order to show that
we define a relation between the cumulative contingency tables \({\mathcal {M}} = {\mathcal {M}}(Y; X)\) and \(\mathcal {M'} = {\mathcal {M}}(Y; X')\) via the projection operator \(\pi : \mathcal {M'} \rightarrow {\mathcal {M}}\). It links the projection \(\pi : V(X') \rightarrow V(X)\) of values of \(X'\) onto values of X, defined by \(\pi (X') = X\), with a projection between the sets of cumulative contingency tables: the counts in the column of \(\pi (M')\) corresponding to a value \(x \in V(X)\) are obtained as the sum of the columns in \(M'\) corresponding to \(\pi ^{-1}(x)\). Therefore, it remains to show that for all \(M \in {\mathcal {M}}\):
From the chain rule of cumulative mutual information (Rao et al. 2004; Wang et al. 2003; Rao 2005), it follows that \(\hat{{\mathcal {I}}}(Y; X \vert M) \le \hat{{\mathcal {I}}}(Y; X \vert M')\) for \(M = \pi (M')\). Thus, showing the relation \({\mathcal {P}}(Y; X \vert M) = \sum _{M' \in \pi ^{-1}(M)} {\mathcal {P}}(Y; X \vert M')\) concludes the proof. We proceed by contradiction.
Formally, let \(S_n\) denote the symmetric group of degree n, i.e., \(S_n\) consists of all n! bijections \(\sigma : \{1\ldots n\} \rightarrow \{1\ldots n\}\). For a bijection \(\sigma \in S_n\), we denote the permuted version of Y as \(Y_\sigma \). Then, for any cumulative contingency table \(N \in {\mathcal {M}}(Y; Z)\), \(S_n[N] = \{\sigma \in S_n: M(Y_\sigma ; Z) = N \}\) denotes the set of permutations that result in N. Let \(\sigma \in S_n \setminus S_n[M]\). This means that \(M_{ij}(Y; X) \ne M_{ij}(Y_\sigma ; X)\) for at least one cell (i, j). Further, denote the set of all indices of values of \(X'\) that are projected down to X by
for which, by definition, follows that
Since for at least one index \(j' \in \pi ^{-1}(j)\) we get \(M'_{ij'}(Y; X') \ne M'_{ij'}(Y_\sigma ; X')\), we also find \(\sigma \not \in S_n[M']\) and can conclude
Now let \(N' \in {\mathcal {M}}(Y; X')\) with \(\pi (N') \ne M\), and assume that \(S_n[M] \cap S_n[N'] \ne \emptyset \), i.e., there is a \(\sigma \in S_n[M] \cap S_n[N']\). Let us denote \(N = \pi (N')\). Since \(S_n[M] \cap S_n[N] = \emptyset \), we know that \(\sigma \not \in S_n[N]\). However, it follows from Eq. B14 that \(\sigma \not \in S_n[N']\) – a contradiction and, hence,
and
\(\square \)
Appendix C Gradient boosting decision trees
We used LightGBM (Ke et al. 2017), a recent modification of the gradient-boosting decision-trees algorithm (Friedman 2001) that improves efficiency and scalability without sacrificing performance. The following settings, found by hyper-parameter tuning, were used: number of leaves (num_leaves, 1% of the number of samples), number of iterations (n_estimators, 2000), and model depth (max_depth, -1, i.e., unrestricted).
During training, i.e., the model optimization, we applied regularization to automatically select the inflection point at which the performance on the test data begins to degrade while the performance on the training data continues to improve. The data set was partitioned into 10 groups (so-called folds), using 9 folds to build the model and the remaining fold to test it (10-fold cross-validation). To reduce variability, we performed five rounds of cross-validation with different partitions and averaged the rounds to obtain an estimate of the model's predictive performance. We monitored the \(\ell _1\) and \(\ell _2\) norms (Friedman 2001; James et al. 2013) and simultaneously penalized the model optimization ("learning") process on the 9 folds to minimize both the squared residuals and the complexity of the model (eval_metric, ["l1", "l2_root"]), stopping the learning process as soon as one metric on the remaining fold did not improve in the last \(n = 50\) rounds (early_stopping_rounds, 50).
Appendix D Feature-subset search
We performed a feature-subset search using CMI, MAC, UDS, MCDE, and TCMI on the octet-binary compound semiconductors data set. Results of the feature-subset searches with TCMI are listed in Table 7 and with CMI, MAC, UDS, and MCDE in Table 8. The CMI, MAC, and MCDE dependence measures identify feature subsets involving only one atomic species. Since an octet-binary compound semiconductor is uniquely determined by the atomic numbers of both atomic species, i.e., at least one atomic property of each species must be considered, CMI, MAC, and MCDE led to unreliable results. UDS was likewise excluded due to issues with permutation and scale invariance (cf. Table 4); consequently, none of CMI, MAC, UDS, and MCDE were used further for model construction.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Regler, B., Scheffler, M. & Ghiringhelli, L.M. TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions. Data Min Knowl Disc 36, 1815–1864 (2022). https://doi.org/10.1007/s10618-022-00847-y