1 Introduction

The past two decades have been marked by an explosion in the availability of scientific data and significant improvements in statistical data analysis. In particular, the physical sciences have seen an unprecedented surge in data exploration aimed at the data-driven discovery of statistical dependencies of physical variables relevant to a property of interest. These developments culminated in the emergence of a new paradigm in science, the so-called “big-data driven” science (Hey et al. 2009).

1.1 Feature selection

The identification of relevant variables, i.e., the properties or driving variables of a process or of a system’s property, has propelled investigations aimed at understanding the underlying processes that generated the data (Guyon and Elisseeff 2003). Such a variable \(X \in \vec {X}\) may be an attribute, a parameter, or a combination of properties measured or obtained from experiments or simulations. The fundamental challenge is to find a functional dependency \(f: \vec {X}' \mapsto Y\) between a subset of variables \(\vec {X}' \subseteq \vec {X}\) and a certain output Y (target, response function). The objective is to find a set of variables (the so-called features) that maximizes a feature-selection criterion \({\mathcal {Q}}\) with respect to a property of interest Y (Blum and Langley 1997; Kohavi and John 1997),

$$\begin{aligned} \vec {X}^* = \mathop {\hbox {arg max}}\limits _{\vec {X}' \subseteq \vec {X}} {\mathcal {Q}}(Y; \vec {X}') \ . \end{aligned}$$
(1)

Feature selection comprises two parts: (i) the choice of a search strategy and (ii) a feature-selection criterion \({\mathcal {Q}}\) for evaluating a feature-subset’s relevance.
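Equation 1 can be read directly as an optimization over feature subsets. The following Python sketch illustrates the naive exhaustive baseline for a generic, user-supplied criterion \({\mathcal {Q}}\); the function name `best_subset` and the representation of the data as a mapping from feature names to columns are our illustrative choices, not part of the method described in this work.

```python
from itertools import combinations

def best_subset(X, y, criterion, max_size=None):
    """Exhaustively evaluate Eq. (1): arg max over subsets X' of Q(Y; X')."""
    features = list(X)                      # X: mapping from feature name to data column
    max_size = max_size if max_size is not None else len(features)
    best, best_score = (), float("-inf")
    for k in range(1, max_size + 1):
        for subset in combinations(features, k):
            score = criterion(y, {f: X[f] for f in subset})
            if score > best_score:
                best, best_score = subset, score
    return best, best_score
```

Section 7 replaces this exhaustive enumeration with a branch-and-bound search that prunes sub-trees which cannot improve on the current best subset.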

Fig. 1

Empirical cumulative entropy \(\hat{{\mathcal {H}}}(Y)\) of a normal distribution for 50 data samples, shown as ticks at the bottom of the figure. Insets a) and b) show the (ground-truth) probability density (PDF) and cumulative probability (CDF) of the normal distribution, the empirical cumulative distribution \({\hat{P}}(Y \le y)\), and the estimated probability density \({\hat{p}}(y)\). The estimated probability density was obtained by optimizing the bandwidth of a kernel-density estimator through 10-fold cross-validation. Histograms of the PDF and CDF are also drawn to illustrate how continuous distributions can be approximated by discrete, discontinuous functions

1.1.1 Search strategies

There are several search strategies to identify the relevant features of a data set (Narendra and Fukunaga 1977; Siedlecki and Sklansky 1993; Pudil et al. 1994; Eberhart and Kennedy 1995; Michalewicz and Fogel 2004), ranging from optimal solvers (such as exhaustive search or accelerated methods based on the monotonic property of a feature-selection criterion) to sub-optimal solvers (such as greedy, heuristic, or stochastic solvers) (Guyon and Elisseeff 2003; Kohavi and John 1997; Narendra and Fukunaga 1977; Siedlecki and Sklansky 1993; Pudil et al. 1994; Whitney 1971; Pudil et al. 2002; Marill and Green 1963; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016; Forsati et al. 2011; Reunanen 2006). Optimal solvers explore all feature-subset combinations for a global optimum and, as such, are generally impractical for data sets with a large number of features due to cost and time constraints on computer resources. Sub-optimal search strategies (e.g., sequential floating forward selection (Pudil et al. 1994; Whitney 1971), sequential backward elimination (Marill and Green 1963), and the minimal-redundancy-maximal-relevance criterion (Peng et al. 2005)), conversely, balance accuracy and speed, but may not find the optimal set of features with respect to a targeted property. A search strategy that can be used as either an optimal or a sub-optimal solver is branch and bound (Narendra and Fukunaga 1977; Pudil et al. 2002; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016). Branch and bound implicitly performs an exhaustive search, but uses an additional bounding criterion to discard feature subsets whose feature-selection criteria are lower than that of the current best feature subset in the search.

1.1.2 Feature-selection criterion

The feature-selection criterion \({\mathcal {Q}}\) can be used as a score that allows the identified features to be ranked by relevance prior to subsequent data analyses. The academic community has extensively explored feature-selection criteria for evaluating a feature’s relevance (Khaire and Dhanalakshmi 2019), including distance measures (Basseville 1989; Almuallim and Dietterich 1994), dependency measures (Modrzejewski 1993), consistency measures (Arauzo-Azofra et al. 2008), and information measures (Vergara and Estévez 2014). Ideally, feature-selection criteria are not restricted to specific types of dependencies, are robust against imprecise values in the data, and are deterministic, i.e., the feature selection is consistent and reproducible for the same set of variables, settings, and data. The prevailing method for quantifying multivariate dependences is mutual information, which determines the relevance of variables in terms of their joint mutual dependence to a property of interest (Shannon 1948).

There are several reasons to consider mutual-information-based quantities for feature selection. The two most important are: (i) mutual information quantifies multivariate nonlinear statistical dependencies and (ii) mutual information provides an intuitive quantification of the relevance of a feature subset \(\vec {X}' \subseteq \vec {X}\) relative to an output Y (Vergara and Estévez 2014): it is bounded from below (for variables statistically independent of the output) and can be bounded from above (for functionally dependent variables), increases as the sample size increases, and quantifies the strength of the dependence within a mathematically rigorous framework from communication theory (Shannon 1948). However, mutual information requires probability distributions, which are problematic for high-dimensional data sets and are difficult to obtain from real-valued data samples of continuous distributions.

1.2 Our approach

We propose total cumulative mutual information (TCMI): a non-parametric, robust, and deterministic measure of the relevance of mutual dependences of continuous distributions between variable sets of different cardinality. TCMI can be applied if the dependence between a set of variables and an output is not yet known and the dependence is nonlinear and multivariate. Like mutual information, TCMI relates the strength of the dependence between a set of variables and an output to the number of data samples. In addition, TCMI relates the strength of the dependence to the cardinality of the subsets. Thus, TCMI allows an unbiased comparison between different sets of variables without depending on externally adjustable parameters.

TCMI is based on cumulative mutual information and inherits many of the properties of mutual-information-based feature-selection measures: it is bounded from below and above and increases monotonically as features are added to a candidate feature set, but only until all variables related to an output are included. In contrast to other feature-selection measures based on cumulative mutual information, TCMI uses cumulative probability distributions. Cumulative probability distributions can be obtained directly from empirical data of continuous distributions, without the need to quantize the set of variables prior to estimating a feature subset’s dependence on a property of interest.

We combine TCMI with the branch-and-bound algorithm (Narendra and Fukunaga 1977; Pudil et al. 2002; Land and Doig 1960; Yu and Yuan 1993; Clausen 1999; Morrison et al. 2016), which has proven to be efficient in the discovery of nonlinear functional dependencies (Zheng and Kwoh 2011; Mandros et al. 2017). TCMI therefore identifies a set of variables that are statistically related to an output. As TCMI is model independent, a functional relationship must be constructed (in the following referred to as a model) to relate these features with an output. The model construction is not part of this work, but can be done, for example, through data-analytics techniques such as symbolic regression, both in the genetic-programming (Koza 1994) and in compressed-sensing implementations (Ghiringhelli et al. 2015; Ouyang et al. 2018), or regression tree-based approaches (Breiman et al. 1984).

In brief, our feature-selection procedure can be divided into three steps: In the first step, we quantify the dependence between the set of features and an output as the difference between cumulative marginal and cumulative conditional distributions. In the second step, we estimate the relevance of a feature set by comparing its strength of dependence to the mean dependence of features under the assumption of independent random variables. In the third step, we identify a set of relevant features with the branch-and-bound algorithm to find the set of variables from an initial list that best characterizes an output.

Table 1 Abbreviations used in the manuscript
Table 2 List of symbols and notations used in this paper

1.3 Outline

The remainder of this work is organized as follows. Section 2 discusses the relationship between TCMI and previous work. Section 3 introduces the theoretical background of cumulative mutual information. Section 4 describes the empirical estimation of cumulative mutual information for continuous distributions from limited sample data. Section 5 explains the steps introduced to adjust the cumulative mutual information with respect to the number of data samples and the cardinality of the feature subset. Section 6 introduces TCMI. Section 7 describes in detail the implementation of the feature-subset search using the branch-and-bound algorithm. Section 8 reports on the performance evaluation of TCMI on generated data, standard data sets, and a typical scenario in materials science. In the same section, TCMI is also compared with similar multivariate dependence measures such as cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)). We would like to point out that feature selection is a broad area of research and can be achieved using a variety of techniques; this work therefore focuses on feature-selection methods based on mutual information that can be applied prior to subsequent data-analysis tasks. To illustrate this, we provide a real-world example in Sect. 8.2.2 by building a model from the features identified with TCMI and comparing its performance to in-built feature-selection methods that perform feature selection during model construction. Finally, Sects. 9 and 10 present the discussion and conclusions of this work. Abbreviations, notations, and terminologies are summarized in Tables 1 and 2.

2 Related work

Many dependence measures, such as Pearson’s R and Spearman’s rank \(\rho \) correlation coefficients (Pearson 1896; Spearman 1904), distance correlation (DCOR: Székely et al. (2007), Székely and Rizzo (2014)), kernel density estimation (KDE: Scott (1982), Silverman (1986)), or k-nearest neighbor estimation (k-NN: Kozachenko and Leonenko (1987)), are limited to bivariate dependencies (Pearson, Spearman), are limited to specific types of dependencies (Spearman, DCOR), or require assumptions about the functional form of f (KDE, k-NN). Moreover, bivariate extensions (Schmid and Schmidt 2007), KDE, and k-NN are not applicable to high-dimensional data sets.

For high-dimensional data sets, several authors proposed subspace-slicing techniques (Fouché and Böhm 2019; Fouché et al. 2021; Keller et al. 2012), which repeatedly apply conditions on each variable and perform statistical hypothesis tests to estimate the degree of dependence on an output. However, these methods must enumerate all possible combinations of variables and are therefore computationally intractable for feature-selection tasks. In addition, the strength of their dependences is not related to the cardinality of feature subsets and therefore cannot be used to compare different sets of variables.

Model-dependent methods such as data-analytics techniques (Koza 1994; Ghiringhelli et al. 2015; Ouyang et al. 2018; Breiman et al. 1984) perform feature selection while creating a model. Alternatively, post-hoc analysis tools such as unified dependence measures (Lundberg and Lee 2017) can be used to assign each variable an importance value for a particular estimation. However, these methods add an additional degree of complexity, which makes it difficult to reliably assess the dependence among variables. Another approach is given by information-theoretic dependence measures. These measures are based on mutual information and ascertain whether or not the values of a set of variables are related to an output. As a result, they provide a model-independent approach to estimating the strength of dependences between variables.

Multivariate extensions to mutual information (e.g., interaction information (McGill 1954) and total correlation (Watanabe 1960)) require knowledge about the underlying probability distributions and are therefore difficult to estimate: estimations either require large amounts of data and need to be tailored to each new data set at hand (Belghazi et al. 2018) or, like the Kozachenko-Leonenko estimator (Kozachenko and Leonenko 1987; Kraskov et al. 2004), are affected by the curse of dimensionality (Bellman 1957). Recently, several authors proposed related approaches to extending mutual information: cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)). All three methods estimate the strength of dependence between a set of variables from cumulative probability distributions, and thus can be viewed as alternative measures of uncertainty that extend Shannon entropy (and mutual information) to random variables of multivariate continuous distributions.

CMI quantifies multivariate mutual dependences by considering the cumulative distributions of variables and by heuristically approximating the conditional cumulative entropy via data summarization and clustering. MAC is based on Shannon entropy over discretized data, where the discretization is obtained by maximizing the normalized total correlation with respect to cumulative entropy. UDS uses optimal discretization to compute the conditional cumulative entropy, where the Shannon entropy defines the number of bins required to estimate the conditional cumulative entropy. All three dependence measures expose adjustable parameters to optimize the quantization of continuous distributions; the choice of these parameters has a strong impact on the strength of mutual dependence between a set of variables \(\vec {X} = \{ X_1, \ldots , X_n \}\) and an output Y, and thus on the ranking induced by the relevance function and a feature-selection criterion. This makes these measures impractical for feature selection.

Our approach, TCMI, extends CMI, but does not require quantizing real-valued data samples to estimate the joint cumulative distribution of continuous distributions. TCMI therefore does not require data summarization techniques to estimate multivariate dependences between continuous distributions, unlike similar approaches such as MAC and UDS. TCMI is non-parametric, as opposed to other estimation methods based on mutual information, e.g., neural networks (Belghazi et al. 2018) or mutual-information-based feature-selection algorithms originally developed for discrete data (Kwak and Choi 2002; Chow and Huang 2005; Estevez et al. 2009; Hu et al. 2011; Reshef et al. 2011; Bennasar et al. 2015). TCMI therefore allows a reliable comparison of the strength of dependence between different sets of variables. In addition, TCMI relates the strength of a dependence to a feature subset’s cardinality and the number of data samples by basing the score on the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010).

3 Theoretical background

Mutual information and all measures presented in the following quantify relevance by means of the similarity between two distributions \(U(\vec {X}, Y)\) and \(V(\vec {X}, Y)\) using the Kullback-Leibler divergence, \(D_\text {KL}(U(\vec {X}, Y) \Vert V(\vec {X}, Y))\) (Kullback and Leibler 1951). They do not require any explicit modeling to quantify linear and nonlinear dependencies, monotonically increase with the cardinality of a feature subset \(\vec {X}' \subseteq \vec {X}\),

$$\begin{aligned}&\min _{X \in \vec {X}'} D_\text {KL}(U(\vec {X}' \setminus X, Y) \Vert V(\vec {X}' \setminus X, Y)) \nonumber \\&\quad \le D_\text {KL}(U(\vec {X}', Y) \Vert V(\vec {X}', Y)) \ , \end{aligned}$$
(2)

and are invariant under invertible transformations such as translations and reparameterizations that preserve the order of the values of variables \(\vec {X}\) and of an output Y (Kullback 1959; Vergara and Estévez 2014).

For illustration purposes, only the case of a single variable X and an output Y is discussed in the theoretical section. The generalization to multiple variables follows directly from the independence assumption of random variables, as will be shown in later sections.

3.1 Mutual information

Mutual information (Shannon and Weaver 1949; Cover and Thomas 2006) relates the joint probability distribution p(x, y) of two discrete random variables to the product of their marginal distributions p(x) and p(y),

$$\begin{aligned} I(Y; X)&= \sum _{y \in Y} \sum _{x \in X} p(y, x) \log \frac{p(y, x)}{p(y) p(x)} \nonumber \\&\equiv D_\text {KL}(p(y, x) \Vert p(y) p(x)) \ . \end{aligned}$$
(3)

Mutual information is non-negative, is zero if and only if the variables are statistically independent, \(p(x, y) = p(x) p(y)\) (independence assumption of random variables), and increases monotonically with the mutual interdependence of variables otherwise. Further, mutual information indicates the reduction in the uncertainty of Y given X as \(I(Y; X) = H(Y) - H(Y \vert X)\), where H(Y) denotes the Shannon entropy and \(H(Y \vert X)\) the conditional entropy (Shannon 1948). Shannon entropy H(Y) is defined as the expected value of the negative logarithm of the probability density p(y),

$$\begin{aligned} H(Y) = - \sum _{y \in Y} p(y) \log p(y) \ , \end{aligned}$$
(4)

and can be interpreted as a measure of the uncertainty on the occurrence of events y whose probability density p(y) is described by the random variable Y.

The conditional entropy \(H(Y \vert X)\) quantifies the amount of uncertainty about the value of Y, provided the value of X is known. It is given by

$$\begin{aligned} H(Y \vert X) = - \sum _{y \in Y} \sum _{x \in X} p(y, x) \log p(y \vert x) \ , \end{aligned}$$
(5)

where \(p(y \vert x) = p(y, x) / p(x)\) is the conditional probability of y given x. Clearly, \(0 \le H(Y \vert X) \le H(Y)\) with \(H(Y \vert X) = 0\) if variables X and Y are related by functional dependency and \(H(Y \vert X) = H(Y)\) if variables are independent of each other.

Although mutual information is restricted to the closed interval \(0 \le I(Y; X) \le H(Y)\), the upper bound is still dependent on Y. To facilitate comparisons, mutual information is normalized,

$$\begin{aligned} D(Y; X) = \frac{I(Y; X)}{H(Y)} = \frac{H(Y) - H(Y \vert X)}{H(Y)} \ . \end{aligned}$$
(6)

Normalized mutual information, hereafter referred to as fraction of information (also known as coefficient of constraint (Coombs et al. 1970), uncertainty coefficient (Press et al. 1988), or proficiency (White et al. 2004)), quantifies the proportional reduction in uncertainty about Y when X is given. It is a map \(D: {\mathbb {R}} \times {\mathbb {R}} \rightarrow [0, 1]\), where 0 and 1 represent statistical independence and functional dependence, respectively (Reimherr and Nicolae 2013).
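For discrete samples, Eqs. 3-6 can be evaluated directly from frequency counts. The short sketch below (plain Python, names ours) only illustrates these textbook definitions and is not part of TCMI itself.

```python
import math
from collections import Counter

def fraction_of_information(y, x):
    """Fraction of information D(Y;X) = I(Y;X) / H(Y) for discrete samples (Eqs. 3-6)."""
    n = len(y)
    p_y, p_x, p_yx = Counter(y), Counter(x), Counter(zip(y, x))
    h_y = -sum((c / n) * math.log(c / n) for c in p_y.values())       # Eq. (4)
    i_yx = sum((c / n) * math.log(c * n / (p_y[yi] * p_x[xi]))        # Eq. (3)
               for (yi, xi), c in p_yx.items())
    return i_yx / h_y if h_y > 0 else 0.0                             # Eq. (6)

# A functional dependence yields D = 1.
y = [0, 0, 1, 1, 2, 2]
assert abs(fraction_of_information(y, [2 * v for v in y]) - 1.0) < 1e-12
```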

3.2 Probability and cumulative distributions

Mutual information and fraction of information are only defined for discrete distributions. Although mutual information can be generalized to continuous distributions,

$$\begin{aligned} I(Y; X) = \int _{y \in Y} \int _{x \in X} p(y, x) \log \frac{p(y, x)}{p(y) p(x)}\,dx\,dy \ , \end{aligned}$$
(7)

probability densities are not always accessible from sample data and therefore need to be estimated. Common algorithms for probability-density estimations are clustering (Nguyen et al. 2013; Pfitzner et al. 2008; Xu and Tian 2015), discretization (Fayyad and Irani 1993; Dougherty et al. 1995; Nguyen et al. 2014a), and density estimation (Keller et al. 2012; Garcia 2010; Bernacchia and Pigolotti 2011; O’Brien et al. 2014, 2016). However, these methods implicitly introduce adjustable parameters whose choice has a strong impact on the strength of mutual dependence between a set of variables \(\vec {X} = \{ X_1, \ldots , X_n \}\) and an output Y, and thus on the ranking induced by the relevance function and a feature-selection criterion (cf., Fig. 1). In practice, such approaches are extremely dependent on the applied parameter set and therefore are sensitive to the scale of variables (cf., Sect. 8).

An alternative to probability distributions is to use cumulative probability distributions to determine the mutual dependence between variables. The cumulative probability distribution P (and the residual cumulative distribution \(P' \approx 1 - P\)) of a variable X evaluated at x describes the probability that X takes on a value less than or equal to x (or greater than or equal to x, respectively),

$$\begin{aligned} P(x)&:= P(X \le x) \ , \end{aligned}$$
(8)
$$\begin{aligned} P'(x)&:= P(X \ge x) = 1 - P(X < x) \ . \end{aligned}$$
(9)

If the derivatives exist, cumulative distributions are the anti-derivatives of the probability densities,

$$\begin{aligned} P(x) := P(X \le x) = \int _{-\infty }^x p(x')\,dx' \ . \end{aligned}$$
(10)

Both residual and cumulative distributions are defined for continuous and discrete variables and are based on accumulated statistics. As such, they are more regular and less sensitive to statistical noise than probability distributions (Crescenzo and Longobardi 2009b, a). In particular, they are monotonically increasing and decreasing, respectively, i.e., \(P(x_1) \le P(x_2)\) or \(P'(x_1) \ge P'(x_2)\), \(\forall x_1 \le x_2\), with limits

$$\begin{aligned} \begin{array}{r} \displaystyle \lim _{x \rightarrow -\infty } P(x) = 0\\ \displaystyle \lim _{x \rightarrow \infty } P(x) = 1 \end{array}\ , \quad \begin{array}{r} \displaystyle \lim _{x \rightarrow -\infty } P'(x) = 1\\ \displaystyle \lim _{x \rightarrow \infty } P'(x) = 0 \end{array} \ . \end{aligned}$$
(11)

Like probability distributions, cumulative and residual cumulative distributions are invariant under a change of variables; however, they are invariant only under reparameterizations that preserve the order of the values of each variable \(X \in \vec {X}\) and of Y. Positive monotonic transformations \({\mathcal {T}}: {\mathbb {R}} \rightarrow {\mathbb {R}}\),

$$\begin{aligned}&P(x) = P({\mathcal {T}}(x)) \quad \forall x \in X: x \mapsto {\mathcal {T}}(x) \nonumber \\&\quad \text {such that} \quad x_1< x_2 \Rightarrow {\mathcal {T}}(x_1) < {\mathcal {T}}(x_2) \ , \end{aligned}$$
(12)

such as translations and nonlinear scalings of variables are among the transformations under which cumulative distributions remain invariant. In contrast, invertible and especially non-invertible mappings (Mira 2007) (such as inversions, \({\mathcal {T}}(X) \mapsto \pm X\), and non-bijective transformations, e.g., \({\mathcal {T}}(X) = \pm |X |\)) change the order of the values of a variable and with it the cumulative distribution. Consequently, such mappings must be considered as additional variables during feature selection if it is expected that such transformations might be related to an output.
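The invariance stated in Eq. 12 is easy to check numerically. In the following sketch (assuming NumPy; variable names ours), the empirical cumulative distribution evaluated at corresponding points is unchanged by an order-preserving transformation but not by an order-reversing one.

```python
import numpy as np

def ecdf(sample, values):
    """Empirical P(X <= value), evaluated at the given points."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, values, side="right") / len(sample)

rng = np.random.default_rng(0)
x = rng.normal(size=50)

t = lambda v: 3.0 * np.exp(v) + 1.0              # positive monotonic (order-preserving)
assert np.allclose(ecdf(x, x), ecdf(t(x), t(x)))      # Eq. (12) holds

assert not np.allclose(ecdf(x, x), ecdf(-x, -x))      # order-reversing map: not invariant
```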

3.3 Cumulative mutual information

Cumulative mutual information is an alternative measure of uncertainty that extends Shannon entropy (and mutual information) to random variables of continuous distributions. Like mutual information, it increases monotonically with the cardinality of the set of variables (Eq. 2). Analogous to mutual information, cumulative mutual information describes the inherent dependence expressed in the joint cumulative distribution \(P(x, y) = P(X \le x, Y \le y)\) of the random variables X and Y relative to the product of their marginal cumulative distributions P(x) and P(y),

$$\begin{aligned} {\mathcal {I}}(Y; X)&= \int _{y \in Y} \int _{x \in X} P(y, x) \log \frac{P(y, x)}{P(y) P(x)} \,dx\,dy \nonumber \\&= D_\text {KL}(P(y, x) \Vert P(y) P(x)) \ . \end{aligned}$$
(13)

The independence assumption of random variables, \(P(y, x) = P(y) P(x)\), induces a measure that is again zero only if variables X and Y are statistically independent, and non-negative otherwise. Similarly to mutual information, cumulative mutual information quantifies the degree of dependency as the reduction in the uncertainty of Y given X, i.e., \({\mathcal {I}}(Y; X) = {\mathcal {H}}(Y) - {\mathcal {H}}(Y \vert X)\). It is a function of cumulative entropy \({\mathcal {H}}(Y)\) and conditional cumulative entropy \({\mathcal {H}}(Y \vert X)\),

$$\begin{aligned} {\mathcal {H}}(Y)&= - \int _{y \in Y} \int _{x \in X} P(y, x) \log P(y) \,dx\,dy \end{aligned}$$
(14)
$$\begin{aligned} {\mathcal {H}}(Y \vert X)&= - \int _{y \in Y} \int _{x \in X} P(y, x) \log P(y \vert x) \,dx\,dy \ , \end{aligned}$$
(15)

where \(P(y \vert x) = P(y, x) / P(x)\) is the conditional cumulative distribution of \(Y \le y\) given \(X \le x\) (cf., Table 2). Again, \({\mathcal {H}}(Y \vert X) = 0\) if variables X and Y are functionally dependent and \({\mathcal {H}}(Y \vert X) = {\mathcal {H}}(Y)\) if variables X and Y are independent of each other.

Bounds restrict cumulative mutual information to a closed interval \(0 \le {\mathcal {I}}(Y; X) \le {\mathcal {H}}(Y)\) with an upper bound dependent on Y. For this reason, cumulative mutual information is normalized,

$$\begin{aligned} {\mathcal {D}}(Y; X) = \frac{{\mathcal {I}}(Y; X)}{{\mathcal {H}}(Y)} = \frac{{\mathcal {H}}(Y) - {\mathcal {H}}(Y \vert X)}{{\mathcal {H}}(Y)} \ , \end{aligned}$$
(16)

and, likewise to mutual information, is hereafter referred to as fraction of cumulative mutual information.

4 Empirical estimations of cumulative entropy and cumulative mutual information

The closed-form expression of cumulative mutual information (Eq. 13) quantifies the dependence of a set of variables based on the assumption of smooth and differentiable cumulative distributions. Due to the limited availability of data, however, the exact functional shape of the cumulative distribution is not directly accessible and hence must be empirically inferred from a limited set of sample data.

For this reason, let us assume an empirical sample \(\{(y_1, x_1)\), \((y_2, x_2)\), \(\ldots \), \((y_n, x_n)\}\) drawn independently and identically distributed (i.i.d.) according to the joint distribution of X and Y. Such sample data induces empirical (cumulative) probability distributions for all variables \(Z \in \{Y, X\}\), which lead to empirical estimates \(\hat{{\mathcal {E}}}\) of an estimator \({{\mathcal {E}}}\) (cf., Table 2).

Based on the maximum likelihood estimate (Dutta 1966; Rossi 2018), the cumulative probability distribution \({\hat{P}}(Z \le z)\) can be obtained by counting the frequency of occurring values of a variable Z:

$$\begin{aligned}&{\hat{P}}(Z \le z) = \frac{1}{n} \sum _{i = 1}^n {\mathbf {1}}_{z_i \le z} = \frac{1}{n} \left|\{\,i \mid z_i \le z\,\} \right|\ ,\nonumber \\&\quad \forall z_i \in Z \ , \quad z \in Z,\ Z \in \{ Y, X \} \ , \end{aligned}$$
(17)

where \({\mathbf {1}}_A\) denotes the indicator function that is one if A is true, and zero otherwise. Equation 17 asymptotically converges to \(P(Z \le z)\) as \(n \rightarrow \infty \) for every value of \(z \in Z\) (Glivenko-Cantelli theorem: Glivenko (1933), Cantelli (1933)). Thus, any empirical estimate, \(\hat{{\mathcal {E}}}\), based on empirical cumulative distributions converges pointwise as \(n \rightarrow \infty \) to the actual value of \({\mathcal {E}}\), i.e., \(\hat{{\mathcal {E}}}(Z) \rightarrow {\mathcal {E}}(Z)\) (Rao et al. 2004; Crescenzo and Longobardi 2009a).

4.1 Empirical cumulative entropy

For i.i.d. random samples that may contain repeated values, the maximum likelihood estimate of the cumulative entropy \({\mathcal {H}}\) (Eq. 14) can be obtained by calculating the empirical cumulative distribution \({\hat{P}}\) according to Eq. 17,

$$\begin{aligned} \begin{aligned} \hat{{\mathcal {H}}}(Y)&= - \sum _{i = 1}^{k - 1} \Delta y_i {\hat{P}}(y) \log {\hat{P}}(y) \\&= - \sum _{i = 1}^{k - 1} \left( y_{(i + 1)} - y_{(i)} \right) \frac{n_i}{n} \log \frac{n_i}{n} \ , \end{aligned} \end{aligned}$$
(18)

where \(y_{(i)}\) denotes the values \(y_{(0)}< y_{(1)}< \cdots < y_{(k)}\) occurring in the data set in sorted order of Y with \(y_{(0)} = -\infty \), multiplicity \(n_i = \left|\{ j \in n: y_{(i - 1)} < y_j \le y_{(i)}\} \right|\), and constraint \(n = \sum _{i = 1}^k n_i\).
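As an illustration, the first line of Eq. 18 can be written compactly in terms of the empirical cumulative distribution of Eq. 17. The following NumPy sketch (names ours) is a direct, unoptimized transcription.

```python
import numpy as np

def cumulative_entropy(y):
    """Empirical cumulative entropy: -sum_i (y_(i+1) - y_(i)) P(Y <= y_(i)) log P(Y <= y_(i))."""
    values, counts = np.unique(y, return_counts=True)   # distinct values, sorted
    cdf = np.cumsum(counts) / len(y)                     # P(Y <= y_(i)), Eq. (17)
    gaps = np.diff(values)                               # y_(i+1) - y_(i)
    p = cdf[:-1]                                         # 0 < P < 1 on each interval
    return -np.sum(gaps * p * np.log(p))

# Example: 50 samples from a standard normal distribution (cf. Fig. 1).
rng = np.random.default_rng(0)
print(cumulative_entropy(rng.normal(size=50)))
```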

4.2 Empirical conditional cumulative entropy

Similar to empirical cumulative entropy, conditional cumulative entropy can be estimated by

$$\begin{aligned} \hat{{\mathcal {H}}}(Y \vert \vec {X})&= - \sum _{i = 1}^{n - 1} \sum _{j = 1}^{n - 1} \Delta y_i \Delta \vec {x}_j {\hat{P}}(y_i, \vec {x}_j) \log {\hat{P}}(y_i \vert \vec {x}_j) \nonumber \\&= - \sum _{i = 1}^{n - 1} \sum _{j_1 = 1}^{n - 1} \ldots \sum _{j_d = 1}^{n - 1} \bigl ( y_{i + 1} - y_i \bigr ) \bigl ( x^{(1)}_{j_1 + 1} - x^{(1)}_{j_1} \bigr ) \cdot \nonumber \\&\qquad \ldots \cdot \bigl ( x^{(d)}_{j_d + 1} - x^{(d)}_{j_d} \bigr ) {\hat{P}}(y_i, \vec {x}_j) \log {\hat{P}}(y_i \vert \vec {x}_j) \ , \end{aligned}$$
(19)

where \({\hat{P}}(y_i, \vec {x}_j)\) denotes the joint cumulative distribution of \(y_i \in Y\) and \(\vec {x}_j \in \vec {X}\), \(\vec {X} = \{ X_1, \ldots , X_d \}\), and \(x_i^{(k)} \in X_k\) is the i-th component of the k-th variable of the data set (\(k = 1, \ldots , d\)). In contrast to the empirical cumulative entropy, which can be calculated from the set of sample data with linear time complexity \({\mathcal {O}}(n)\), the empirical conditional cumulative entropy has exponential time complexity \({\mathcal {O}}(n^d)\). The non-parametric estimation of the joint or conditional cumulative distribution therefore becomes computationally demanding for data sets with a large number of variables d and of data samples n.
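For a single conditioning variable (\(d = 1\)), Eq. 19 reduces to a double sum over the sample grid. The naive NumPy sketch below follows that special case literally and illustrates why the cost grows quickly with n and d; it is not an optimized implementation.

```python
import numpy as np

def conditional_cumulative_entropy(y, x):
    """Bivariate special case of Eq. (19), evaluated on the grid of sorted sample values."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    ys, xs = np.unique(y), np.unique(x)                  # grid points in Y and X
    h = 0.0
    for i in range(len(ys) - 1):
        dy = ys[i + 1] - ys[i]
        for j in range(len(xs) - 1):
            dx = xs[j + 1] - xs[j]
            p_joint = np.mean((y <= ys[i]) & (x <= xs[j]))   # P(Y <= y_i, X <= x_j)
            p_x = np.mean(x <= xs[j])                        # P(X <= x_j)
            if p_joint > 0.0:
                h -= dy * dx * p_joint * np.log(p_joint / p_x)   # log P(y_i | x_j)
    return h
```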

4.3 Empirical cumulative mutual information

By construction, cumulative entropy is sensitive to the range of Y (cf., Eq. 18). The same is true for the conditional cumulative entropy \({\mathcal {H}}(Y \vert X)\) and its empirical estimate \(\hat{{\mathcal {H}}}(Y \vert X)\) (cf., Eq. 19). The fraction of cumulative mutual information \({{\mathcal {D}}}\), i.e., the ratio between cumulative mutual information and cumulative entropy, is independent of the scale of X and Y (Eq. 16). Formally, its empirical estimate \(\hat{{\mathcal {D}}}\) is given by

$$\begin{aligned} \hat{{\mathcal {D}}}(Y; \vec {X})= & {} 1 - \frac{1}{n} \Biggl [ \sum _{i, j = 1}^{n - 1} \Delta y_i \Delta x_j {\hat{P}}(y_i, \vec {x}_j) \log {\hat{P}}(y_i \vert \vec {x}_j) \Bigm /\ \nonumber \\&\sum _{i,j = 1}^{n - 1} \Delta y_i \Delta x_j {\hat{P}}(y_i) \log {\hat{P}}(y_i) \Biggr ] \ . \end{aligned}$$
(20)

Computationally, we apply the following trick: to eliminate the implicit scale dependence of X, we use the fact that cumulative distributions are invariant under rank-order-preserving transformations \({\mathcal {T}}\) (Eq. 12). All variables can therefore be scaled to \(x' = {\mathcal {T}}(x)\) such that \(\Delta x_i' = x'_{i + 1} - x'_i\) is constant and the volume element \(dx'\) cancels out in the integrals (cf., Eq. 15):

$$\begin{aligned} \hat{{\mathcal {D}}}(Y; X) = 1 - \frac{1}{n} \sum _{j = 1}^{n - 1} \frac{\sum _{i = 1}^{n - 1} \Delta y_i P(y_i, x_j) \log P(y_i \vert x_j)}{\sum _{i = 1}^{n - 1} \Delta y_i P(y_i, x_j) \log P(y_i)} \ . \end{aligned}$$
(21)

Such a transformation is always possible and effectively removes the implicit range dependence of the variables from the computed fraction of cumulative mutual information.
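Putting Eqs. 17-21 together for a single variable X, a direct (unoptimized) transcription might look as follows. The rank transformation of x realizes the trick described above; all names are ours, and ties in x are broken arbitrarily.

```python
import numpy as np

def fraction_of_cumulative_information(y, x):
    """Direct transcription of Eq. (21) for a single variable X."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y)
    order = np.argsort(x, kind="stable")
    y = y[order]                        # reorder the samples by x ...
    x = np.arange(n, dtype=float)       # ... and rank-transform x (order-preserving, Eq. 12)
    ys = np.sort(y)                     # grid of y values, y_1 <= ... <= y_n
    total = 0.0
    for j in range(n - 1):
        num = den = 0.0
        p_x = (j + 1) / n                                    # P(X <= x_j)
        for i in range(n - 1):
            dy = ys[i + 1] - ys[i]
            if dy == 0.0:
                continue
            p_y = (i + 1) / n                                # P(Y <= y_i)
            p_joint = np.mean((y <= ys[i]) & (x <= x[j]))    # P(Y <= y_i, X <= x_j)
            if p_joint > 0.0:
                num += dy * p_joint * np.log(p_joint / p_x)  # ... log P(y_i | x_j)
                den += dy * p_joint * np.log(p_y)            # ... log P(y_i)
        if den != 0.0:
            total += num / den
    return 1.0 - total / n
```

For a set of variables \(\vec {X}\), the inner empirical cumulative distribution becomes a d-dimensional joint distribution, which is what makes the exact computation scale as \({\mathcal {O}}(n^d)\) (cf., Sect. 4.2).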

5 Baseline adjustment

The limited availability of data makes it challenging to compute dependencies with empirical estimators. Because measures are meant to provide a comparison mechanism, empirical estimators need to assign a value (dependence score) close to zero for statistically independent variables and a score close to one for functionally dependent variables. However, empirical estimators based on mutual information are known to never reach their theoretical maximum (functional dependence) or minimum (statistical independence), respectively, and to assign stronger dependences to larger sets of variables regardless of the underlying relationship (Fouché and Böhm 2019; Fouché et al. 2021; Vinh et al. 2009, 2010). Consequently, measures based on mutual information have a considerable inherent bias and may therefore incorrectly identify variables as relevant that are not related to an output Y. To actually compare dependence measures between subsets and between variable sets of different sizes, an adjustment to mutual information is necessary. One solution is to estimate the relevance of a set of variables X with respect to an output Y by comparing the empirical estimator \(\hat{{\mathcal {E}}}\) to its mean \(\hat{{\mathcal {E}}}_0\),

$$\begin{aligned} \hat{{\mathcal {E}}}^*(Y; X) = \hat{{\mathcal {E}}}(Y; X) - \hat{{\mathcal {E}}}_0(Y; X) \ . \end{aligned}$$
(22)

The mean \(\hat{{\mathcal {E}}}_0\) is defined as the average of the empirical estimator over random permutations of the data samples, applied to each variable independently, i.e.,

$$\begin{aligned} \hat{{\mathcal {E}}}_0(Y; X) := \frac{1}{|{\mathcal {M}} |} \sum _{M \in {\mathcal {M}}} \hat{{\mathcal {E}}}(Y_M; X_M) \ , \end{aligned}$$
(23)

where \(M \in {\mathcal {M}}\) is a specific realization of such a permutation. The underlying intuition is that the actual value of an empirical estimator \(\hat{{\mathcal {E}}}\) may be caused by spurious (random) dependences. By considering all random permutations of the data samples of each variable independently, the spurious contribution to the empirical estimator can be factored out and an adjusted, unbiased empirical estimator obtained. The permutations can be computed by enumeration, which, however, is impractical. An alternative description is provided by a hypergeometric model of randomness (Vinh et al. 2009; Romano et al. 2014), also known as the permutation model (Lancaster 1969). Such a model describes the permutation of variables in terms of (cumulative) probability distributions, so that the average can be calculated separately for each sample of a data set with quadratic complexity. Under the independence assumption of random variables (Vinh et al. 2009, 2010), we derive the correction term for cumulative mutual information as follows,

$$\begin{aligned} \hat{{\mathcal {I}}}_0(Y; X)= & {} - \sum _{i = 1}^{r - 1} \sum _{j = 1}^{c} \sum _{n_{ij}} \Delta y_i(n_{ij}, a_i, b_j \vert M ) \cdot \nonumber \\&\frac{n_{ij}}{n} \log \frac{n_{ij}}{b_j} {\mathcal {P}}(n_{ij}, a_i, b_j \vert M) \ , \end{aligned}$$
(24)

where the difference \(\Delta y_i(M)\) between two consecutive values of Y can be described by a binomial distribution,

$$\begin{aligned} \Delta y_i(n_{ij}, a_i, b_j \vert M) = \frac{1}{{\mathcal {N}}} \sum _{k = 1}^{k_\text {max}} \left( {\begin{array}{c}r - k - 1\\ b_j - 2\end{array}}\right) \bigl ( y_{(i + k)} - y_{(i)} \bigr ) \ , \end{aligned}$$
(25)

where the upper limit \(k_\text {max}\) is given by \(k_\text {max} = \min (n - b_j + 1, r - i)\), \({\mathcal {N}}\) is a normalization constant,

$$\begin{aligned} {\mathcal {N}} = \sum _{k = 1}^{k_\text {max}} \left( {\begin{array}{c}r - k - 1\\ b_j - 2\end{array}}\right) \ , \end{aligned}$$
(26)

and \({\mathcal {P}}(n_{ij}, a_i, b_j \vert M)\) is the probability of encountering an associative cumulative contingency table, subject to fixed marginals, among all permutations of the two variables X and Y with \(|Y_i |= a_i\), \(i = 1, \ldots , r\) and \(|X_j |= b_j\), \(j = 1, \ldots , c\). \(n_{ij}\) is a specific realization of the joint cumulative distribution \(P(y_i, x_j)\) given the row marginal \(a_i\) and the column marginal \(b_j\). The details can be found in the appendix and are analogous to the baseline adjustment for mutual information (Vinh et al. 2009).

The empirical estimator \(\hat{{\mathcal {E}}}_0\) in Eq. 22 is required to vanish for a large number of samples \(\hat{{\mathcal {E}}}_0(Y; X)~\rightarrow ~0\) as \(n~\rightarrow ~\infty \) in case there is an exact functional dependence between X and Y (Romano et al. 2016). Further, \(\hat{{\mathcal {E}}}_0\) is required to be zero if variables are proportional to the output, \({\mathcal {E}}_0(Y; X)~\rightarrow ~0\) as \(X~\rightarrow ~Y\).

In practice, \(\hat{{\mathcal {E}}}_0\) is generally greater than zero when the number of data samples is limited and can become as large as \(\hat{{\mathcal {E}}}\) when the number of data samples is very small. \(\hat{{\mathcal {E}}}_0\) can therefore be interpreted as a correction term for comparing empirical estimates of different sets of variables on a common baseline: if the value of the correction term is large, more data samples are needed to reliably estimate the dependence between X and Y. If the value of the correction term is small, the adjusted empirical estimator indicates either a strong mutual dependence between X and Y (high \(\hat{{\mathcal {E}}}^*\)) or a weak mutual dependence, if the variables of the data set are not related to Y (low \(\hat{{\mathcal {E}}}^*\)).

For cumulative mutual information, we define the empirical estimator as follows

$$\begin{aligned} \hat{{\mathcal {I}}}^*(Y; X)&= \hat{{\mathcal {I}}}(Y; X) - \hat{{\mathcal {I}}}_0(Y; X) \ , \end{aligned}$$
(27)
$$\begin{aligned} \hat{{\mathcal {D}}}^*(Y; X)&= \frac{\hat{{\mathcal {I}}}^*(Y; X)}{\hat{{\mathcal {H}}}^*(Y)} = \hat{{\mathcal {D}}}(Y; \vec {X}) - \hat{{\mathcal {D}}}_0(Y; \vec {X}) \ , \end{aligned}$$
(28)

where \(\hat{{\mathcal {I}}}^*(Y; X)\) is the adjusted empirical cumulative mutual information, \(\hat{{\mathcal {D}}}^*(Y; X)\) is the adjusted fraction of empirical cumulative information, and \(\hat{{\mathcal {I}}}_0(Y; X)\) is the expected cumulative mutual information under the independence assumption of random variables.
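The analytic correction term of Eq. 24 replaces an explicit average over permutations. For intuition only, the baseline of Eqs. 22-23 can also be approximated by a plain Monte Carlo permutation average, as in the sketch below (our illustration; permuting x against a fixed y is one simple way to realize the relative reshuffling that the average runs over, and the number of permutations is an arbitrary choice).

```python
import numpy as np

def adjusted_estimate(y, x, estimator, n_permutations=200, seed=0):
    """Approximate Eqs. (22)-(23): subtract the mean score obtained when the pairing
    between x and y is destroyed by random permutations of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    baseline = np.mean([estimator(y, rng.permutation(x))
                        for _ in range(n_permutations)])
    return estimator(y, x) - baseline, baseline
```

Used with the Eq. 21 sketch of Sect. 4.3 as `estimator`, this yields a Monte Carlo analogue of the adjusted fraction \(\hat{{\mathcal {D}}}^*\) in Eq. 28.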

6 Total cumulative mutual information

Empirical cumulative mutual information provides a non-parametric, deterministic measure to estimate the dependence of continuous distributions. Equation 16 estimates cumulative mutual information based on cumulative probability distributions, \(P(x) = P(X \le x)\). Similarly, a measure can be instantiated for residual cumulative probability distributions, \(P'(x) := P(X \ge x) = 1 - P(X < x)\),

$$\begin{aligned} {\mathcal {D}}'(Y; X) = \frac{{\mathcal {H}}'(Y) - {\mathcal {H}}'(Y \vert X)}{{\mathcal {H}}'(Y)} \ . \end{aligned}$$
(29)

Both measures \({\mathcal {D}}(Y; X)\) and \({\mathcal {D}}'(Y; X)\) estimate the dependence between a set of variables and an output from different sides of the distribution: therefore, they set lower and upper bounds on the information they contain. As the sample size increases to infinity, both measures converge to the same value. However, due to the limited number of data samples (cf., Sect. 5), these measures are different and need to be adjusted in practice,

$$\begin{aligned} \hat{{\mathcal {D}}}^*(Y; X)&= \hat{{\mathcal {D}}}(Y; X) - \hat{{\mathcal {D}}}_0(Y; X) \ , \nonumber \\ \hat{{\mathcal {D}}}^{*\prime }(Y; X)&= \hat{{\mathcal {D}}}'(Y; X) - \hat{{\mathcal {D}}}_0'(Y; X) \ . \end{aligned}$$
(30)

The baseline adjustment turns both measures convex by relating the strength of a dependence among variables with the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010). They can therefore be used to efficiently search for the strongest mutual dependence between a set of variables and an output, e.g., by using the minimum contribution of fraction of empirical cumulative mutual information of the two measures,

$$\begin{aligned} \hat{{\mathcal {D}}}_\text {min}^*(Y; X) := \min ( \hat{{\mathcal {D}}}^*(Y; X), \hat{{\mathcal {D}}}^{*\prime }(Y; X)) \ . \end{aligned}$$
(31)

Total cumulative mutual information (TCMI) combines \(\hat{{\mathcal {D}}}^*(Y; X)\) and \(\hat{{\mathcal {D}}}^{\prime *}(Y; X)\) into a single measure. TCMI is defined as the average strength of cumulative mutual dependence between a set of variables X and an output Y,

$$\begin{aligned} \langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle := \langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle - \langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \ , \end{aligned}$$
(32)

where

$$\begin{aligned}&\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle = \frac{1}{2} \left[ \hat{{\mathcal {D}}}(Y; X) + \hat{{\mathcal {D}}'}(Y; X) \right] \nonumber \\&\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle = \frac{1}{2} \left[ \hat{{\mathcal {D}}}_0(Y; X) + \hat{{\mathcal {D}}}_0'(Y; X) \right] \ . \end{aligned}$$
(33)
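Given an adjusted estimator for the \(P(Z \le z)\) statistics (for instance, the permutation-based sketch in Sect. 5), the residual counterpart can be obtained by applying the same estimator to the negated data, since negation reverses the order of the values and thereby turns \(P(Z \le z)\) into \(P(Z \ge z)\) statistics. A minimal sketch of Eqs. 32-33 under this assumption (names ours):

```python
import numpy as np

def tcmi(y, x, adjusted_estimator):
    """<D*_TCMI(Y;X)> = 0.5 * (D*(Y;X) + D*'(Y;X)), cf. Eqs. (32)-(33).
    `adjusted_estimator(y, x)` is assumed to return (D - D_0, D_0) as in Eq. (28)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    d_star, _ = adjusted_estimator(y, x)         # based on P(Z <= z)
    d_star_res, _ = adjusted_estimator(-y, -x)   # residual variant via negated data
    return 0.5 * (d_star + d_star_res)
```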

7 Feature selection

Feature selection (Eq. 1) is an optimization problem that either requires a convex dependence measure or additional criteria to judge the optimality of a feature set (Yu and Príncipe 2019). Measures based on (cumulative) mutual information do not meet either requirement, but an adjusted measure such as TCMI does.

As already mentioned in the introduction, the optimal search strategy (subset selection) of k features from an initial set of variables \(\vec {X} = \{ X_1, \ldots , X_d \}\) is a combinatorial and exhaustive search procedure that is only applicable to low-dimensional problems. An efficient alternative to the exhaustive search is the (depth-first) branch-and-bound algorithm (Land and Doig 1960; Narendra and Fukunaga 1977; Clausen 1999; Morrison et al. 2016). The branch-and-bound algorithm guarantees to find an optimal set of feature variables without evaluating all possible subsets. The performance depends crucially on the variables of a data set and the maximum strength of the mutual dependence between a set of variables and an output. It may be that an output is only weakly related to the variables in the data set, making it necessary to repeat the feature selection with a different set of variables. It may also be that prior knowledge of potential mutual dependences is available, which speeds up the feature selection (e.g., that only \(m < k\) of d variables X are related to Y and therefore not all combinations need to be implicitly enumerated).

The branch-and-bound algorithm maximizes an objective function \({\mathcal {Q}}^*: \vec {X}' \rightarrow {\mathbb {R}}\) defined on a subset of variables \(\vec {X}' \subseteq \vec {X}\) by making use of the monotonicity condition of a feature-selection criterion, \({\mathcal {Q}}: \vec {X}' \rightarrow {\mathbb {R}}\), and a bounding criterion, \(\bar{{\mathcal {Q}}}: \vec {X}' \rightarrow {\mathbb {R}}\). The monotonicity condition requires that feature subsets \(\vec {X}_1\), \(\vec {X}_2\), \(\ldots \), \(\vec {X}_k\), \(k=1,\ldots ,d\), obtained by sequentially adding k features from the set of variables \(\vec {X}\), satisfy

$$\begin{aligned} \vec {X}_1 \subseteq \vec {X}_2 \subseteq \cdots \subseteq \vec {X}_k \ , \qquad \vec {X}_k \subseteq \vec {X} \ , \end{aligned}$$
(34)

so that the feature-selection criterion \({\mathcal {Q}}\) and bounding criterion \(\bar{{\mathcal {Q}}}\) are monotonically increasing and decreasing respectively,

$$\begin{aligned} \begin{aligned}&{\mathcal {Q}}(\vec {X}_1) \le {\mathcal {Q}}(\vec {X}_2) \le \cdots \le {\mathcal {Q}}(\vec {X}_k) \\&\bar{{\mathcal {Q}}}(\vec {X}_1) \ge \bar{{\mathcal {Q}}}(\vec {X}_2) \ge \cdots \ge \bar{{\mathcal {Q}}}(\vec {X}_k) \ . \end{aligned} \end{aligned}$$
(35)

The branch-and-bound algorithm builds a search tree of feature subsets \(\vec {X}' \subseteq \vec {X}\) of increasing cardinality (Clausen 1999; Morrison et al. 2016) (Alg. 1 and Fig. 2). Initially, the tree contains only the empty subset (the root node). At each iteration, a limited number of (non-redundant) sub-trees are generated by augmenting the current subset with one variable \(X \in \vec {X}\) at a time and adding it to the search tree (branching step). While traversing the tree from the root down to the terminal nodes and from left to right, the algorithm keeps track of the currently best subset \(X^* := \vec {X}_k\) and of the objective-function value it yields (the current maximum). Whenever the current maximum of the objective function \({\mathcal {Q}}^*\) exceeds the bounding criterion \(\bar{{\mathcal {Q}}}\) of a sub-tree, either because the bounding criterion decreases with increasing cardinality (Eq. 35) or because it is lower than the current maximum value of the objective function, the sub-tree can be pruned and its computations skipped (bounding step). Once the entire tree has been examined, the search terminates and the optimal set of variables is returned, along with a ranking of sub-optimal variable sets in descending order of their objective-function values.

Algorithm 1

We set the objective function to be

$$\begin{aligned} {\mathcal {Q}}^* = {\mathcal {D}}_\text {TCMI}^*(Y; X) \ , \end{aligned}$$
(36)

the criterion function to be

$$\begin{aligned} {\mathcal {Q}} = \min ({\mathcal {D}}(Y; X), {\mathcal {D}}'(Y; X)) \ , \end{aligned}$$
(37)

and, as a pruning rule, the bounding criterion to be (cf., Eq. 31),

$$\begin{aligned} \bar{{\mathcal {Q}}} = 1 - \min (\hat{{\mathcal {D}}}_0(Y; X), \hat{{\mathcal {D}}}_0'(Y; X)) \ . \end{aligned}$$
(38)

Proofs of the monotonicity conditions for \({\mathcal {Q}}\) and \(\bar{{\mathcal {Q}}}\) follow similar arguments as for Shannon entropy (Mandros et al. 2017) and are provided in the appendix.
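A compact depth-first sketch of the pruning logic of Alg. 1 is given below; `objective` plays the role of \({\mathcal {Q}}^*\) and `bound` that of \(\bar{{\mathcal {Q}}}\), both supplied by the caller (e.g., the TCMI quantities of Eqs. 36-38). This is an illustration of the search scheme, not the reference implementation.

```python
def branch_and_bound(features, objective, bound):
    """Depth-first branch and bound over feature subsets (subsets are tuples of names).
    `bound` must be a monotonically decreasing upper bound on `objective` (Eq. 35)."""
    best_subset, best_score = None, float("-inf")

    def expand(subset, remaining):
        nonlocal best_subset, best_score
        if subset:                                   # skip the empty root subset
            score = objective(subset)
            if score > best_score:                   # update the current maximum
                best_subset, best_score = subset, score
        for k, feature in enumerate(remaining):
            child = subset + (feature,)
            # Bounding step: if no superset of `child` can beat the current
            # maximum, the whole sub-tree is pruned.
            if bound(child) <= best_score:
                continue
            expand(child, remaining[k + 1:])         # branching step

    expand((), tuple(features))
    return best_subset, best_score

# Toy usage with a hypothetical scoring function `q` and a trivial bound:
# best, score = branch_and_bound(["X1", "X2", "X3"], objective=q, bound=lambda s: 1.0)
```

Because \(\bar{{\mathcal {Q}}}\) decreases with increasing cardinality, a sub-tree whose bound already falls below the current maximum cannot contain a better subset, which is exactly the condition checked before recursing.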

Fig. 2

Example of the depth-first tree-search strategy of the branch-and-bound algorithm (Land and Doig 1960; Narendra and Fukunaga 1977; Clausen 1999; Morrison et al. 2016) used to search for the optimal subset of features. Shown are the tree traversal, going from top to bottom and left to right (dashed arrows), the estimated fraction of total cumulative information (objective function, inside the circles), the subsets of features (labels at the bottom of the circles), the fraction of cumulative information (criterion function, first number to the right or left of the circles), and the expected fraction of cumulative information contribution (bounding function, second number to the right or left of the circles). Capital roman numerals indicate applied pruning rules or updates of the current maximum objective function: whenever the objective function in some internal node exceeds the bounding function of a sub-tree (I), or it decreases (II) – either due to the condition of Eq. 35 or because the bounding function is lower than the current maximum value of the objective function (III) – the sub-tree can be pruned and its computations skipped. On termination of the algorithm, the bound contains the optimum objective-function value (IV)

7.1 Complexity Analysis

The computational complexity of the branch-and-bound algorithm is largely determined by two factors: the branching factor B and the depth D of the tree (Morrison et al. 2016). The branching factor is the maximum number of variable combinations generated at each level l of the tree and can be estimated by the central binomial coefficient, \(B \le \max _{l = 1, \ldots , D} \left( {\begin{array}{c}d\\ l\end{array}}\right) \approx \left( {\begin{array}{c}d\\ d/2\end{array}}\right) \), if \(\vec {X}\) has d variables. The depth D of the tree is given by the largest cardinality of a variable set, represented as the longest path in the tree from the root to a terminal node. The ranking of the variable sets involves \({\mathcal {O}}((n \log n)^d)\) sorting operations when all variables are relevant. Thus, any branch-and-bound implementation has a worst-case computational time complexity of \({\mathcal {O}}(M \cdot B^D)\), where M is the time needed to evaluate the feature-selection criterion for a combination of variables in the tree.

In the worst case, for n data samples and d variables, cumulative mutual information requires \({\mathcal {O}}(n^d)\) evaluations of the integral and \({\mathcal {O}}(n^2)\) operations to calculate the baseline-adjustment term. Thus, TCMI has time complexity \(M \sim {\mathcal {O}}(n^d)\), and a feature-subset search in the current implementation suffers from the curse of dimensionality (Koller and Sahami 1996).

As a result, the total time complexity of the feature-selection algorithm is non-deterministic polynomial-time (NP)-hard and, in general, the strategy of examining all possible subsets is not viable. In the vast majority of cases, however, dependencies are relatively simple relationships involving only a small number of features. In addition, feature selection can be restricted at any time to subsets of variables whose cardinality is less than or equal to a predefined dimensionality. The time complexity is then greatly reduced and the feature selection can be solved in polynomial time. Whether these assumptions apply to a given data set must be assessed case by case. However, indicators such as the rate at which TCMI approaches its maximum value or the estimated strength of the relationships are helpful in exploratory data analysis when searching for the relevant features of a data set.

8 Experiments

To demonstrate the performance of TCMI in different settings, we first consider generated data and show that our method can detect both univariate and multivariate dependences. Then, we discuss applications of TCMI on data sets from the KEEL and UCI Machine Learning Repository (Alcalá-Fdez et al. 2009, 2011; Dua and Graff 2017) and a typical scenario from the materials-science community, namely to predict the crystal structure of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017).

8.1 Case study on generated data

In a number of experiments, we test the theoretical properties of TCMI, i.e., its invariance properties and performance statistics. We also study an exemplified feature-selection task to find a bivariate normal distribution embedded in a multi-dimensional space.

8.1.1 Interpretability of TCMI

Table 3 Dependence scores, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle \), between a linear data distribution and a linear, exponential, step, sawtooth, and uniform (random) distribution. The data sample size is \(n = 200\). Step-like distributions were generated by discretizing the linear distribution with each value repeated r times. Sawtooth-like distributions have 2, 4, and 8 steps per ramp and \(\lceil n / level \rceil \) ramps in total. The table also shows the squared Spearman's rank correlation coefficient \(\rho ^2\), the total cumulative mutual information contributions, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle \) and \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \), and the scores of similar dependence measures such as CMI (Nguyen et al. 2013), MAC (Nguyen et al. 2014b), UDS (Nguyen et al. 2016; Wang et al. 2017), and MCDE (Fouché and Böhm 2019; Fouché et al. 2021)

In the first experiment, we use TCMI, CMI, MAC, UDS, and MCDE to estimate the dependence between a linear data distribution Y of size \(n = 200\) and different distributions X as features (Table 3). Besides linear, exponential, and constant distributions (zero vector), we consider stepwise distributions generated by discretizing a linear distribution, where each value is repeated 2, 4, and 8 times. Furthermore, we consider uniform (random) and sawtooth distributions with 2, 4, and 8 steps per ramp. Results show that (i) the TCMI score increases nonlinearly with the similarity between a variable and an output, (ii) TCMI is zero for a constant distribution, and (iii) TCMI approaches one for an exact dependence (see also Fig. 3). CMI, MAC, UDS, and MCDE perform similarly well, but they seem to be less sensitive than TCMI in assessing the strength of a mutual dependence. In particular, the strength of a dependence estimated with CMI, MAC, UDS, and MCDE does not change with the shape of a distribution (i.e., with different cumulative probability distributions such as the step-like distributions). MCDE does not differentiate between a linear and a constant distribution, while UDS seems to be limited and does not reach the maximum score even in the presence of an exact dependence.

Due to the limited availability of data samples, a random distribution has a higher TCMI, MAC, and MCDE score, i.e., a stronger dependence, than a sawtooth distribution, in agreement with Spearman's rank coefficient of determination \(\rho ^2\) (Spearman 1904). It should be noted that the baseline adjustment \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0} \rangle \) for a random variable is larger than for any other dependence tested in Table 3. A large baseline adjustment results in smaller TCMI values, such that it is unlikely that a random variable will be part of any selected feature set. However, if the dependences are of the same strength as spurious dependences induced by random variables, TCMI may select variables that are not related to an output.

8.1.2 Properties of the baseline correction term

Fig. 3

Expected empirical cumulative mutual information, \(\langle \hat{{\mathcal {D}}}_0(Y; X) \rangle \), with respect to the number of data samples. Shown are the dependency (solid line) and a heuristically derived analytic functional relationship (dashed line)

In the second experiment (Fig. 3), we take a closer look at the baseline-adjustment term, which decreases monotonically with the number of data samples. The baseline adjustment is given by the expected empirical cumulative mutual information (Eqs. 24 and 28). In all our test cases, the expected empirical cumulative mutual information follows a clear downward trend with increasing sample size. For linear dependencies, for example, we found that the baseline adjustment roughly follows a \(\langle \hat{{\mathcal {D}}}^{(\prime )}_0(Y; X) \rangle \sim n^{-2/3}\) scaling law that vanishes as \(n \rightarrow \infty \) (Fig. 3). However, the exact scaling behavior varies in general, depending on the presence of duplicate values of each variable in a data set.

8.1.3 Invariance properties of TCMI

In the third experiment, we investigate the invariance properties of TCMI as compared to CMI, MAC, UDS, and MCDE. To this end, we generated random distributions X of different sizes (50, 100, 200, and 500) and reparameterized variables by applying positive monotonic transformations (cf., Sect. 3.2). Table 4 summarizes the results of comparing the dependence scores between a linear distribution and reparameterized variables, e.g., between \(\hat{{\mathcal {D}}}(Y; X)\) and \(\hat{{\mathcal {D}}}(Y; {\mathcal {T}}(X))\), where monotonic transformations \({\mathcal {T}}(X) = a X^k + b\) with \(a, b, k \in {\mathbb {R}}\) and compositions \({\mathcal {T}}(X) = {\mathcal {T}}_1(X) \pm \cdots \pm {\mathcal {T}}_m(X)\) were explored.

Table 4 Overview of the invariance properties of the dependence measures: cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)), Monte Carlo dependency estimation (MCDE: Fouché and Böhm (2019), Fouché et al. (2021)), and total cumulative mutual information (TCMI). Checkmarks and crosses in parentheses denote invariance in terms of probabilistic tolerance

By construction, TCMI is invariant under positive monotonic transformations (Eq. 12). Our experiments show that TCMI is indeed both scale and permutation invariant. For CMI, MAC, and UDS, the order of the variables plays a crucial role in determining which permutation of a variable achieves either the highest dependence score (CMI, UDS) or the best discretization (MAC). Hence, the deterministic dependence measures CMI and UDS, to which TCMI is most closely related, are neither scale nor permutation invariant. MAC is scale invariant, but not permutation invariant. In contrast, the stochastic dependence measure MCDE is scale and permutation invariant, but only within a probabilistic tolerance (i.e., dependence scores vary between different runs of a program within a certain threshold).

8.1.4 Baseline adjustment of TCMI

Fig. 4 Fraction of cumulative information scores against increasing dimensionality for \(\{Y, \vec {X}\}\), using 10, 50, 100, and 500 data samples generated from mutually independent uniform distributions \(\{Y, X_1, \ldots , X_4 \}\). Contributions of the average fraction of total cumulative mutual information, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle \) and \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \), are shown on either side of the plot, and the resulting score \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle \) as points. Error bars indicate standard deviations from repeating the experiment 50 times. Since \(\vec {X}\) and Y are independent, the average total cumulative mutual information should be constant across subsets of features, independent of sample size and subset dimensionality. While \(\langle \hat{{\mathcal {D}}}_\text {TCMI}(Y; X) \rangle \) increases with the cardinality of the feature subset and \(\langle \hat{{\mathcal {D}}}_\text {TCMI, 0}(Y; X) \rangle \) decreases, \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; X) \rangle \) is approximately constant for a wide range of data samples (\(10\ldots 500\)) and subset dimensionalities (\(1\ldots 4\)). The crosses represent the deviation of TCMI from the constant baseline. By enlarging the feature subset with a shuffled version of the same variable, TCMI can be corrected. For comparison, the dependence scores of the other investigated measures against increasing dimensionality – cumulative mutual information (CMI: Nguyen et al. (2013)), multivariate maximal correlation analysis (MAC: Nguyen et al. (2014b)), and universal dependency analysis (UDS: Nguyen et al. (2016), Wang et al. (2017)) – are also shown

In the fourth experiment, we investigate the necessity of a baseline adjustment to estimate mutual dependences (Sect. 5). To this end, we generated mutually independent and uniform distributions \(Z = \{ Y, X_1, \ldots , X_d \}\) of dimensionality d with sample sizes 10, 50, 100, and 500. We compared TCMI, CMI, MAC, UDS, and MCDE across subsets of variables of different subspace dimensionality while repeating the experiment 50 times. Figure 4 summarizes the results.

By definition, the score of a dependence measure for independent random variables must be zero, independent of the sample size (cf., Sect. 5). However, none of the investigated dependence measures score zero for all sample sizes. This is because sampled data are rarely exactly uniform. In practice, due to random sampling, we expect constant scores with \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; \vec {X}) \rangle \rightarrow 0.5\) and \(\hat{{\mathcal {D}}}_\text {MCDE}(Y; \vec {X}) \rightarrow 0.5\) as \(n \rightarrow \infty \), independent of the dimensionality of \(\vec {X} = \{ X_1, \ldots , X_d \}\), if none of the variables are dependent on Y, and zero scores for CMI, MAC, and UDS.

The dependence scores of TCMI and MCDE are approximately constant for a wide range of data samples (\(10\ldots 500\)) and subset dimensionalities (\(1\ldots 4\)) and, as expected, approach \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^* \rangle \rightarrow 0.5\) and \(\hat{{\mathcal {D}}}_\text {MCDE} \rightarrow 0.5\) as \(n \rightarrow \infty \). In contrast, CMI, MAC, and UDS show a clear bias towards larger dependence scores at larger subset cardinalities. Furthermore, their scores are nonzero between mutually independent random variables even at larger sample sizes. In addition, their dependence scores decrease with more data samples, indicating that these measures are unreliable for estimating the strength of mutual dependences.

Compared to MCDE, the measures CMI, MAC, UDS, and TCMI underestimate dependences in the one-dimensional case when noise is present in the data. By enlarging the subset with a shuffled version of the same variable, thereby simulating a variable with noise, these measures can be corrected (Fig. 4). As a result, both the corrected version of TCMI and MCDE provide a clear mechanism for comparing dependence scores across different subsets of variables, independent of the number of data samples.
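A minimal sketch of this correction is shown below, assuming a multivariate score function `score_fn` that accepts a two-dimensional feature array (a placeholder, not the reference implementation).

```python
import numpy as np

def corrected_score(y, x, score_fn, seed=0):
    """Score a single feature after augmenting it with a shuffled copy of itself.

    The permuted copy mimics a pure-noise variable; `score_fn(y, X)` is a
    placeholder for a multivariate dependence measure such as TCMI.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    shuffled = rng.permutation(x.ravel()).reshape(-1, 1)
    return score_fn(y, np.hstack([x, shuffled]))
```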

8.1.5 Bivariate normal distribution

Finally, we consider a simple feature-selection task with known ground truth, namely to find a bivariate normal distribution embedded in a high-dimensional space. For this purpose, we generated a bivariate normal distribution of size \(n = 500\) from features x and y, added further variables drawn from normal, exponential, logistic, triangular, uniform, Laplace, Rayleigh, and Weibull distributions, all with zero mean (\(\mu = 0\)) and unit scale (\(\sigma = 1\)), and augmented the feature space as described in Sect. 3.2. In terms of the Pearson or Spearman correlation coefficient, none of the features have coefficients of determination higher than \(1\%\) with respect to the bivariate normal distribution. Thus, without knowing the ground truth, the data set appears to be uncorrelated. However, since the ground truth is known, there are exactly two features, namely x and y, that completely describe the bivariate normal distribution of the data set.
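A minimal sketch of generating such a data set is given below; the covariance matrix of the bivariate normal is taken from Fig. 5, and the definition of the output as well as the feature-space augmentation of Sect. 3.2 are omitted here.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Correlated pair (x, y): the ground-truth bivariate normal distribution.
# The covariance matrix matches the one shown in Fig. 5.
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
x, y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n).T

# Distractor features drawn from various distributions, shifted to
# approximately zero mean and unit scale where applicable.
distractors = {
    "normal":      rng.normal(0.0, 1.0, n),
    "exponential": rng.exponential(1.0, n) - 1.0,
    "logistic":    rng.logistic(0.0, 1.0, n),
    "triangular":  rng.triangular(-1.0, 0.0, 1.0, n),
    "uniform":     rng.uniform(-1.0, 1.0, n),
    "laplace":     rng.laplace(0.0, 1.0, n),
    "rayleigh":    rng.rayleigh(1.0, n) - np.sqrt(np.pi / 2.0),
    "weibull":     rng.weibull(1.5, n) - 0.9,  # roughly centered
}
features = {"x": x, "y": y, **distractors}
```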

Subspace search

In order to find the two most relevant features in the high-dimensional data set, a subspace search is performed up to subset dimensionality two. Feature selection is performed for four sets of 50, 100, 200, and 500 data samples (Fig. 5). Results are reported in Table 5.

Overall, almost all dependence measures find at least one of the two relevant features x and y, both of them, or at least similar distributions, such as the normal distribution. However, the scores and subset sizes of relevant features decrease with larger sample sizes for MAC and UDS, while CMI identifies exact dependences even between distributions where no dependence exists, e.g., between a Laplacian and the bivariate normal distribution. In contrast, MCDE robustly finds one of the relevant features x or y, but never identifies both as jointly relevant. TCMI also finds one or both of the two relevant features, but its scores and the assessed relevance depend more strongly on the sample size. For sample sizes greater than 200, TCMI is the only dependence measure that correctly identifies the optimal feature subset to be \(\{x,y\}\). Still, TCMI scores are lower than those of the other dependence measures, even though the score increases for larger sample sizes.

Fig. 5 Bivariate normal probability distribution with mean \(\mu = (0, 0)\) and covariance matrix \(\Sigma = [1, 0.5; 0.5, 1]\). Shown is a scatter plot with 50, 100, 200, and 500 data samples, its cumulative probability distributions, \(P(Z \le z)\), \(Z \in \{ X, Y \}\), and contour lines of equal probability densities \(\in \{ 0.01, 0.02, 0.05, 0.08, 0.13 \}\)

Table 5 Topmost feature subsets, in order of identification, from the bivariate normal distribution data set with 50, 100, 200, and 500 data samples, restricted to subset dimensionality \(\le 2\) and selected by the following dependence measures: total cumulative mutual information (TCMI), cumulative mutual information (CMI), multivariate maximal correlation analysis (MAC), universal dependency analysis (UDS), and Monte Carlo dependency estimation (MCDE)

Statistical power analysis

To assess the robustness of the dependence measures, we performed a statistical power analysis for CMI, MAC, UDS, MCDE, and TCMI, adding Gaussian noise with increasing standard deviation \(\sigma \) (Nguyen et al. 2014b, 2016; Fouché and Böhm 2019; Fouché et al. 2021). We considered \(5+1\) noise levels, distributed linearly from 0 to 1, inclusive. We computed the score of the bivariate normal distribution for each dependence measure \(\Lambda \in \{\)CMI, MAC, UDS, MCDE, TCMI\(\}\), i.e., \(\langle \Lambda (Y; X) \rangle _\sigma \), with \(n = 500\) data samples and subset \(\{x,y\}\), and compared it with the score of independently drawn random data samples, \(\langle \Lambda (Y; I) \rangle _0\), of the same size (\(n = 500\)) and dimension (\(d = 1+2\)). The power of a dependence measure \(\Lambda \) was then evaluated as the probability P of a dependence score being larger than the \(\gamma \)-th percentile of the score with respect to independence I,

$$\begin{aligned} \text {Power}_{\Lambda , \sigma }^\gamma (Y; X) := P \left( \langle \Lambda (Y; X) \rangle _\sigma > \langle \Lambda (Y; I) \rangle _0^\gamma \right) \ . \end{aligned}$$
(39)

Essentially, the power of a dependence measure quantifies the contrast, i.e., the difference between dependence X and independence I at noise level \(\sigma \) with \(\gamma \%\) confidence. It is a relative statistical measure and depends on the strength of the dependence. Therefore, dependence strengths that are close to independence are likely to be more sensitive to noise than stronger dependences.

For our experiments, we set \(\gamma = 95\%\) and repeated the experiment 500 times. At each iteration, we shuffled the data samples, computed the scores \(\langle \Lambda (Y; X) \rangle _\sigma \) and \(\langle \Lambda (Y; I) \rangle _0^\gamma \) for every dependence measure at noise level \(\sigma \), and recorded the average and standard deviation of the respective dependence measures. The results of the statistical power analysis, the average score of the dependence measures and independence as well as the contrast are summarized in Fig. 6.
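A sketch of how the power in Eq. 39 can be evaluated from the recorded scores is given below; the arrays of scores are assumed to come from the repeated shuffling described above.

```python
import numpy as np

def statistical_power(scores_noisy, scores_independent, gamma=95.0):
    """Empirical power in the sense of Eq. (39).

    `scores_noisy` holds dependence scores of the noisy dependent data over
    many repetitions; `scores_independent` holds scores of independently
    drawn data of the same size. The power is the fraction of noisy scores
    exceeding the gamma-th percentile of the independence scores.
    """
    threshold = np.percentile(scores_independent, gamma)
    return float(np.mean(np.asarray(scores_noisy) > threshold))
```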

With the exception of MAC, the statistical power of all dependence measures tends to be constant or to decrease with increasing noise level. Remarkably, MCDE is the only dependence measure that has a high statistical power, offers a high contrast, and assesses a strong dependence. In particular, the contrast of MCDE remains excellent even at noise levels at which the contrast of TCMI has already degraded. Although MAC and CMI also have high statistical power, their contrasts and dependence scores are low.

While a low contrast makes it difficult to identify subsets of variables related to an output, a low dependence score must be viewed relative to the dependence scores of all other possible feature subsets: if a subset has the highest score, it is still the subset most strongly related to the output for the given dependence measure.

In our analysis, MCDE has the highest scores, followed by MAC, TCMI, and CMI. UDS completely fails to detect dependences, in line with previous observations (Fouché and Böhm 2019; Fouché et al. 2021). In general, TCMI depends on the number of samples (Eq. 27) and its contrast generally increases with more data samples. However, TCMI appears to be more sensitive and, therefore, less robust than the other dependence measures. An in-depth analysis shows that this sensitivity is mainly due to the moderate strength of the dependence; the statistical power is much more robust for the stronger dependences in other data sets we tested.

Fig. 6 Statistical power analysis with \(95\%\) confidence of dependence measures at different noise levels \(\sigma = 0\ldots 1\): total cumulative mutual information (TCMI), cumulative mutual information (CMI), multivariate maximal correlation analysis (MAC), universal dependency analysis (UDS), and Monte Carlo dependency estimation (MCDE). The diagrams also show the trends in the dependence scores of the optimal feature subset \(\{x,y\}\) of the bivariate normal distribution

8.2 Case study on real-world data

Next, we study selected real-world data sets from the KEEL and UCI Machine Learning Repositories (Alcalá-Fdez et al. 2009, 2011; Dua and Graff 2017), and highlight one typical, though by no means exclusive, application of TCMI in the materials-science community, namely the crystal-structure prediction of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017).

8.2.1 KEEL and UCI regression data sets

We investigate how TCMI and similar dependence measures perform on real-world problems designed for multivariate regression tasks. Unfortunately, in practice, the relevant features of a data set are rarely known. We therefore compare our results only with data sets whose relevant features have been analyzed and reported. All in all, we consider one simulated data set from the KEEL database (Alcalá-Fdez et al. 2009, 2011) and two data sets from the UCI Machine Learning Repository (Dua and Graff 2017):

  1. Friedman #1 regression (Friedman 1991)

    This data set is used for modeling computer outputs. Inputs \(X_1\) to \(X_5\) are independent features that are uniformly distributed over the interval [0, 1]. The output Y is created according to the formula:

    $$\begin{aligned} Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \epsilon \, \end{aligned}$$
    (40)

    where \(\epsilon \) is a standard normal deviate N(0, 1). In addition, the data set has five non-relevant variables \(X_6\ldots X_{10}\) that are i.i.d. random samples. Further, we enlarge the number of features by adding four variables \(X_{11}\ldots X_{14}\), each very strongly correlated with \(X_1\ldots X_4\) and generated by \(f(x) = x + N(0, 0.01)\); a sketch of this construction is given after this list.

  2. Concrete compressive strength (Yeh 1998)

    The aim of this data set is to predict the compressive strength of high performance concrete. Compressive strength is the ability of a material or structure to withstand loads that tend to reduce size. It is a highly nonlinear function of age and ingredients. These ingredients include cement, water, blast furnace slag (a by-product of iron and steel production), fly ash (a coal combustion product), superplasticizer (additive to improve the flow characteristics of concrete), coarse aggregate (e.g., crushed stone or gravel), and fine aggregate (e.g., sand).

  3. Forest fires (Cortez and Morais 2007)

    This data set focuses on wildfires in the Montesinho Natural Park, which is located at the northern border of Portugal. It includes features such as local coordinates x and y where a fire occurred, the time (day, month, and year), temperature (temp), relative humidity (RH), wind, rain, and derived forest-fire features such as fine-fuel moisture code (FFMC), duff moisture code (DMC), drought code (DC), and initial spread index (ISI) to estimate the propagation speed of fire.
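As referenced in the description of the Friedman #1 data set above, the following sketch generates the augmented feature matrix. The sample size, the distribution of the non-relevant variables, and the interpretation of N(0, 0.01) as a standard deviation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # sample size (assumption; not fixed by the description above)

# Relevant inputs X1..X5, uniform on [0, 1], and the response of Eq. (40)
X = rng.uniform(0.0, 1.0, size=(n, 5))
eps = rng.normal(0.0, 1.0, size=n)
y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
     + 20.0 * (X[:, 2] - 0.5) ** 2
     + 10.0 * X[:, 3] + 5.0 * X[:, 4] + eps)

# Non-relevant variables X6..X10: i.i.d. random samples (assumed uniform here)
noise = rng.uniform(0.0, 1.0, size=(n, 5))

# Strongly correlated copies X11..X14 of X1..X4, f(x) = x + N(0, 0.01)
# (0.01 is treated as the standard deviation here)
copies = X[:, :4] + rng.normal(0.0, 0.01, size=(n, 4))

features = np.hstack([X, noise, copies])  # 14 candidate features in total
```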

For each data set, we performed feature selection using all aforementioned dependence measures (TCMI, CMI, MAC, UDS, MCDE) and compared resulting feature subsets with potentially relevant features reported from the original references. Results are summarized in Table 6.

Table 6 Relevant feature subsets for selected data sets from the KEEL database (Alcalá-Fdez et al. 2009, 2011) and the UCI Machine Learning Repository (Dua and Graff 2017), designed for multivariate regression tasks and feature selection, as identified by total cumulative mutual information (TCMI), cumulative mutual information (CMI), multivariate maximal correlation analysis (MAC), universal dependency analysis (UDS), and Monte Carlo dependency estimation (MCDE). For comparison, potentially relevant feature subsets mentioned in the references are also included

Our results show that, even in the simplest example of the Friedman regression data set, two dependence measures show extreme behavior: UDS selects no variables and MAC selects all variables of the data set, so neither performs any feature selection at all. Both dependence measures not only completely fail to identify the actual dependences of the Friedman regression data set, but also fail on the concrete compressive strength and forest fires data sets. It is therefore likely that these dependence measures report incorrect results on other data sets as well and are thus inappropriate for feature-selection and dependence-assessment tasks.

CMI and MCDE partially agree with the potentially relevant features from the respective references: they may therefore be useful when low-dimensional feature subsets need to be identified. In contrast, TCMI effectively selects all relevant variables of the Friedman regression data set. However, TCMI is not free from selecting non-relevant variables in sub-optimal feature subsets, as it reports \(X_7\) or \(X_8\) in the fourth- or fifth-best feature subset. Therefore, dependence scores need to be interpreted relative to the baseline adjustment term: the lower the dependence scores are, the more likely it is that non-relevant variables appear in the subsets (cf., Sect. 8.1.1).

The feature subsets found by TCMI for the Friedman regression data set as well as for the concrete compressive strength data set have high dependence scores. They agree well with the relevant features reported in the references, even though TCMI misses slag in the concrete compressive strength example: it is likely that variables such as fine and coarse aggregate or superplasticizer serve as a substitute for slag due to the limited number of data samples. However, we cannot test this assumption, as all data samples were used to compute the dependence scores and no curated test sets are available for further tests.

In the forest-fires data set, temperature and relative humidity as well as duff moisture and drought code are reported not only by TCMI, but also by CMI and MCDE. It is therefore likely that these variables are indeed relevant for forest-fire predictions, although none of them were mentioned in the reference (Cortez and Morais 2007). Apart from weather conditions, TCMI also includes some of the derived forest-fire variables such as the duff moisture code (DMC) and drought code (DC) – these variables are indirectly related to precipitation and are used to estimate the moisture content of shallow and deep soil layers. Admittedly, the TCMI scores are moderate, which indicates difficulties in assessing the mutual dependences between the set of features and the burnt area of forest fires as a whole. A detailed analysis shows that, although forest fires are devastating, they are isolated events – too few to reliably identify the precursors of wildfires from the investigated data set.

8.2.2 Octet-binary compound semiconductors

Our last example is dedicated to a typical, well-characterized, and canonical materials-science problem, namely the prediction of the crystal-structure stability of octet-binary compound semiconductors (Ghiringhelli et al. 2015, 2017). Octet-binary compound semiconductors are materials consisting of two elements from groups I/VII, II/VI, III/V, or IV/IV, leading to a full valence shell. They can crystallize in the rock salt (RS) or zinc blende (ZB) structure, i.e., with either ionic or covalent bonding, and were already studied in the 1970s (Van Vechten 1969; Phillips 1970), followed by further studies (Zunger 1980; Pettifor 1984) and recent work using machine learning (Saad et al. 2012; Ghiringhelli et al. 2015, 2017; Ouyang et al. 2018).

The data set consists of 82 materials with two atomic species in the unit cell. The objective is to accurately predict the energy difference \(\Delta E\) between the RS and ZB structures based on 8 electro-chemical atomic properties per atomic species A/B (16 in total), such as the atomic ionization potential \(\text {IP}\), the electron affinity \(\text {EA}\), the energies of the highest-occupied and lowest-unoccupied Kohn-Sham levels, \(\text {H}\) and \(\text {L}\), and the expectation values of the radial probability densities of the valence s-, p-, and d-orbitals, \(r_s\), \(r_p\), and \(r_d\), respectively (Ghiringhelli et al. 2015). As a reference, we added the Mulliken electronegativity \(\text {EN} = -(\text {IP}+\text {EA})/2\) to the data set and also studied the best two features from the publication (Ghiringhelli et al. 2015),

$$\begin{aligned} D_1 = \frac{\text {IP}(B) - \text {EA}(B)}{r_p(A)^2} , \quad \ D_2 = \frac{|r_s(A) - r_p(B) |}{\exp [r_s(A)]} \ , \end{aligned}$$
(41)

as known dependences to show the consistency of the method as well as to probe TCMI with linearly dependent features (Ghiringhelli et al. 2015).
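For completeness, the two descriptors of Eq. 41 are straightforward to compute from the atomic properties. The helper below is illustrative only; its argument names are not tied to the data set's column labels.

```python
import numpy as np

def descriptors(ip_b, ea_b, rs_a, rp_a, rp_b):
    """Compute D1 and D2 of Eq. (41) from per-material atomic properties.

    Arguments are array-like atomic properties of species A and B;
    the names are illustrative only.
    """
    ip_b, ea_b, rs_a, rp_a, rp_b = map(np.asarray, (ip_b, ea_b, rs_a, rp_a, rp_b))
    d1 = (ip_b - ea_b) / rp_a ** 2
    d2 = np.abs(rs_a - rp_b) / np.exp(rs_a)
    return d1, d2
```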

Table 7 Relevant feature subsets for the octet-binary compound semiconductors data set as identified by total cumulative mutual information (TCMI), showing the most relevant feature subsets of each cardinality. For comparison, the best feature subsets based on \(D_1 = D_1(\text {IP}(B),\text {EA}(B), r_p(A))\) and \(D_2 = D_2(r_s(A), r_p(B))\) from reference (Ghiringhelli et al. 2015) (entries marked with a star \(\star \)) are also listed. Bold feature subsets mark subsets with interchangeable variables \(\text {EN}\) and \(\text {IP}\). The table also shows statistics of machine-learning models constructed using the gradient boosting decision tree (GBDT) algorithm (Friedman 2001) with 10-fold cross-validation: root-mean-squared error (RMSE), mean absolute error (MAE), maximum absolute error (MaxAE), and Pearson coefficient of determination (\(r^2\)). Units are electronvolts (eV)

To predict the energy difference \(\Delta E\) between the RS and ZB structures, we performed a subspace search with TCMI to identify the subset of features that exhibits the strongest dependence on \(\Delta E\). The results are summarized in Table 7. In total, the strongest dependence on \(\Delta E\) was found for six features from both atomic species, A and B, before the TCMI score decreased again with seven features.

The results reveal several feature subsets that are found to be optimal for different cardinalities. We note that TCMI never selects the Mulliken electronegativity \(\text {EN}\) together with either the electron affinity \(\text {EA}\) or the ionization potential \(\text {IP}\) of the same atomic species. We also note that \(\text {EN}\) can be replaced by \(\text {IP}\) (see the bold feature subsets in Table 7). However, \(\text {EN}\) cannot be replaced by \(\text {EA}\), as \(\text {EN}\) is more strongly linearly correlated with \(\text {IP}\) than with \(\text {EA}\), which results in slightly smaller TCMI values (by at least 0.02 in the case of the optimal subsets, not shown in the table). The results therefore corroborate not only the functional relationship between \(\text {EN}\), \(\text {IP}\), and \(\text {EA}\), but also the consistency of TCMI.

Furthermore, TCMI indicates that features such as the atomic radii \(r_s(B)\) and \(r_p(B)\) or the energies \(\text {EN}(B)\), \(\text {H}(B)\), and \(\text {IP}(B)\) of group IV to VIII elements can be used interchangeably without reducing the dependence scores. Indeed, by assessing dependences between pairwise feature combinations, TCMI identifies \(r_s(B)\) and \(r_p(B)\) as strongly dependent and \(\text {EN}(B)\), \(\text {H}(B)\), and \(\text {IP}(B)\) as strongly dependent, consistent with bivariate correlation measures such as Pearson or Spearman. In numbers, the Pearson coefficients of determination (\(r^2\)) between the atomic radii \(r_s\) and \(r_p\) are \(r^2(r_s(A), r_p(A)) = 0.94\) and \(r^2(r_s(B), r_p(B)) = 0.99\), and the Pearson coefficients of determination between the Mulliken electronegativity and the ionization potential or the highest-occupied Kohn-Sham level are \(r^2(\text {EN}(B), \text {IP}(B)) = 0.96\) and \(r^2(\text {EN}(B), \text {H}(B)) = 0.99\), respectively. These findings illustrate that TCMI assigns similar scores to collinear features.

Features \(D_1\) and \(D_2\) (Eq. 41) from the reference (Ghiringhelli et al. 2015), are combinations of atomic properties that best represent \(\Delta E\) linearly,

$$\begin{aligned} D_1&= D_1(\text {IP}(B),\text {EA}(B), r_p(A)) \ , \end{aligned}$$
(42)
$$\begin{aligned} D_2&= D_2(r_s(A), r_p(B)) \ . \end{aligned}$$
(43)

As such, they incorporate knowledge that generally leads to higher TCMI scores for the same feature-subset cardinality. While this holds for the first and second subset dimensions, feature subsets containing \(D_1\) and \(D_2\) are on par with feature subsets based on plain atomic properties at higher dimensions. However, \(D_1\) and \(D_2\) are not selected consistently by TCMI, because TCMI does not make any assumption about the linearity of the dependency \((D_1, D_2) \mapsto \Delta E\). This suggests that the linear combination of \(D_1\) and \(D_2\) is a good, but not complete, description of the energy difference \(\Delta E\).

Fig. 7 Feature spaces of the topmost selected feature subsets for one (left) and two dimensions (right). Shown are the two classes of crystal-lattice structures as diamonds (zinc blende) and squares (rock salt), their distribution, and the trend line/manifold in the prediction of the energy difference \(\Delta E\) between rock salt and zinc blende. The trend line/manifold was computed with the gradient boosting decision tree algorithm (Friedman 2001) and 10-fold cross-validation. For reference, some octet-binary compound semiconductors are labeled

A visualization of the relevant subsets also reveals clear monotonic relationships in one and two dimensions (Fig. 7). In addition, we constructed machine-learning models for each feature subset and report model statistics for the prediction of \(\Delta E\) along with statistics for the full feature set (Table 7); details can be found in the appendix. We partitioned the data set into \(k = 10\) groups (so-called folds) and generated k machine-learning models, using 9 folds to build each model and the k-th fold to test it (10-fold cross-validation). To reduce variability, we performed five rounds of cross-validation with different partitions and averaged the rounds to obtain an estimate of the model’s predictive performance. For the machine-learning models we used the gradient boosting decision tree (GBDT) algorithm (Friedman 2001). GBDT is resilient to feature scaling (Eq. 12), just like TCMI, and is one of the best available, award-winning, and versatile machine-learning algorithms for classification and regression (Natekin and Knoll 2013; Fernández-Delgado et al. 2014; Couronné et al. 2018). Notwithstanding this, traditional methods sensitive to feature scaling may show superior performance for data sets with sample sizes larger than the number of considered features (Lu and Petkova 2014) (compare also the model performances in Table 7 with references (Ghiringhelli et al. 2015; Ouyang et al. 2018; Ghiringhelli et al. 2017)).
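A sketch of this evaluation protocol, using scikit-learn's GradientBoostingRegressor as a stand-in for the GBDT model (hyperparameters left at their defaults, which is an assumption), could look as follows:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

def evaluate_subset(X_subset, delta_e, seed=0):
    """Five rounds of 10-fold cross-validation of a GBDT model on one feature subset.

    X_subset: (n_samples, n_features) array restricted to the selected features.
    delta_e:  target energy differences between RS and ZB structures.
    """
    cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=seed)
    model = GradientBoostingRegressor(random_state=seed)
    # scikit-learn returns negated errors; flip the sign for reporting.
    rmse = -cross_val_score(model, X_subset, delta_e, cv=cv,
                            scoring="neg_root_mean_squared_error")
    mae = -cross_val_score(model, X_subset, delta_e, cv=cv,
                           scoring="neg_mean_absolute_error")
    return {"RMSE": rmse.mean(), "RMSE_std": rmse.std(),
            "MAE": mae.mean(), "MAE_std": mae.std()}
```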

Machine-learning models are designed to improve with more data and with a feature subset that best represents the data for the machine-learning algorithm (Friedman 2001; James et al. 2013). Therefore, we expect a general trend of higher model performance with larger feature-subset cardinality. Furthermore, we do not expect the optimal feature subset of TCMI to perform best for every machine-learning model (“No free lunch” theorem: Wolpert (1996b, 1996a), Wolpert and Macready (1995, 1997)), as an optimal feature subset identified by the feature-selection criterion TCMI may not be the same according to other evaluation criteria such as the root-mean-squared error (RMSE), mean absolute error (MAE), maximum absolute error (MaxAE), or Pearson coefficient of determination (\(r^2\)). This fact is evident in our analysis. The choice of GBDT may not be optimal, because its predictive performance generally decreases with the number of variables (compare the model performance with all 16 variables to a subset with two or four variables, Table 7). However, to the best of our knowledge, there is no other machine-learning algorithm that models data without making assumptions about the functional form of the dependence, is independent of an intrinsic metric, and can operate on a small number of data samples. Therefore, we focus only on the predictive performance of the found subsets compared to the predictive performance obtained with all variables in the data set (Table 7).

The results confirm the general trend of higher model performance with larger feature-subset cardinality and show that the initial set of 16 variables can be reduced to 6 variables without decreasing model performance. Essentially, feature subsets with three to four variables are already as good as a machine-learning model with all 16 variables, where the large number of variables already starts to degrade the prediction performance of the GBDT model. The overall performance gradually increases with the subset cardinality. However, our analysis reveals significant variability in performance, with a higher standard deviation for feature subsets with smaller dependence scores than for those with larger ones.

An exhaustive search for the best GBDT model yields an optimum of seven features to best predict the energy difference between the rock salt and zinc blende crystal structures, with \(D_1\) and \(D_2\) excluded,

$$\begin{aligned} \{\text {EA}(A),\text {IP}(A),r_d(A),r_p(A),\text {IP}(B),r_s(B),r_p(B)\} \\ \text {RMSE}: 0.11,\ \text {MAE}: 0.08,\ \text {MaxAE}: 0.27,\ r^2: 0.92 \ . \end{aligned}$$

In contrast to the optimal feature subsets of TCMI (cf., Table 7), the optimal GBDT feature set is a variation of the optimal TCMI subsets with the highest-occupied Kohn-Sham level and the ionization potential interchanged, \(\text {H}(A) \leftrightarrow \text {IP}(A)\), and the lowest-unoccupied Kohn-Sham level, \(\text {L}(B)\), missing. The model performances demonstrate that the optimal feature subsets of TCMI are close to the model’s optimum and corroborate the usefulness of TCMI in finding relevant feature subsets for machine-learning predictions. Slight differences in performance are mainly due to the variance of the cross-validation procedure and the small number of 82 data samples, which effectively limits the reliable identification of larger feature subsets in the case of TCMI (Table 5).

9 Discussion

Although TCMI is a non-parametric, robust, and deterministic measure, its biggest limitation is its computational complexity. For small data sets (\(n < 500\)) and feature subsets (\(d < 5\)), feature selection finishes within minutes to hours on a modern computer. For larger data sets, however, TCMI scales as \({\mathcal {O}}(n^d)\) and quickly exceeds any realizable runtime. Furthermore, the search for the optimal feature subset also needs to be improved. Even though in our analysis only a fraction of less than one percent of the possible search space had to be evaluated, TCMI was evaluated hundreds of thousands of times. Future research towards pairwise evaluations (Peng et al. 2005), Monte Carlo sampling (Fouché and Böhm 2019; Fouché et al. 2021), or the gradual evaluation of features based on iterative sample-refinement strategies will show to what extent the computational costs of TCMI can be reduced.

A further limitation is that non-relevant variables may be selected in the optimal feature subsets when only a limited number of data points is available (cf., Sect. 8.1.5). By construction, the identification of feature subsets depends on the feature-selection search strategy (cf., Sect. 1). The results show that it is critical to use optimal search strategies, because sub-optimal search strategies can report subsets of features that are not related to an output. Even if the exhaustive search for feature subsets is computationally intensive, it can be implemented efficiently, e.g., by using the branch-and-bound algorithm. In our implementation, the branch-and-bound algorithm was used to search for optimal, i.e., minimal non-redundant, feature subsets. However, as our results demonstrate, different feature subsets with few or no common features may lead to similar dependence scores. The main reason for this outcome is that features may be correlated with each other and therefore contain redundant information about the dependences. Including such redundant features would likely lead to a higher stability of the method, more consistent results, and better insights into the actual dependence. If a machine-learning algorithm is given, the best option at present is to generate predictive models for each of the found feature subsets and select the one that works best.

10 Conclusions

We constructed a non-parametric and deterministic dependence measure based on cumulative probability distributions (Rao et al. 2004; Rao 2005) and proposed the fraction of cumulative mutual information, \({\mathcal {D}}(Y; \vec {X})\), an information-theoretic divergence measure to quantify dependences of multivariate continuous distributions. Our measure can be estimated directly from sample data using well-defined empirical estimates (Sect. 3). The fraction of cumulative mutual information quantifies dependences without breaking the permutation invariance of feature exchanges, i.e., \({\mathcal {D}}(Y; \vec {X}) = {\mathcal {D}}(Y; \vec {X}')\) for all \(\vec {X}' \in {\text {perm}}(\vec {X})\), while being invariant under positive monotonic transformations. Measures based on mutual information increase monotonically with the cardinality of feature subsets and the sample size. To turn the fraction of cumulative mutual information into a convex measure, we related the strength of a dependence to the dependence of the same set of variables under the independence assumption of random variables (Vinh et al. 2009, 2010). We further constructed a measure based on residual cumulative probability distributions and introduced the total cumulative mutual information \(\langle \hat{{\mathcal {D}}}_\text {TCMI}^*(Y; \vec {X}) \rangle \).

Tests with simulated and real data corroborate that total cumulative mutual information is capable of identifying the relevant features of linear and nonlinear dependences. The main applications of total cumulative mutual information are to assess dependences, to reduce an initial set of variables before processing scientific data, and to identify relevant subsets of variables that jointly have the largest mutual dependence and minimum redundancy with respect to an output. However, the runtime of total cumulative mutual information still grows exponentially with the subset dimensionality, which can outweigh its potential benefits. In future work, we will address the performance issues of TCMI and the stability of identified feature subsets, and provide a feature-selection framework that is also suitable for discrete, continuous, and mixed data types. We will also apply TCMI to current problems in the physical sciences, with a practical focus on identifying feature subsets that simplify subsequent data-analysis tasks.

Since total cumulative mutual information identifies dependences with strong mutual contributions, it is applicable to a wide range of problems directly operating on multivariate continuous data distributions. In particular, it does not need to quantize variables by using probability density estimation, clustering, or discretization prior to estimating the mutual dependence between variables. Thus, total cumulative mutual information has the potential to promote an information-theoretic understanding of functional dependences in different research areas and to gain more insights from data.

Supplementary information. We implemented total cumulative mutual information in Python. Our Python-based implementation is part of B.R.’s doctoral thesis and is made publicly available under the Apache License 2.0.

All data and scripts involved in producing the results can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.6577261). An online tutorial to reproduce the main results presented in this work can also be found on GitHub (https://github.com/benjaminregler/tcmi) or in the NOMAD Analytics Toolkit (https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/tcmi.ipynb).