Abstract
The purpose is to use Shannon entropy measures to develop classification techniques and an index that estimates the separation of the groups in a finite mixture model. These measures can be applied to machine learning techniques such as discriminant analysis, cluster analysis and exploratory data analysis. If we know the number of groups and have training samples from each group (supervised learning), the index is used to measure the separation of the groups. Here some entropy measures are used to classify new individuals in one of these groups. If we are not sure about the number of groups (unsupervised learning), the index can be used to determine the optimal number of groups from an entropy (information/uncertainty) criterion. It can also be used to determine the best variables for separating the groups. In all cases we assume absolutely continuous random variables and we use the Shannon entropy based on the probability density function. Theoretical, parametric and non-parametric techniques are proposed to approximate these entropy measures in practice. An application to gene selection in a colon cancer discrimination study with many variables is provided as well.
1 Introduction
Measuring the uncertainty associated with a random variable is a task of great and increasing interest. Since the pioneering work of Shannon (1948), in which the concept of Shannon entropy was defined as the average level of information or uncertainty related to a random event, several measures of uncertainty with different purposes have been defined and studied. The Shannon (differential) entropy associated with a random vector \(\textbf{X}\) with an absolutely continuous distribution is a good way to measure the uncertainty of the data from \(\textbf{X}\). It is defined by
where f is the probability density function of \(\textbf{X}\) and \(\log \) is the natural logarithm, see Shannon (1948). Several generalizations and extensions of the Shannon entropy have been proposed in the literature with the aim of better analyzing the uncertainty in different scenarios. Among them we recall the weighted entropy (Di Crescenzo and Longobardi 2006), the cumulative entropies (Balakrishnan et al. 2022; Rao et al. 2004), Tsallis entropy (Tsallis 1988) and Rényi entropy (Rényi 1961).
In many applications, the distribution of \(\textbf{X}\) is a finite mixture of s distributions with probabilities \(p_1,\dots ,p_s\ge 0\) such that \(p_1+\dots +p_s=1\). Perhaps the main application nowadays is the assessment of differential expression from high dimensional genomic data, but there are many others. For example, applications to additive noise models in communication channels or to the thermodynamics of computation can be seen in Melbourne et al. (2022) and in the references therein. Results for Gaussian (normal) mixtures in several scenarios where the transmitters utilize pulse amplitude modulation constellations can be seen in Moshksar and Khandani (2016).
Several criteria are available in the literature to determine the optimal number of groups in a mixture model. An entropy criterion called NEC (normalized entropy criterion) to estimate the number of clusters arising from a mixture model was proposed in Celeux and Soromenho (1996). There it is compared with other popular indices such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). A recent modification was proposed in Biernacki et al. (1999). The main advantage of the entropy indices is that they do not depend on the number of unknown parameters in the mixture model (as AIC and BIC do).
In this paper we propose a new index based on Shannon entropy to measure the separation of the groups in a mixture. Like the NEC, this index does not depend on the number of unknown parameters and it can be estimated by using non-parametric techniques. This index can be used to determine the optimal number of groups in a mixture model in both discriminant analysis (supervised methods) and cluster analysis (unsupervised methods). This approach is also used to propose discriminant criteria to classify new individuals in one of these groups. The procedures are illustrated by using simulated examples (which show the accuracy of the empirical measures) and real data examples. A real case study dealing with the selection of discriminant variables in a colon cancer data set is provided as well.
The rest of the paper is organized as follows. In Sect. 2, we introduce the notation, the main definitions and the preliminary results. The new index and the applications to discriminant analysis are placed in Sect. 3. The examples are in Sect. 4. The application to the colon cancer data set is done in Sect. 5. Finally, Sect. 6 contains the conclusions and some tasks for future research projects.
2 Notation and Preliminary Results
First, we present the notions for the univariate case. The Shannon entropy associated with a random variable X with probability density function (PDF) f is defined by
where \(\log \) represents the natural logarithm and, by convention, \(0\log 0=0\). The value H(X) is used to measure the uncertainty (dispersion) in the values of X, see Shannon (1948). The random variable \(-\log (f(X))\) is called the information content of X in Di Crescenzo et al. (2021). For other entropy measures see Balakrishnan et al. (2022), Buono and Longobardi (2020), Di Crescenzo and Longobardi (2006), Rao et al. (2004), Rényi (1961), Tsallis (1988) and the references therein.
If \(X_{1},\dots ,X_{n}\) is a sample of independent and identically distributed (IID) random variables from X, then H(X) can be estimated with
where \(\widehat{f}_n\) is an estimator for the PDF f based on \(X_{1},\dots ,X_{n}\). Here we can use both parametric and non-parametric estimators. In the first case, we assume a known functional form for f (e.g. exponential or Gaussian) with some unknown parameters (mean, variance, etc.) that are estimated from the sample. In the second case, an empirical estimator for f is used (e.g. a kernel density estimator).
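As an illustration, the plug-in estimator above can be sketched for a Gaussian sample, once parametrically (plugging in the sample mean and standard deviation) and once non-parametrically with a Gaussian kernel density estimator. The helper names and the bandwidth choice (Silverman's rule) are our own, not prescriptions of the paper:

```python
import numpy as np

# Plug-in estimate H(X) ~ -(1/n) sum log f_hat(X_i) for a N(0,1) sample,
# whose exact differential entropy is 0.5*log(2*pi*e) ~ 1.4189 nats.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)

# Parametric estimator: assume normality, plug in sample mean and sd
mu, sig = x.mean(), x.std()
h_param = -np.mean(-np.log(sig * np.sqrt(2 * np.pi))
                   - 0.5 * ((x - mu) / sig) ** 2)

# Non-parametric estimator: Gaussian KDE with Silverman's bandwidth
bw = 1.06 * x.std() * len(x) ** (-1 / 5)
diffs = (x[:, None] - x[None, :]) / bw
f_hat = np.exp(-0.5 * diffs ** 2).mean(axis=1) / (bw * np.sqrt(2 * np.pi))
h_kde = -np.mean(np.log(f_hat))
```

Both estimates are close to the exact value; the KDE version carries a small extra smoothing bias.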
If we fix a value t and we consider the values of X below and above t, that is, we consider the conditional random variables \((X|X\le t)\) and \((X|X>t)\), then the entropy of X can be rewritten (see Proposition 2.1 in Di Crescenzo and Longobardi 2002) as
where \(F(t)=\Pr (X\le t)\) is the distribution function of X, \(\bar{F}(t)=1-F(t)=\Pr (X>t)\) is the reliability (or survival) function of X,
is the entropy of the past lifetime \((X|X\le t)\),
is the entropy of the residual lifetime \((X-t|X> t)\), and
is the entropy of the discrete (Bernoulli) random variable necessary to distinguish between the two groups. A similar representation holds for any partition of the support of X with s disjoint sets (groups). A bivariate version of (2) was obtained in Ahmadi et al. (2015).
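The decomposition (2) can be checked numerically. The following sketch (our own, with ad hoc integration grids) verifies it for a standard exponential distribution at the cut point \(t=1\), where the exact entropy is 1 and, by the lack of memory, the residual entropy also equals 1:

```python
import numpy as np

def trap(y, x):
    # trapezoid rule on a grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

t = 1.0
F = 1.0 - np.exp(-t)  # F(t) for Exp(1)

# past-lifetime part: density f(x)/F(t) on (0, t]
xp = np.linspace(1e-9, t, 100_001)
fp = np.exp(-xp) / F
h_past = -trap(fp * np.log(fp), xp)

# residual part: density f(x)/(1-F(t)) on (t, infinity), truncated
xr = np.linspace(t, 60.0, 600_001)
fr = np.exp(-xr) / (1.0 - F)
h_res = -trap(fr * np.log(fr), xr)

# Bernoulli term and the full decomposition (2)
HG = -F * np.log(F) - (1.0 - F) * np.log(1.0 - F)
h_total = F * h_past + (1.0 - F) * h_res + HG  # should recover H(X) = 1
```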
If X contains two groups \(G=1\) and \(G=0\) with respective PDF \(f_1\) and \(f_0\), then \(f=pf_1+(1-p)f_0\), where \(p=\Pr (G=1)\). Hence the entropy with the two groups together (i.e. the entropy of the mixture) is
Expression (2) can be used to define the entropy in X with the two groups as follows.
Definition 1
If X has a mixture PDF \(f=pf_1+(1-p)f_0\), then the entropy of the two groups is
where
is the entropy of the first group and
is the entropy of the second group. The efficiency in the division made by the groups is defined as
where H(X) is given in (3).
The efficiency is also called the concavity deficit in Melbourne et al. (2022) and can be interpreted as a generalization of the Jensen-Shannon divergence measure (see Briët and Harremoës 2009).
If \(S_i=\{x: f_i(x)>0 \}\), \(i=0,1\) are the supports of the two groups and \(S_1\cap S_0=\emptyset \) (the two groups are completely separated), then \(p=\Pr (X\in S_1)\),
and \(Eff^{(2)}(X)=H_2(G)\ge 0\), that is, the efficiency coincides with the entropy to distinguish between the two groups given by
Moreover, if \(0<p<1\), then \(H^{(2)}(X)<H(X)\) and the division in two groups is effective since it decreases the uncertainty (the entropy).
Another extreme case is when \(f_1=f_0\) (identically distributed groups), where
for any \(p\in [0,1]\). Here, \(H^{(2)}(X)=H(X)\) and \(Eff^{(2)}(X)=0\) tell us that it is not a good idea to consider two groups since the uncertainty does not change.
Thus, we can say that the division in two groups is efficient if \(H(X)> H^{(2)}(X)\) since, in this case, it decreases the uncertainty in X. This is always the case when \(H(X)> H(X|G=i)\) for \(i=0,1\), that is, when the uncertainties in the groups are smaller than the uncertainty in the mixed population (a reasonable property).
The following proposition shows that the efficiency is related to the Kullback–Leibler (KL) divergence measure between the densities. If f and g are two PDFs, the KL-divergence measure (or relative entropy) is defined as
see e.g. Melbourne et al. (2022). It can be proved that \(KL(f|g)\ge 0\) and that \(KL(f|g)=0\)
if and only if \(f=g\) (a.e.). This proposition also shows that the efficiency is non-negative and bounded by \(H_2(G)\). This bound can be traced back to Grandvalet and Bengio (2005) and Cover and Thomas (2006). Some improvements of this bound were obtained in Melbourne et al. (2022) and in Moshksar and Khandani (2016) (Gaussian mixtures). The relationship with the KL divergence measure can be seen in, e.g., (22) of Melbourne et al. (2022). To keep the paper self-contained, we provide a proof in the Appendix since it is a key result for our purposes.
Proposition 1
Let \(f=pf_1+(1-p)f_0\) be the PDF of X, then
Moreover, if \(f_1\ne f_0\) (a.e.) and \(0<p<1\), then \(Eff^{(2)}(X)>0\).
Equation (5) implies that \(H(X) \ge H^{(2)}(X)\) under a mixture model with two groups.
A straightforward calculation leads to the generalization to the mixture model with s groups, stated in the following relationship:
where \(H^{(s)}(X)=\sum _{i=1}^s p_iH(X|G=i)\). Therefore, \(Eff^{(s)}(X)=H(X)-H^{(s)}(X)\) can be formulated as a weighted sum of KL divergences between the class conditional PDFs of the groups and the PDF of the mixture. Expression (6) proves the non-negativity of \(Eff^{(s)}(X)\) whenever there exists an underlying group structure for the variable X. It can also be used to assess the overlapping of the class structure. Further investigation is needed to elucidate the usefulness of \(Eff^{(s)}(X)\) in statistical practice. Possible applications include its use as an auxiliary tool to determine the number of groups in cluster analysis, or its use in genomic studies to select the variables with the greatest potential to discriminate a clinical outcome.
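As a numerical sanity check of this KL representation in the two-group case, the following sketch (our own; the grid and the Gaussian components are arbitrary choices) verifies that \(H(X)-H^{(2)}(X)\) coincides with the weighted sum of KL divergences and stays below \(H_2(G)=\log 2\):

```python
import numpy as np

# Mixture f = p*f1 + (1-p)*f0 of N(2,1) and N(-2,1) on a fine grid
x = np.linspace(-12.0, 12.0, 240_001)
dx = x[1] - x[0]
p = 0.5
phi = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
f1, f0 = phi(2.0), phi(-2.0)
f = p * f1 + (1 - p) * f0

# Eff^(2)(X) = H(X) - H^(2)(X) computed directly...
H = -np.sum(f * np.log(f)) * dx
H2 = (-p * np.sum(f1 * np.log(f1)) - (1 - p) * np.sum(f0 * np.log(f0))) * dx
eff = H - H2

# ...and as the weighted sum of KL divergences p*KL(f1|f) + (1-p)*KL(f0|f)
kl = (p * np.sum(f1 * np.log(f1 / f)) + (1 - p) * np.sum(f0 * np.log(f0 / f))) * dx
```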
The next proposition proves that the efficiency increases when we divide a group into two subgroups. The proof is given in the Appendix.
Proposition 2
Let \(f=p_1f_1+(1-p_1)f_0\) be the PDF of X and let us assume that \(f_0=qf_2+(1-q)f_3\) for some \(q\in [0,1]\). Then \(H^{(2)}(X)\ge H^{(3)}(X)\) and
where \(H_3(G)=-\sum _{i=1}^3p_i\log p_i\), \(p_2=(1-p_1)q\) and \(p_3=(1-p_1)(1-q)\).
Now we can state the results for the k-dimensional case. Let \(\textbf{X}=(X_1,\dots ,X_k)\) be a random vector with an absolutely continuous joint distribution and joint PDF f. Then the (multivariate) Shannon entropy is defined by
The estimator for \(H(\textbf{X})\) is defined as in the univariate case. An expression similar to (2) can be obtained for \(H(\textbf{X})\) when the support of \(\textbf{X}\) is divided into s disjoint sets (groups).
The results for this general case are stated below. They are completely analogous to the results for the univariate case, so we omit the proofs.
Definition 2
If \(\textbf{X}\) has a joint PDF \(f=\sum _{i=1}^s p_if_i\), the entropy of the s groups is
where
is the entropy of the ith group for \(i=1,\dots ,s\). The efficiency in the division made by the s groups is defined as
where \(H(\textbf{X})\) is given in (7).
Proposition 3
Let \(f=\sum _{i=1}^s p_if_i\) be the PDF of \(\textbf{X}\), then
where \(H_s(G)=-\sum _{i=1}^s p_i\log p_i\ge 0\). Moreover, if, for some i, \(f_i\ne f\) (a.e.) and \(0<p_i<1\), then \(Eff^{(s)}(\textbf{X})>0\).
3 New Results
Following the idea of the normalized entropy criterion (NEC) defined in Celeux and Soromenho (1996) and Biernacki et al. (1999), we can define the following index to measure the relative efficiency of the division made by the s groups. This index can be used to decide about the optimal number of groups (clusters) in a finite mixture model (including the case of no groups).
Proposition 2 proves that the efficiency is not a good criterion for determining the optimal number of groups, since it always increases when a group is divided in two. So we use the upper bound in (9) to propose a relative efficiency measure.
Definition 3
If \(f=\sum _{i=1}^s p_if_i\) is the PDF of \(\textbf{X}\), then we define the relative efficiency of the division in s groups, shortly denoted as RED(s), by
Note that \(0\le RED(s)\le 1\) and that we should choose the value of s which leads to a maximum of RED(s). If this value is close to zero, then we should consider just one group (i.e. no groups). In practice, these values will be replaced with their estimations (see next section). Note that the indices \(RED(2),RED(3),\dots \) are not necessarily ordered.
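To illustrate the behavior of RED(2), the following sketch (our own, using numerical integration on an ad hoc grid) computes it for a 50–50 mixture of \(N(\mu ,1)\) and \(N(-\mu ,1)\): the index is near 1 for well separated groups, near 0 for strongly overlapping ones, and exactly 0 for identical groups:

```python
import numpy as np

def red2(mu, p=0.5):
    # RED(2) = [H(X) - H^(2)(X)] / H_2(G) for a two-Gaussian mixture
    x = np.linspace(-16.0, 16.0, 320_001)
    dx = x[1] - x[0]
    phi = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
    f1, f0 = phi(mu), phi(-mu)
    f = p * f1 + (1 - p) * f0
    H = -np.sum(f * np.log(f)) * dx
    H2 = (-p * np.sum(f1 * np.log(f1)) - (1 - p) * np.sum(f0 * np.log(f0))) * dx
    HG = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return (H - H2) / HG
```

For instance, `red2(2.0)` (means 4 standard deviations apart) is close to 1, while `red2(0.1)` is close to 0.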
Theorem 1 in Melbourne et al. (2022) provides an upper bound for RED(s) based on \(f_1,\dots ,f_s\) written as
where \(T_s=\max _{j=1,\dots ,s} \left\| f_j-\widehat{f}_j\right\| _{TV}\), \(\left\| g\right\| _{TV}=\frac{1}{2}\int _{\mathbb {R}} |g(x)|dx\) is the Total Variation (TV) distance and
is the mixture complement of \(f_j\). Note that \(f=p_j f_j+(1-p_j) \widehat{f}_j\).
Let us see now how to apply this approach to classify new individuals in one of these groups. As mentioned above, let us assume here that our population is divided into s groups and that we want to use the numerical random variables \(X_1,\dots ,X_k\) to classify new individuals into one of these groups. To simplify the notation let us assume that \(k=1\) and \(s=2\), but the same techniques can be applied for \(k>1\) and \(s>2\) (see Example 3).
Let us assume first that the PDFs \(f_1\) and \(f_0\) of the two groups are known. Then we need to determine two disjoint regions \(R_1\) and \(R_0\) such that \(R_1\cup R_0=\mathbb {R}\) in order to classify an individual with a value X in the first (second) group when \(X\in R_1\) (\(X\in R_0\)). Two typical (classical) solutions are the maximum likelihood criterion, which defines \(R_1\) as
and the maximum posterior probability criterion with
where \(p=\Pr (G=1)\). We want to provide an alternative option based on entropy.
The ideal case is when the respective supports of the groups \(S_1=\{x: f_1(x)>0\}\) and \(S_0=\{x: f_0(x)>0\}\) are disjoint sets. In that case, the entropy can be written from (2) as
where \(p=\Pr (G=1)=\Pr (X\in S_1)\), \(1-p=\Pr (G=0)=\Pr (X\in S_0)\),
and \(H_2(G)=-p\log p-(1-p)\log (1-p).\) In this case \(RED(2)=1\).
This case is unrealistic since usually the populations have values in common regions. So we might try to determine the region \(R_1\) that minimizes
where \(p_1=\Pr (X\in R_1)\) and \(p_0=1-p_1= \Pr (X\in R_0)\).
In the ideal case with \(S_1\cap S_0=\emptyset \), the optimal region is \(R^{opt}_1=S_1\) and we have
Clearly, for \(R_1=\mathbb {R}\) we get \(p_1=1\) and \(H(R_1)\) coincides with the entropy of group 1. For \(R_1=\emptyset \), \(p_1=0\) and we get the entropy of group 0. Hence, \(H(R^{opt}_1)\le H(X|G=i)\) for \(i=0,1\). So, from (4) and for the optimal region \(R^{opt}_1\) we get
Hence \(H(X)\ge H^{(2)}(X) \ge H(R^{opt}_1).\) Thus, if we define the effectiveness of \(R_1\) as
we get \(Eff(R_1^{opt})\ge Eff(R_1)\) and \( Eff(R_1^{opt})\ge Eff^{(2)}(X)\ge 0.\)
We must say that it is not easy to solve the theoretical problem that leads to the optimal region \(R_1^{opt}\). In the univariate case, if the mean of the first group is larger than that of the second, we might assume \(R_1=[t,\infty )\); then \(H(R_1)\) is just a function of t that can be plotted (numerically) in order to find its minimum value.
In order to simplify the calculations in practice, if we have two IID samples \(X_1,\dots ,X_n\) and \(Y_1,\dots ,Y_m\) from \(f_1\) and \(f_0\), respectively, the entropies in the groups can be approximated by
and
when the PDF \(f_1\) and \(f_0\) are known. If they are unknown, they will be replaced by parametric or non-parametric estimations. Therefore, \(H^{(2)}(X)\) can be estimated with
when \(f_1,f_0\) and p are known.
The entropy determined by the region \(R_1=[t,\infty )\) can be written as
where \(p_1=\Pr (X\in R_1)=\Pr (X>t)\). Hence, it can be approximated with
where \(\widehat{p}_1=\left( \sum _{i=1}^n 1(X_i>t) +\sum _{i=1}^m 1(Y_i>t) \right) /(n+m)\) and \(\widehat{p}_0=1-\widehat{p}_1\).
These sample entropy measures can be used to define a new classification criterion. Thus, if we have a new individual with value \(Z=t\), we can compute this entropy by considering that Z belongs to group 1
or to group 0,
It should be classified into the group with the minimum entropy. If \(p=0.5\), \(n=m\) and we replace \(n+1\) and \(m+1\) with n and m, then this criterion is equivalent to the maximum likelihood criterion. For an arbitrary probability p, if we still replace \(n+1\) and \(m+1\) with n and m, respectively, then \(\widehat{H}_1(t)\le \widehat{H}_0(t)\) holds if and only if
This is also a reasonable criterion to determine \(R_1\) similar to that based on the posterior probabilities. As mentioned above, in practice, the unknown PDF \(f_1\) and \(f_0\) should be replaced with parametric or non-parametric estimations. If p is unknown and it is estimated with \(\widehat{p}=n/(n+m)\), then this criterion is again equivalent to the maximum likelihood criterion. The same can be done in the k-dimensional case or when we have more than two groups. Let us see some examples.
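A minimal sketch of this rule (in the version with \(n+1\) replaced by n, and with helper names of our own) is given below. Since only the terms involving the new value t differ between \(\widehat{H}_1(t)\) and \(\widehat{H}_0(t)\), the comparison reduces to those two contributions:

```python
import numpy as np

def classify_min_entropy(t, n, m, p, log_f1, log_f0):
    # Assigning t to group 1 adds -p*log f1(t)/n to the entropy estimate;
    # assigning it to group 0 adds -(1-p)*log f0(t)/m. Pick the smaller.
    h1_extra = -p * log_f1(t) / n
    h0_extra = -(1 - p) * log_f0(t) / m
    return 1 if h1_extra <= h0_extra else 0

# Exponential groups as in Example 1: mean 1 (group 1), mean 0.5 (group 0)
log_f1 = lambda t: -t                     # log density of Exp(mean 1)
log_f0 = lambda t: np.log(2.0) - 2.0 * t  # log density of Exp(mean 0.5)

group_hi = classify_min_entropy(2.0, 50, 50, 0.5, log_f1, log_f0)  # above log 2
group_lo = classify_min_entropy(0.3, 50, 50, 0.5, log_f1, log_f0)  # below log 2
```

With \(p=0.5\) and \(n=m\) this is exactly the maximum likelihood rule, whose boundary here is \(t=\log 2\).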
4 Examples
In the first example we consider a population having a mixture of two (univariate) exponential distributions. The purpose is to show the accuracy of the empirical measures.
Example 1
Let us assume that the two groups have exponential distribution functions \(F_i(t)=1-\exp (-t/\mu _i)\) for \(t\ge 0\) and \(i=0,1\) with means \(\mu _1=1\) and \(\mu _0=0.5\). The entropy of the exponential model is
As expected, it is increasing in \(\mu \) since the variance is \(\mu ^2\). Hence, the entropies of the groups are \(H(X|G=1)=H(1)=1\) and \(H(X|G=0)=H(0.5)=1-\log 2=0.3068528\). The values of the first group are more dispersed (i.e. X has a larger uncertainty in that group).
Let us consider a fifty-fifty mixture of these two groups, that is,
for \(t\ge 0\). A straightforward calculation shows that its entropy is \(H(X)=0.7072083\). This value is between the entropies of the two groups. This is due to the facts that the first group has a large uncertainty (compared with the other) and that the two groups share similar values (the supports are not disjoint sets). So, when we mix them, the uncertainty decreases with respect to the first group. The entropy with two groups defined by (4) is then
In this case, the division in two groups is effective since \(H(X)>H^{(2)}(X)\), and the RED index is
Its closeness to 0 confirms that the two groups are really mixed (as mentioned above).
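The exact values in this example can be reproduced numerically. The following sketch (our own, with a truncated integration grid) recovers \(H(X)\approx 0.7072083\) and a RED index of about 0.0776:

```python
import numpy as np

# 50-50 mixture of exponentials with means 1 and 0.5 (Example 1)
x = np.linspace(0.0, 60.0, 600_001)
dx = x[1] - x[0]
f1 = np.exp(-x)              # density of Exp(mean 1)
f0 = 2.0 * np.exp(-2.0 * x)  # density of Exp(mean 0.5)
f = 0.5 * f1 + 0.5 * f0

# Trapezoid rule for H(X) = -int f log f (tail beyond 60 is negligible)
g = -f * np.log(f)
H = float(np.sum((g[1:] + g[:-1]) / 2) * dx)

H2 = 0.5 * 1.0 + 0.5 * (1.0 - np.log(2.0))  # exact group entropies
RED = (H - H2) / np.log(2.0)
```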
Now we simulate two samples \(X_1,\dots ,X_n\) and \(Y_1,\dots ,Y_n\) from these distributions with \(n=m=50\) IID data in each group. The approximations of the entropies obtained with these samples and the exact PDF are
and
The entropy in the mixed population can be estimated in a similar way with
The entropy with two groups \(H^{(2)}(X)\) can be approximated (by assuming \(p=0.5\)) with
The RED index is then approximated as
These approximations can be improved by increasing n and m. Note that p can also be estimated from the sample sizes (if we have a sample from the mixed population).
If we assume that the means of the exponential distributions are unknown and we estimate them with the sample means \(\bar{X}=1.130371\) and \(\bar{Y}=0.4805333\), we get the approximations 1.122546 for \(H(X|G=1)=1\) and 0.2671412 for \(H(X|G=0)=0.3068528\). The values H(X) and \(H^{(2)}(X)\) can be approximated in a similar way, obtaining 0.7822224 and 0.6948435, respectively. Then the approximation of the RED index is 0.1260611. If we do not know that the data come from exponential models, we can use empirical kernel estimators for \(f_1\) and \(f_0\) based on the respective samples.
Let us determine now the optimal regions to separate these two groups. In this example we can assume \(R_1=[t,\infty )\) since \(\mu _1=1>\mu _0=0.5\). The value of t for the optimal region under the maximum likelihood (or the maximum posterior probability) criterion is obtained by solving \(f_1(t)=f_0(t)\) for \(t>0\). This equation leads to the value \(t_{ML}=\log 2=0.6931472\). The exact misclassification probabilities are
and the total misclassification probability with (known) prior probabilities \(p=0.5\) and \(1-p=0.5\) is
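For completeness, these error probabilities can be computed in closed form (our own check): at \(t_{ML}=\log 2\) the errors are \(F_1(t_{ML})=0.5\) for group 1 and \(\bar{F}_0(t_{ML})=0.25\) for group 0, giving a total of 0.375:

```python
import math

t_ml = math.log(2.0)             # ML threshold, solving f1(t) = f0(t)
err1 = 1.0 - math.exp(-t_ml)     # P(X <= t | G=1): group 1 misclassified
err0 = math.exp(-2.0 * t_ml)     # P(X > t | G=0): group 0 misclassified
total = 0.5 * err1 + 0.5 * err0  # prior-weighted total error
```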
If we want to use the criterion based on the entropy given in (10) for \(R_1=[t,\infty )\), we get
where
and \(p_0=1-p_1= \Pr (X\in R_0)=\Pr (X<t)\). A direct calculation leads to
The plot can be seen in Fig. 1 (left, red). As stated above \(H(\infty )=0.3068528=1-\log (2)=H(X|G=0)\) and \(H(0)=1=H(X|G=1)\). The optimal value with the minimum entropy criterion is \(t_{ME}=1.115213\) getting \(H(t_{ME})=0.1776504\). With this value, the exact misclassification probabilities are
and 0.3898186. The total misclassification probability is greater than that obtained with the ML criterion. By using the approximated version of this criterion given in (12), we obtain \(\hat{t}_{ME}=0.86534\). The plot can be seen in Fig. 1 (left, black). With this value, the exact misclassification probabilities are
and 0.37813 which is again a little bit greater than the error obtained with the ML criterion.
If we use the criterion based on the empirical entropy (with known means) for the samples obtained above by replacing \(n+1\) with n, we get the functions \(\widehat{H}_1\) and \(\widehat{H}_0\) plotted in Fig. 1, right. In this case the values that lead to a classification in group 1 belong to \(R_1=[0.6931472,\infty )\). It coincides with the region determined by the maximum likelihood criterion since \(p=0.5=n/(n+m)\).
In the second example we consider a mixture of two univariate normal (Gaussian) distributions. In this case, we replace the exact calculations with approximations.
Example 2
In the first case we consider a population obtained by mixing two normal distributions with means \(\mu _1=2\) and \(\mu _0=-2\) and a common variance \(\sigma ^2=1\). To approximate the entropy functions we simulate two samples \(X_1,\dots ,X_n\) and \(Y_1,\dots ,Y_m\) from these distributions with \(n=m=50\) IID data in each group. The approximations of the entropies obtained with these samples and the exact PDF are
and
The (common) exact value is \(0.5+0.5\log (2\pi )=1.418939\). The entropy in the mixed population can be estimated in a similar way with
where \(f=0.5f_1+0.5f_0\). In this case, the mixed population has more uncertainty than the subpopulations, that is, \(H(X)>H(X|G=i)\) for \(i=0,1\). So we can consider the entropy with two groups approximated from (11) as
Therefore, the division is effective since \(H(X)>H^{(2)}(X)\), and the approximated RED index is
This value, close to 1, indicates that the groups are well separated (as expected).
As \(\mu _1>\mu _0\) we can consider again the region \(R_1=[t,\infty )\) for the classification in the first group. Clearly, by using the maximum likelihood criterion, we get the optimal region \(R_1=[0,\infty )\) (see Fig. 2, left). To apply the minimum entropy criterion, we consider the function H(t) approximated with (12), obtaining the plot given in Fig. 2, right. The minimum of this function is \(\hat{t}=-0.01138\), a value close to the expected one (\(t=0\)).
However, if we use the criterion based on the empirical entropies \(\widehat{H}_1\) and \(\widehat{H}_0\) with known means and variances, we get \(t=-0.00205\), which is very close to the value obtained with the maximum likelihood criterion. Their plots can be seen in Fig. 3, left. If we replace \(n+1\) with n we get \(t=0\). If the exact means and variances are replaced by their estimations from the samples we get \(t=0.06873\). We omit the plot since it is very similar to the one in Fig. 3, left.
In the second case, we consider \(\mu _1=\mu _0=0\) and \(\sigma _1^2=1<\sigma _0^2=4\). Then we can use the region \(R_1=[-t,t]\) for the classification in the first group. The estimations of the entropies in the groups and in the mixed population are \(H(X|G=1)\approx 1.339576\), \(H(X|G=0)\approx 2.024522\) and \(H(X)\approx 1.762427\). As in the first example, the entropy (uncertainty) in the second group is larger than that in the mixed population (since the first population reduces the uncertainty). The approximation for the entropy with two groups is \(H^{(2)}(X)\approx 1.682049\). It reduces a little the global entropy H(X) of the mixed population, and the RED index is 0.1159609. By using \(\widehat{H}(t)\) we get the region \(R_1=[-1.02890,1.02890]\), see Fig. 3, right.
If we estimate \(H_1(t)\) and \(H_0(t)\), we obtain the plots given in Fig. 4 by using n (left) or \(n+1\) (right). Note that in this case the results are very different. In the first case the optimal region is \(R_1=[-1.35956,1.35956]\) (that coincides with the region of the maximum likelihood criterion) while in the second \(R_1=[-0.14788,0.14788]\). The total misclassification probabilities are 0.3386627 and 0.4706897, respectively. The first value is actually the minimum error.
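The threshold and the minimum error quoted for this case can be verified in closed form (our own check, using the standard normal CDF via the error function): solving \(f_1(t)=f_0(t)\) gives \(t^2=\tfrac{8}{3}\log 2\), i.e. \(t\approx 1.35956\):

```python
import math

t = math.sqrt((8.0 / 3.0) * math.log(2.0))  # ML boundary, ~ 1.35956

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# R1 = [-t, t]; group 1 is N(0,1), group 0 is N(0,4)
err1 = 2.0 * (1.0 - Phi(t))      # group 1 value falls outside R1
err0 = 2.0 * Phi(t / 2.0) - 1.0  # group 0 value falls inside R1
total = 0.5 * err1 + 0.5 * err0  # minimum total error, ~ 0.3386627
```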
In the next example we show how to work with a real data set with four numerical variables and three groups.
Example 3
Let us consider the iris data set available in the statistical program R. It contains the values in four variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) measured in 150 iris flowers from three different species: setosa (\(G=1\)), versicolor (\(G=2\)), and virginica (\(G=3\)). There are 50 data points from each species, with \(\textbf{X}_1,\dots ,\textbf{X}_{50}\in G_1\), \(\textbf{X}_{51},\dots ,\textbf{X}_{100}\in G_2\), and \(\textbf{X}_{101},\dots ,\textbf{X}_{150}\in G_3\). For another analysis of this data set using Deng extropy see Buono and Longobardi (2020).
Let us assume a Normal (Gaussian) distribution for these data in each group. As we know that there are three groups, we proceed as follows:
-
We estimate the means and variance-covariance matrices in each group (by using only the data in each group).
-
We use them to estimate the PDF \(f_i\) in each group by using normal PDF with these parameter values. The estimations of the respective PDF are represented by \(\widehat{f}_i\) for \(i=1,2,3\).
-
We approximate the entropies in the groups from (7) and in analogy with (1) for the multivariate case.
By using this procedure we obtain the following entropy values:
and
These entropies show that the values of the flowers from the third group are more dispersed.
Analogously, we can estimate the PDF of the mixed population with \(\widehat{f}=(\widehat{f}_1+\widehat{f}_2+\widehat{f}_3)/3.\) We use this function to estimate the entropy of all the data (mixed population) with
Note that \(\widehat{H}(\textbf{X})>\widehat{H}(\textbf{X}|G=i)\) for \(i=1,2,3\) (although it is close to \(\widehat{H}(\textbf{X}|G=3)\)). This may confirm the existence of the three groups.
We can also compare this entropy with the entropy without groups estimated as
where \(f_{wg}\) is the normal PDF with the mean and the variance-covariance matrix estimated with all the data together. As \(\widehat{H}_{wg}(\textbf{X})\gg \widehat{H}(\textbf{X})\), this fact confirms the existence of the three groups.
Next we compare it with the entropy with three groups defined as in (4) with
where \(p_i=\Pr (G=i)\) for \(i=1,2,3\) are the prior probabilities. By assuming \(p_i=1/3\) for \(i=1,2,3\), we estimate it as
As \(\widehat{H}^{(3)}(\textbf{X})<\widehat{H}(\textbf{X})\), this fact might also confirm that the uncertainty is reduced by considering three groups. Hence
and \(RED(3)=0.9669213\). This value confirms that the three groups can be separated.
We might wonder what happens if we just consider two groups. Note that in this case the estimation for \(\widehat{H}(\textbf{X})\) also changes (since we estimate f in a different way). The most efficient option is to join groups two and three. The entropy of the new group is then 1.638049, obtaining \(\widehat{H}^{(2)}(X)=0.7927237\), \(\widehat{H}(X)=1.429234\) and
Hence \(RED(2)=0.9182897<RED(3)=0.9669213\). With the other groups we get \(Eff^{(2)}(X)\approx 0.5660699\) (join groups one and two) or 0.5355315 (join groups one and three). Therefore it is not a good idea to join these groups and it is better to consider the three initial groups.
We could also study what happens by considering just the first two groups (which are the least dispersed) and including the data of group 3 in group 1 or 2. By applying the maximum likelihood criterion to do so, all the data from group 3 go to group 2, and so the result is the same as that stated above with \(Eff^{(2)}(X)\approx 0.6365099\).
If we just consider the first 100 data points, which belong to groups 1 and 2, then we get \(\widehat{H}(X)=0.34348\), \(\widehat{H}^{(2)}(X)=-0.3496672\) and \(RED(2)=1\). Therefore, these two groups are completely separated. This is not the case if we just consider the data from groups 2 and 3, where we get \(\widehat{H}(X)=0.8302966\), \(\widehat{H}^{(2)}(X)=0.6854083\) and \(RED(2)=0.2090297\). Therefore, these two groups are mixed.
The PCA plot with the first two principal components for these three groups can be seen in Fig. 5. Note that our conclusions based on the RED index are consistent with the different groups in that figure.
If we want to use these entropy functions to classify a new flower with measures \(\textbf{z}=(z_1,z_2,z_3,z_4)\), we just compute the entropy \(\widehat{H}^{(3)}\) by assuming that \(\textbf{z}\) belongs to each group. Then it is classified in the group with the minimum entropy. However, it is not easy to determine the classification regions in \(\mathbb {R}^4\) obtained with this criterion for each group.
For example, for the first flower in this data set with \(\textbf{z}=(5.1,3.5,1.4,0.2)\) and replacing \(n+1\) with n, we get the approximations
and
Hence with the minimum entropy criterion it is classified (correctly) in the first group. As the sample sizes of the groups in the training sample coincide and the prior probabilities are equal, this classification criterion is equivalent to the maximum likelihood criterion (under normality) and to the classical Quadratic Discriminant Analysis (QDA) since we have used the normal PDF. This is not the case if the prior probabilities are unequal. It is also different if the PDF of the groups are estimated with nonparametric techniques.
If we do not replace \(n+1\) with n, we get the estimations
and
where \(\textbf{z}\) is not used to compute \(\widehat{f}_i\) (i.e. to compute the mean and the covariance matrix of group i). Again, it is classified correctly in the first group. These entropy values play a role similar to that played by the posterior probabilities in the classical QDA, showing the “reliability” (margins) of these classifications. Note that adding just one data point to the wrong group might increase the entropy considerably. We do the same with all 150 flowers of the data set. If we replace \(n+1\) with n, the classification is correct in 147 cases. In particular, it fails for two flowers in the second group (classified in the third group) and one in the third group (classified in the second group). If we do not replace \(n+1\) with n, 147 flowers are again classified correctly, but the three failures occur for flowers in the second group that are classified in the third one. In this case, the group with the largest entropy may attract more data (since its values are more dispersed).
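The whole-sample classification just described can be sketched as follows (our own code; scikit-learn is used only to load the iris data, and the per-species Gaussians are fitted by hand). With equal priors and the \(n+1\to n\) simplification, this is the maximum likelihood (QDA-type) rule:

```python
import numpy as np
from sklearn.datasets import load_iris  # used only to load the data

X, g = load_iris(return_X_y=True)  # 150 flowers, species labels 0, 1, 2

def log_mvn(z, mu, S):
    # log density of a multivariate normal N(mu, S) at the point z
    k = len(mu)
    d = z - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(S, d))

# one Gaussian per species, estimated from that species' 50 flowers
params = [(X[g == i].mean(axis=0), np.cov(X[g == i], rowvar=False))
          for i in range(3)]

pred = np.array([max(range(3), key=lambda i: log_mvn(z, *params[i]))
                 for z in X])
correct = int((pred == g).sum())  # the paper reports 147 of 150 correct
```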
In the last example, as in Biernacki et al. (1999), we consider bivariate Gaussian distributions to study the evolution of the RED index when we change the means.
Example 4
Let us consider a mixture model with equal proportions of two Gaussian distributions. We simulate a sample of size 100 from a bivariate normal distribution with mean \(\mu _1=(0,0)\) and variance-covariance matrix \(\Sigma _1=I_2\), that is, \(\textbf{X}_1,\dots ,\textbf{X}_{100}\in G_1\), and samples of size 100 from bivariate normal distributions with mean \(\mu _2=(d,0)\) and variance-covariance matrix \(\Sigma _2=I_2\), varying d from 0 to 5 in steps of 0.1, i.e., \(\textbf{X}_{101},\dots ,\textbf{X}_{200}\in G_2\). We use the data in each group to estimate the means and the variance-covariance matrices and then to obtain the estimated PDFs \(\widehat{f}_1\) and \(\widehat{f}_2\) by using normal distributions with these parameters. Then, we estimate the PDF \(\widehat{f}\) of the mixed population by the arithmetic mean of the estimated PDFs (since we are assuming a mixture model with equal proportions). Thus, we can estimate the entropies in the groups by
and the entropy of the mixed population with
Then, we estimate \(H^{(2)}(\textbf{X})\) by
and the relative efficiency of the division in two groups as
The results are shown in Fig. 6, left, as a function of d (black points). Moreover, we can estimate the mean and the variance-covariance matrix without assuming the existence of groups and then obtain an estimate of the entropy without groups as
Hence, we compare the values of \(\widehat{H}_{wg}(\textbf{X})\) and \(\widehat{H}(\textbf{X})\) and find that the former is lower than the latter only for d equal to \(0.1, \ 0.2,\ 0.4, \ 0.6\) and 1.9, confirming the existence of two groups as d increases. Further, we may suppose the existence of a third group and divide the data of the second group into two groups of 50 observations, that is, \(\textbf{X}_{101},\dots ,\textbf{X}_{150}\in G_2\) and \(\textbf{X}_{151},\dots ,\textbf{X}_{200}\in G_3\). In analogy with what we have done above, we estimate the mean and variance-covariance matrices of the new groups and then the entropies of the groups. In the mixture, the second and the third group have a weight of 0.25, so the estimate of \(H^{(3)}(\textbf{X})\) is given by
and the relative efficiency of the division in three groups is
In Fig. 6, left, we also plot the values of RED(3) (red points) as a function of d and compare them with the values of RED(2) (black points). We note that, as expected, the values of RED(2) generally dominate those of RED(3), with RED(3) slightly higher than RED(2) only for small values of d (\(0, \ 0.1, \ 0.2, \ 0.3, \ 0.4 \) and 0.5). In Fig. 6, right, we plot the samples for \(d=3\). Note that the value \(RED(2)=0.7347749\) for \(d=3\) allows us to detect the existence of the two groups even when they are really close.
We repeat the same experiment by choosing \(\mu _2=(d,d)\), varying d from 0 to 5 in steps of 0.1. In this case, the value of the entropy without groups is lower than the value of the entropy with two groups for \(d\in \{0, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7\}\). Moreover, we again consider the possibility of dividing the second group into two groups and obtain a value of RED(3) higher than RED(2) only for \(d=0.1\). The results are shown in Fig. 7, left, where we also plot the samples (right) for \(d=2\). Again, the value \(RED(2)=0.7930995\) for \(d=2\) shows the existence of the two groups even when they are really close.
By comparing the values of RED(2) in Fig. 7, left, and in Fig. 6, left, it is possible to observe a faster convergence to one in the case \(\mu _2=(d,d)\), due to the greater distance between the means of the mixed populations.
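The simulation in Example 4 can be approximated with a short Monte Carlo sketch. Since the displayed formulas were not reproduced above, we assume here that RED(2) normalizes \(\widehat{H}(\textbf{X})-\widehat{H}^{(2)}(\textbf{X})\) by the entropy of the group proportions (\(\log 2\) for equal weights); the function name and this normalization are our assumptions, consistent with the 0-1 range of the values reported in the text:

```python
import numpy as np

def mixture_red2(mu1, mu2, dim=2, n=100_000, seed=0):
    # Monte Carlo sketch of RED(2) for an equal-weight mixture of two
    # Gaussians N(mu_i, I_dim). H^(2) is known in closed form; H(X) is
    # estimated as -mean(log f(X)) over a sample drawn from the mixture.
    rng = np.random.default_rng(seed)
    mus = np.array([mu1, mu2], dtype=float)
    labels = rng.integers(0, 2, n)
    x = rng.normal(size=(n, dim)) + mus[labels]

    def logpdf(x, mu):  # log-density of N(mu, I_dim)
        return -0.5 * np.sum((x - mu) ** 2, axis=1) - 0.5 * dim * np.log(2 * np.pi)

    # log f(x) with f = (f_1 + f_2) / 2
    logf = np.logaddexp(logpdf(x, mus[0]), logpdf(x, mus[1])) - np.log(2)
    h_mix = -logf.mean()                               # estimate of H(X)
    h_within = 0.5 * dim * np.log(2 * np.pi * np.e)    # H^(2): both groups equal
    return (h_mix - h_within) / np.log(2)              # assumed normalization
```

When the two means coincide the index is near 0, and as the separation grows it approaches 1, mirroring the behavior of the black points in Fig. 6.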
5 Application to Variable Selection in Omic Data
In this section we study the performance of the proposed RED index when applied to variable selection in biological omic data. One of the main characteristics of omic data sources is the high dimensionality of the data sets, due to the development of high-throughput technologies that allow the simultaneous monitoring of hundreds or thousands of biological variables from different layers of biological information such as genes, proteins, RNA and metabolites. The data sources generated by these technologies have given rise to the so-called omic data sources, as well as to the need for ad hoc exploratory data analysis tools for such high-dimensional data. One of the challenges posed by biologists and geneticists is the identification of the most informative omic variables for explaining a specific clinical outcome, such as a disease, its evolution, or the response of patients to a specific drug. Hence, the challenge is to carry out variable selection to identify the variables that discriminate the outcome and, as a result, to eliminate the noisy inputs. In this section we show how the RED index can be used as a tool for variable selection when applied to a well-known microarray gene expression colon cancer data set.
The genomic study consists of gene expression levels for 40 tumor and 22 normal tissue samples collected with the Affymetrix oligonucleotide Hum-6000 array, complementary to more than 6500 human genes, from which only the 2000 genes with the highest minimal intensity across samples are retained, see Alon et al. (1999). Hence, we end up with a data set containing the expression levels of 2000 genes, arranged in a matrix with 2000 columns and 62 rows, along with a clinical outcome related to the status of each tissue sample: tumor versus healthy. This gene expression data set is a classic in the literature and can be downloaded from the R package colonCA, see Sylvia (2019).
Data preprocessing, consisting of a robust normalization of the gene expression measures following previous work by Arevalillo and Navarro (2013), is carried out first. Then the RED index is estimated for the two-group case (tumor and healthy outcomes) in order to generate a ranking that sorts the genes according to their relevance for discriminating the clinical outcome. The results are provided by the gene ranking in Fig. 8, which displays the whole ranking (left) and the top 13 genes with \(RED>0.5\) (right).
The genes at the top of the ranking have been previously described as relevant biomarkers of colon cancer. Table 1 shows the Hsa identifiers and the gene descriptions of the top genes having RED greater than 0.5.
The genes with identifiers Hsa.8147, Hsa.692, Hsa.692.1 and Hsa.692.2 exhibit a high degree of co-expression, as measured by correlation coefficients around 0.9. We now assess the RED score obtained by considering pairs of genes, in order to elucidate whether their joint behavior has a stronger impact than their marginal behavior at discriminating the clinical outcome. As highly correlated genes convey redundant expression measures, we only consider gene pairings with correlations below the 0.90 threshold when estimating their RED scores; this is achieved by selecting the pairwise gene associations corresponding to the top RED pairings having correlations lower than 0.90.
The scatter plots in Fig. 9 show the selected gene pairings; in all cases the RED score is higher than the individual RED values previously obtained for the genes Hsa.36689, Hsa.692.1, Hsa.8147 and Hsa.2456, given by 0.605, 0.737, 0.739 and 0.513 respectively. Note that with just two genes, Hsa.36689 and Hsa.692.1, we get a RED index equal to 0.974 and a very good separation of the two groups. The other pairs also show high RED indices, which reflect the strong bivariate differential expression patterns depicted by the scatter plots of Fig. 9.
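A univariate gene ranking of this kind can be sketched as follows. This is not the authors' pipeline: the function names, the plug-in Gaussian fit per group, the Monte Carlo estimate of the mixture entropy, and the normalization by the entropy of the group proportions are all our assumptions:

```python
import numpy as np

def red_score_1d(x, y, n_mc=50_000, seed=0):
    # RED sketch for one variable x and a binary outcome y in {0, 1}:
    # fit a Gaussian to each group, then estimate the mixture entropy
    # by Monte Carlo sampling from the fitted two-component model.
    x0, x1 = x[y == 0], x[y == 1]
    p1 = len(x1) / len(x)
    m = np.array([x0.mean(), x1.mean()])
    s = np.array([x0.std(ddof=1), x1.std(ddof=1)])
    w = np.array([1 - p1, p1])
    # within-group entropy H^(2) = sum_i w_i * H(N(m_i, s_i^2))
    h2 = np.sum(w * 0.5 * np.log(2 * np.pi * np.e * s ** 2))
    rng = np.random.default_rng(seed)
    comp = rng.choice(2, n_mc, p=w)
    z = rng.normal(m[comp], s[comp])
    dens = np.sum(
        w * np.exp(-0.5 * ((z[:, None] - m) / s) ** 2) / (s * np.sqrt(2 * np.pi)),
        axis=1,
    )
    h_mix = -np.mean(np.log(dens))
    # normalize by the entropy of the group proportions (assumed form)
    h_w = -np.sum(w * np.log(w))
    return (h_mix - h2) / h_w

def rank_by_red(X, y):
    # Rank the columns (genes) of X by decreasing univariate RED score.
    scores = [red_score_1d(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]
```

Applied to the 2000 preprocessed expression columns with the tumor/healthy labels, a routine like `rank_by_red` would produce a ranking analogous to the one displayed in Fig. 8.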
6 Conclusions
We have provided new tools based on Shannon entropy to study data from a population with groups. This paper is just a first step, and the potential applications are numerous. The main tool is the RED index; the illustrative examples show that it is a good measure of the separation of the groups. Its main advantage is that it does not depend on the number of unknown parameters in the model. The new classification techniques also lead to promising results (similar to those obtained with classical discrimination measures).
There are several tasks for future research. The main one could be to apply these tools to cluster analysis (unsupervised techniques) in order to determine the optimal number of clusters in a population. Applications to specific data sets in different research areas are natural next steps.
Data Availability
Not applicable.
Code Availability
Not applicable.
References
Ahmadi J, Di Crescenzo A, Longobardi M (2015) On dynamic mutual information for bivariate lifetimes. Adv Appl Probab 47:1157–1174
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96:6745–6750. https://doi.org/10.1073/pnas.96.12.6745
Arevalillo JM, Navarro H (2013) Exploring correlations in gene expression microarray data for maximum predictive-minimum redundancy biomarker selection and classification. Comput Biol Med 43:1437–1443
Balakrishnan N, Buono F, Longobardi M (2022) On cumulative entropies in terms of moments of order statistics. Methodol Comput Appl Probab 24:345–359
Biernacki C, Celeux G, Govaert G (1999) An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recogn Lett 20:267–272
Briët J, Harremoës P (2009) Properties of classical and quantum Jensen-Shannon divergence. Phys Rev A 79:283–304
Buono F, Longobardi M (2020) A dual measure of uncertainty: the Deng extropy. Entropy 22:582. https://doi.org/10.3390/e22050582
Celeux G, Soromenho G (1996) An entropy criterion for assessing the number of clusters in a mixture model. J Classif 13:195–212
Cover TM, Thomas JA (2006) Elements of Information Theory, 2nd edn. Wiley, Hoboken, NJ, USA
Di Crescenzo A, Longobardi M (2002) Entropy-based measure of uncertainty in past lifetime distributions. J Appl Probab 39:434–440
Di Crescenzo A, Longobardi M (2006) On weighted residual and past entropies. Sci Math Jpn 64(2):255–266
Di Crescenzo A, Paolillo L, Suárez-Llorens A (2021) Stochastic comparisons, differential entropy and varentropy for distributions induced by probability density functions. https://doi.org/10.48550/arXiv.2103.1108
Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. Proc Adv Neural Inf Process Syst 529–536
Melbourne J, Talukdar S, Bhaban S, Madiman M, Salapaka MV (2022) The differential entropy of mixtures: New bounds and applications. IEEE Trans Inf Theory 68:2123–2146
Moshksar K, Khandani AK (2016) Arbitrarily tight bounds on differential entropy of Gaussian mixtures. IEEE Trans Inf Theory 62:3340–3354
Rao M, Chen Y, Vemuri B, Wang F (2004) Cumulative residual entropy: a new measure of information. IEEE Trans Inf Theory 50:1220–1228
Rényi A (1961) On measures of information and entropy. In: Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability pp 547–561
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
Sylvia M (2019) colonCA: exprSet for Alon et al. (1999) colon cancer data. R package version 1.28.0
Tsallis C (1988) Possible generalization of Boltzmann-Gibbs statistics. J Stat Phys 52:479–487
Acknowledgements
The authors thank an anonymous reviewer for several helpful suggestions.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. JN is partially supported by “Ministerio de Ciencia e Innovación” of Spain under grant PID2019-103971GB-I00/AEI/10.13039/501100011033. FB is member of the research group GNAMPA of INdAM (Istituto Nazionale di Alta Matematica) and is partially supported by MIUR-PRIN 2017, project “Stochastic Models for Complex Systems”, no. 2017 JFFHSH. JMA acknowledges NextGenerationEU support.
Author information
Contributions
The authors contributed equally to this work.
Ethics declarations
Ethics Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Conflict of Interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix. Proofs
Proof of Proposition 1
If \(S_1\) and \(S_0\) are the respective supports of \(f_1\) and \(f_0\), the entropy of the two groups can be written as
Therefore
Hence \(Eff^{(2)}(X)\ge 0\) since the KL-measure is non-negative.
To get the upper bound we note that
where the inequality holds since \(0\le pf_1(x)\le f(x)\). Analogously, it can be proved that \(KL (f_0|f)\le -\log (1-p)\). Hence, from (5), we get
Moreover, if \(f_1\ne f_0\) (a.e.), then \(f_i\ne f\) (a.e.) and \(KL(f_i|f)>0\) for \(i=0,1\). Then \(Eff^{(2)}(X)>0\) for all \(p\in (0,1)\).
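The displayed equations of this proof did not survive extraction. As a sketch, the decomposition they presumably express, consistent with the surrounding steps and with the mixture density \(f=pf_1+(1-p)f_0\), is

```latex
\begin{align*}
H^{(2)}(X) &= p\,H(X\mid G=1) + (1-p)\,H(X\mid G=0)\\
           &= -\,p\int_{S_1} f_1\log f_1\,dx \;-\;(1-p)\int_{S_0} f_0\log f_0\,dx,\\
Eff^{(2)}(X) &= H(X)-H^{(2)}(X)
            = p\,KL(f_1\mid f)+(1-p)\,KL(f_0\mid f)\;\ge\;0,\\
KL(f_1\mid f) &= \int_{S_1} f_1\log\frac{f_1}{f}\,dx
 \;\le\; \int_{S_1} f_1\log\frac{f_1}{p\,f_1}\,dx \;=\; -\log p,
\end{align*}
```

where the last inequality uses \(f(x)\ge p f_1(x)\) on \(S_1\), matching the bound stated in the proof.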
Proof of Proposition 2
From the definition we have
On the other hand, from Proposition 1, we get
Replacing \(H(X|G=0)\) with this expression we get \(H^{(2)}(X)\ge H^{(3)}(X)\). Hence, the result for the efficiency also holds. The bounds are obtained as in Proposition 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Navarro, J., Buono, F. & Arevalillo, J.M. A New Separation Index and Classification Techniques Based on Shannon Entropy. Methodol Comput Appl Probab 25, 78 (2023). https://doi.org/10.1007/s11009-023-10055-w