Abstract
Asymptotic unbiasedness and L2-consistency are established for various statistical estimates of mutual information in the framework of mixed models. Such models are important, e.g., for the analysis of medical and biological data. The study relies essentially on the conditional Shannon entropy and on new results concerning statistical estimation of the differential Shannon entropy. The theoretical results are complemented by computer simulations for a logistic regression model with various parameters. The numerical experiments demonstrate that the new statistics proposed by the authors have certain advantages.
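For reference, in the mixed model under consideration (a vector X with density f in \(\mathbb{R}^{d}\) paired with a discrete label Y taking finitely many values in a set M, cf. Nair et al. 2007), the estimated quantity can be written as

$$ I(X;Y) \;=\; \sum_{y\in M}\,{\int\limits}_{\mathbb{R}^{d}} f_{X,Y}(x,y)\,\log\frac{f_{X,Y}(x,y)}{f(x)\,{\textsf{P}}(Y=y)}\,dx, $$

which equals the differential entropy of X minus the conditional entropy of X given Y; the notation \(f_{X,Y}\) follows the Appendix.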
References
Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343
Bennasar M, Hicks Y, Setchi R (2014) Feature selection using joint mutual information maximisation. Expert Syst Appl 42:8520–8532
Berrett TB, Samworth RJ, Yuan M (2016) Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv:1606.00304
Biau G, Devroye L (2015) Lectures on the nearest neighbor method. Springer-Verlag, New York
Bulinski A, Dimitrov D (2019a) Statistical estimation of the Kullback-Leibler divergence. arXiv:1907.00196
Bulinski A, Dimitrov D (2019b) Statistical estimation of the Shannon entropy. Acta Math Sinica English Series 35:17–46
Bulinski A, Kozhevin A (2017) Modification of the MDR-EFE method for stratified samples. Stat Optim Inf Comput 5:1–18
Bulinski A, Kozhevin A (2018) Statistical estimation of conditional Shannon entropy. ESAIM Probab Stat, pp 1–35, published online 28 November 2018. https://doi.org/10.1051/ps/2018026
Coelho F, Braga A, Verleysen M (2016) A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int J Comput Intell Syst 9(4):726–733
Delattre S, Fournier N (2017) On the Kozachenko-Leonenko entropy estimator. J Stat Plan Inference 185:69–93
Doquire G, Verleysen M (2012) A comparison of mutual information estimators for feature selection. In: Proc. of the 1st international conference on pattern recognition applications and methods, pp 176–185. https://doi.org/10.5220/0003726101760185
Favati P, Lotti G, Romani F (1991) Algorithm 691: Improving QUADPACK automatic integration routines. ACM Trans Math Softw 17(2):218–232
Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555
Gao W, Kannan S, Oh S, Viswanath P (2017) Estimating mutual information for discrete-continuous mixtures. In: 31st conference on neural information processing systems (NIPS), Long Beach, CA, USA, pp 1–12
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Amer Statist Assoc 58(301):13–30
Kozachenko L, Leonenko N (1987) Sample estimate of the entropy of a random vector. Probl Inf Transm 23:95–101
Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066138
Macedo F, Oliveira R, Pacheco A, Valadas R (2019) Theoretical foundations of forward feature selection methods based on mutual information. Neurocomputing 325:67–89
Massaron L, Boschetti A (2016) Regression analysis with Python. Packt Publishing Ltd., Birmingham
Nair C, Prabhakar B, Shah D (2007) Entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075v2
Novovičová J, Somol P, Haindl M, Pudil P (2007) Conditional mutual information based feature selection for classification task. In: Rueda L, Mery D, Kittler J (eds) CIARP 2007, LNCS, vol 4756. Springer-Verlag, Berlin, pp 417–426
Paninski L (2003) Estimation of entropy and mutual information. Neural Comput 15:1191–1253
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238
Vergara J, Estévez P (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24:175–186. https://doi.org/10.1007/s00521-013-1368-0
Verleysen M, Rossi F, François D (2009) Advances in feature selection with mutual information. arXiv:0909.0635v1
Yeh J (2014) Real analysis: Theory of measure and integration, 3rd edn. World Scientific, Singapore
Acknowledgements
The work of A. Bulinski is supported by the Russian Science Foundation under grant 19-11-00290 and performed at the Steklov Mathematical Institute of the Russian Academy of Sciences. The theoretical results were established jointly by A. Bulinski and A. Kozhevin; the simulations were carried out by A. Kozhevin. The authors are grateful to the Reviewer for the remarks concerning the book by G. Biau and L. Devroye.
Appendix
Proof of Lemma 1 Formula Eq. 21 becomes evident if we consider the Bernoulli scheme with success probability P(Y = y). One has \(\{N_{y,n} = 0\} = \{Y_{q}\neq y,\ q = 1,\ldots,n\}\) and, for k = 1,…,n,
$$ \{N_{y,n} = k\} \;=\; \bigcup_{I}\Big(\{Y_{i} = y,\ i\in I\}\cap\{Y_{q}\neq y,\ q\in\{1,\ldots,n\}\setminus I\}\Big), $$
where the union \(\bigcup_{I}\) is taken over all I = {i1,…,ik} such that 1 ≤ i1 < … < ik ≤ n.
Thus, for \(m\in \mathbb {N}\), m > k, and any \(B_{i} \in {\mathscr{B}}(\mathbb {R}^{d})\), i = 1,…,m,
where, for p ≥ n + m − k, \({\sum }_{J}\) denotes the sum over all J having the form
For k = 0, the sum \({\sum}_{J}\) is taken over i1,…,im such that n < i1 < … < im. If J = {i1,…,im}⊂{1,…,p}, where im = p, then \(\{{j_{1}^{y}} = i_{1}, \ldots, {j_{m}^{y}} = p\} = \{Y_{i} = y,\ i \in J\} \cap \{Y_{q} \neq y,\ q \in \{1,\ldots,p\}\setminus J\}\). For p, n, m and k under consideration, the number of all such collections J = J(p, n, m, k) equals \(\binom{n}{k}\binom{p-n-1}{m-k-1}\). Consequently,
where we took into account that (X1, Y1), (X2, Y2),… are i.i.d. random vectors having the same distribution as (X, Y). Set l = p − (n + m − k). Then
since \({\sum }_{l=0}^{\infty }\binom {l+m-k-1}{l}{\textsf {P}}(Y \ne y)^{l}{\textsf {P}}(Y=y)^{m-k}=1\) for the negative binomial distribution. Thus
The latter relation yields all the statements of the Lemma when m > k.
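For completeness, the series identity invoked above is simply the normalization of the negative binomial law. Writing p = P(Y = y) and q = P(Y ≠ y) = 1 − p, the generalized binomial series gives

$$ \sum_{l=0}^{\infty}\binom{l+m-k-1}{l}\,q^{l}\,p^{m-k} \;=\; p^{m-k}\,(1-q)^{-(m-k)} \;=\; p^{m-k}\,p^{-(m-k)} \;=\; 1. $$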
Now let 1 ≤ m ≤ k, where k ∈ {0,…,n}. Then necessarily k > 0, and hence
This equality implies the validity of all the assertions of the Lemma in the case considered. □
Proof of Lemma 3 We can take c = 1 without loss of generality. Indeed, I(X; Y) = I(X; aY) for any a ≠ 0 since, for each one-to-one mapping ϕ: M → T with ♯(M) = ♯(T) = m, one has I(X; ϕ(Y)) = I(X; Y). Condition (B) is valid in view of Corollary 1 of Bulinski and Kozhevin (2018). Consider (C). One has \(P_{X} \ll \mu\) and, consequently, for any \(x\in \mathbb{R}^{d}\) and y ∈ M,
Moreover,
Here we employed that \(0 \leq f_{Y|X}(y|x) \leq 1\) for any y ∈ M and μ-almost all \(x \in \mathbb{R}^{d}\), and that \(0 \leq f_{Y}(y) = {\textsf{P}}(Y = y) \leq 1\). Thus, (C) is valid.
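For orientation, the logistic model dealt with here has the following conditional law of Y given X = x; these expressions are our reconstruction from the formulas exploited below (see also the function \(g_{1}\) in the proof of Lemma 4):

$$ f_{Y|X}(1\mid x) = \frac{1}{1 + e^{-(w,x)-b}}, \qquad f_{Y|X}(0\mid x) = \frac{1}{1 + e^{(w,x)+b}}, \qquad g_{y}(x) = f_{Y|X}(y\mid x)\,f(x), $$

so the ratios \(g_{y}(x)/g_{y}(z)\) bounded below reduce to the logistic ratios treated in the next displays.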
Corollary 2.7 in Bulinski and Dimitrov (2019b) implies that, for each ν > 2, ε > 0, one has \(L_{f}(\nu )<\infty \) and \(G_{f}(\varepsilon )<\infty \). It remains to show that \(T_{g_{y}}(\varepsilon ) < \infty \) for y ∈ M.
According to formula (4.17) of Bulinski and Dimitrov (2019b)
where \(\lambda_{min} > 0\) is the minimal eigenvalue of Σ. Immediately one obtains
First we find a lower bound for \(\frac {1 + e^{-(w,x)-b}}{1 + e^{-(w,z)-b}}\) when ∥x − z∥ ≤ r. Set u = −(w, x) − b. Then
In a similar way \(\frac {1 + e^{(w,x)+b}}{1 + e^{(w,z)+b}} \geq e^{-\|w\|r}\) whenever ∥x − z∥≤ r. Hence, for y ∈ M, R > 0, r ∈ (0,R) and \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ r, one has
Consequently, for \(C = e^{- \|w\| R -\frac {R^{2}}{2 \lambda _{min} }}\), y ∈ M, \(x \in \mathbb {R}^{d}\) and 0 < r < R,
For each ε ∈ (0, 1) and y ∈ M,
hence, \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y)\, dx < \infty. \) Due to Lemma 2.5 of Bulinski and Dimitrov (2019b), we can claim that \(T_{g_{y}}(\varepsilon ):=T_{g_{y}}(\varepsilon ,\varepsilon )<\infty \) for all ε small enough. The proof of the Lemma is complete. □
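The logistic model treated in Lemma 3 is also the one used for the simulations mentioned in the Abstract. The following minimal sketch (not the authors' exact statistics; the parameter values, the choice Σ = I, and all helper names such as kl_entropy are ours) samples the mixed pair (X, Y) with Gaussian X and a logistic conditional law for Y, and plugs Kozachenko-Leonenko entropy estimates (Kozachenko and Leonenko 1987) into the representation I(X; Y) = H(X) − ∑_y P(Y = y) H(X | Y = y):

```python
# Illustrative plug-in estimate of I(X;Y) in the mixed model of Lemma 3.
# Assumptions: Gaussian X with Sigma = I, logistic link with our own (w, b);
# this is a sketch, not the estimator analyzed in the paper.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(sample, k=1):
    """Kozachenko-Leonenko estimate of differential entropy (in nats)."""
    n, d = sample.shape
    dist, _ = cKDTree(sample).query(sample, k=k + 1)       # k-th NN, self excluded
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(dist[:, k]))

rng = np.random.default_rng(0)
d, n = 3, 20000
w, b = rng.normal(size=d), 0.5
X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # density f
p1 = 1.0 / (1.0 + np.exp(-(X @ w) - b))                      # f_{Y|X}(1|x)
Y = (rng.random(n) < p1).astype(int)

# I(X;Y) = H(X) - sum_y P(Y=y) H(X | Y=y), all entropies estimated by kl_entropy
mi_hat = kl_entropy(X) - sum((Y == y).mean() * kl_entropy(X[Y == y]) for y in (0, 1))
print(f"plug-in estimate of I(X;Y): {mi_hat:.4f} nats")
```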
Proof of Lemma 4 As in the proof of Lemma 3, we infer that (C) is valid. Now we turn to (A). Obviously, \(f(x) \leq \frac {1}{(\pi \gamma )^{d}}\) for \(x \in \mathbb {R}^{d}\), therefore \(Q_{f}(\varepsilon ) < \infty \) for ε ∈ (0, 1). Let us prove that \(L_{f}(\nu ) < \infty \) for some ν > 2. We employ a linear change of variables in the integral representing \(L_{f}(\nu)\) and note that \(\left |\log \| y - x \| -\log \gamma \right |^{\nu } \leq 2^{\nu -1}(\left |\log \| y - x \|\right |^{\nu } + |\log \gamma |^{\nu })\). Then we can claim that \(L_{f}(\nu )<\infty \) if
Set \(u = \frac {y - x}{\sqrt {2}}, v = \frac {x + y}{\sqrt {2}}\). Then, for u = (u1,…,ud) and v = (v1,…,vd), we can study the convergence of the following integral:
since, for \(z \in \mathbb {R}\), \( I(z) = {\int \limits }_{\mathbb {R}} \frac {1}{\left (1 + \frac {(z - s)^{2}}{2} \right )} \frac {1}{\left (1 + \frac {(z + s)^{2}}{2} \right )} ds = \frac {\sqrt {2} \pi }{z^{2} + 2}. \) Introduce \(v_{i}={u_{i}^{2}}\), i = 1,…,d. Thus, it is enough to show that \(J(d)={\int \limits }_{(0,\infty )^{d}}h_{d}(v)dv<\infty \), where
It is easily seen that \(J(1)<\infty \). Consider d ≥ 2. One has J(d) = J1(d) + J2(d), where J1(d) and J2(d) are integrals of hd(v) taken over \(B_{1}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} <1\}\) and \(B_{2}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} \geq 1\}\), respectively. We write
where we use mathematical induction on d. In a similar way one can verify that \(J_{2}(d)<\infty \). The proof of the finiteness of J(d), for each \(d\in \mathbb {N}\) and ν > 2, is complete, so \(L_{f}(\nu ) < \infty \) for any ν > 2. Now we show that, for y ∈ M, \(x \in \mathbb {R}^{d}\), R > 0 and 0 ≤ r ≤ R, inequality (2) is valid, where C > 0 does not depend on x, y and r.
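The closed form of I(z) stated above can be checked by a standard residue computation (our verification; it is not needed for the argument). Rewriting the integrand as \(4\left(((s-z)^{2}+2)((s+z)^{2}+2)\right)^{-1}\), its poles in the upper half-plane are simple (for z ≠ 0) and located at \(s = \pm z + i\sqrt{2}\), whence

$$ I(z) \;=\; 2\pi i\left(\frac{1}{2i\sqrt{2}\,z\,(z+i\sqrt{2})} + \frac{1}{2i\sqrt{2}\,z\,(z-i\sqrt{2})}\right) \;=\; \frac{2\pi i}{i\sqrt{2}\,(z^{2}+2)} \;=\; \frac{\sqrt{2}\,\pi}{z^{2}+2}, $$

and the case z = 0 follows by continuity.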
Due to Eq. 44, for each y ∈ M and \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ r (r ≥ 0),
For \(d\in \mathbb {N}\) and \(x,z\in \mathbb {R}^{d}\), set \( F_{d}(x,z):= {\prod }_{i=1}^{d}\frac {1 + {x_{i}^{2}}}{1 + {z_{i}^{2}}}, \quad x=(x_{1},\ldots ,x_{d}),\ z=(z_{1},\ldots ,z_{d}). \) Now we will consider \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ γr. Take R < 1/γ. One has
Evidently, \(\frac {1}{s+1}\geq 1-s\) for s > − 1. Thus we get \( F_{d}(x,z)\geq {\prod }_{i=1}^{d} (a_{i} - b_{i}(z_{i}-x_{i})), \) where \(a_{i} = 1-\frac {(\gamma r)^{2}}{1+{x_{i}^{2}}} \geq 1 - (\gamma r)^{2} \geq 1-(\gamma R)^{2}\), \(b_{i} = \frac {2 x_{i}}{1 + {x_{i}^{2}}}\), i = 1,…,d. Using induction one can prove that, for any r ≥ 0, \(A_{i},B_{i}\in \mathbb {R}\), i = 1,…,d,
Let us return to Eq. 3. If ∥x − z∥ ≤ r ≤ R, where 0 < R < 1/γ, then
Hence Eq. 2 is valid with \( C = \left (1-(\gamma R)^{2}\right )^{d} e^{-\|w\|R}\). Therefore, for each ε ∈ (0, 1) and y ∈ M,
and \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y) dx < \infty \).
It remains to verify condition (B). The function \( g_{1}(x) = \frac {1}{1 + e^{-(w, x) - b}} f(x) \) is strictly positive. Moreover,
The function \(s(u) = \frac {u}{(1 + u^{2})^{2}}\) is continuous and \(\lim _{u \to \infty } s(u) = \lim _{u \to -\infty } s(u) = 0\), therefore \( \max \limits _{u \in \mathbb {R}} |u - \nu |\left (1 + ((u - \nu )/\gamma )^{2}\right )^{-1} < \infty . \) Thus \(\max \limits _{x \in \mathbb {R}^{d} }\|\nabla g_{1}(x)\| < \infty \) and \(g_{1}\) satisfies the Lipschitz condition with some constant C0 > 0 on \(\mathbb{R}^{d}\). According to Remark 1 in Bulinski and Kozhevin (2018) we conclude that \(g_{1}\) is C0-constricted. The same reasoning is valid for \(g_{0}\).
For i.i.d. random variables X1,…,Xd, the Minkowski inequality yields
Thus (B) is satisfied. The proof of the Lemma is complete. □