Abstract
Asymptotic unbiasedness and L2-consistency are established for various statistical estimates of mutual information in the framework of mixed models. Such models are important, e.g., for the analysis of medical and biological data. The study relies essentially on the conditional Shannon entropy and on new results concerning statistical estimation of the differential Shannon entropy. The theoretical results are complemented by computer simulations for a logistic regression model with various parameters. The numerical experiments demonstrate that the new statistics proposed by the authors have certain advantages.
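For reference, in the mixed model under consideration (a vector X with density f in \(\mathbb{R}^{d}\) paired with a discrete label Y taking finitely many values in a set M, cf. Nair et al. 2007), the estimated quantity can be written as

$$ I(X;Y) \;=\; \sum_{y\in M}\,{\int\limits}_{\mathbb{R}^{d}} f_{X,Y}(x,y)\,\log\frac{f_{X,Y}(x,y)}{f(x)\,{\textsf{P}}(Y=y)}\,dx, $$

which equals the differential entropy of X minus the conditional entropy of X given Y; the notation \(f_{X,Y}\) follows the Appendix.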
References
Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343
Bennasar M, Hicks Y, Setchi R (2014) Feature selection using joint mutual information maximisation. Expert Syst Appl 42:8520–8532
Berrett TB, Samworth RJ, Yuan M (2016) Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv:1606.00304
Biau G, Devroye L (2015) Lectures on the nearest neighbor method. Springer-Verlag, New York
Bulinski A, Dimitrov D (2019a) Statistical estimation of the Kullback-Leibler divergence. arXiv:1907.00196
Bulinski A, Dimitrov D (2019b) Statistical estimation of the Shannon entropy. Acta Math Sinica English Series 35:17–46
Bulinski A, Kozhevin A (2017) Modification of the MDR-EFE method for stratified samples. Stat Optim Inf Comput 5:1–18
Bulinski A, Kozhevin A (2018) Statistical estimation of conditional Shannon entropy. ESAIM Probab Stat, pp 1–35, published online 28 November 2018. https://doi.org/10.1051/ps/2018026
Coelho F, Braga A, Verleysen M (2016) A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int J Comput Intell Syst 9(4):726–733
Delattre S, Fournier N (2017) On the Kozachenko-Leonenko entropy estimator. J Stat Plan Inference 185:69–93
Doquire G, Verleysen M (2012) A comparison of mutual information estimators for feature selection. In: Proc. of the 1st international conference on pattern recognition applications and methods, pp 176–185. https://doi.org/10.5220/0003726101760185
Favati P, Lotti G, Romani F (1991) Algorithm 691: Improving QUADPACK automatic integration routines. ACM Trans Math Softw 17(2):218–232
Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555
Gao W, Kannan S, Oh S, Viswanath P (2017) Estimating mutual information for discrete-continuous mixtures. In: 31st conference on neural information processing systems (NIPS), Long Beach, CA, USA, pp 1–12
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Amer Statist Assoc 58(301):13–30
Kozachenko L, Leonenko N (1987) Sample estimate of the entropy of a random vector. Probl Inf Transm 23:95–101
Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066138
Macedo F, Oliveira R, Pacheco A, Valadas R (2019) Theoretical foundations of forward feature selection methods based on mutual information. Neurocomputing 325:67–89
Massaron L, Boschetti A (2016) Regression analysis with Python. Packt Publishing Ltd., Birmingham
Nair C, Prabhakar B, Shah D (2007) Entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075v2
Novovičová J, Somol P, Haindl M, Pudil P (2007) Conditional mutual information based feature selection for classification task. In: Rueda L, Mery D, Kittler J (eds) CIARP 2007, LNCS, vol 4756. Springer-Verlag, Berlin, pp 417–426
Paninski L (2003) Estimation of entropy and mutual information. Neural Comput 15:1191–1253
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238
Vergara J, Estévez P (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24:175–186. https://doi.org/10.1007/s00521-013-1368-0
Verleysen M, Rossi F, François D (2009) Advances in feature selection with mutual information. arXiv:0909.0635v1
Yeh J (2014) Real analysis: Theory of measure and integration, 3rd edn. World Scientific, Singapore
Acknowledgements
The work of A. Bulinski is supported by the Russian Science Foundation under grant 19-11-00290 and performed at the Steklov Mathematical Institute of the Russian Academy of Sciences. The theoretical results were established jointly by A. Bulinski and A. Kozhevin; the simulations were carried out by A. Kozhevin. The authors are grateful to the Reviewer for the remarks concerning the book by G. Biau and L. Devroye.
Appendix
Proof of Lemma 1 Formula Eq. 21 becomes evident if we consider the Bernoulli scheme with success probability P(Y = y). One has \(\{N_{y,n} = 0\} = \{Y_{q}\neq y,\ q = 1,\ldots,n\}\) and, for k = 1,…,n,
$$ \{N_{y,n} = k\} \;=\; \bigcup_{I}\Big(\{Y_{i} = y,\ i\in I\}\cap\{Y_{q}\neq y,\ q\in\{1,\ldots,n\}\setminus I\}\Big), $$
where the union \(\bigcup_{I}\) is taken over all I = {i1,…,ik} such that 1 ≤ i1 < … < ik ≤ n.
Thus, for \(m\in \mathbb {N}\), m > k, and any \(B_{i} \in {\mathscr{B}}(\mathbb {R}^{d})\), i = 1,…,m,
where, for p ≥ n + m − k, \({\sum }_{J}\) denotes the sum over all J having the form
For k = 0, the sum \({\sum}_{J}\) is taken over i1,…,im such that n < i1 < … < im. If J = {i1,…,im}⊂{1,…,p}, where im = p, then \(\{{j_{1}^{y}} = i_{1}, \ldots, {j_{m}^{y}} = p\} = \{Y_{i} = y,\ i \in J\} \cap \{Y_{q} \neq y,\ q \in \{1,\ldots,p\}\setminus J\}\). For p, n, m and k under consideration, the number of all such collections J = J(p, n, m, k) equals \(\binom{n}{k}\binom{p-n-1}{m-k-1}\). Consequently,
where we took into account that (X1, Y1), (X2, Y2),… are i.i.d. random vectors having the same distribution as (X, Y). Set l = p − (n + m − k). Then
since \({\sum }_{l=0}^{\infty }\binom {l+m-k-1}{l}{\textsf {P}}(Y \ne y)^{l}{\textsf {P}}(Y=y)^{m-k}=1\) for the negative binomial distribution. Thus
The latter relation yields all the statements of the Lemma when m > k.
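For completeness, the series identity invoked above is simply the normalization of the negative binomial law. Writing p = P(Y = y) and q = P(Y ≠ y) = 1 − p, the generalized binomial series gives

$$ \sum_{l=0}^{\infty}\binom{l+m-k-1}{l}\,q^{l}\,p^{m-k} \;=\; p^{m-k}\,(1-q)^{-(m-k)} \;=\; p^{m-k}\,p^{-(m-k)} \;=\; 1. $$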
Now let 1 ≤ m ≤ k, where k ∈ {0,…,n}. Then necessarily k > 0, and hence
This equality implies the validity of all the assertions of the Lemma in the case considered. □
Proof of Lemma 3 We can take c = 1 without loss of generality. Indeed, I(X; Y) = I(X; aY) for any a ≠ 0 since, for each one-to-one mapping ϕ: M → T with ♯(M) = ♯(T) = m, one has I(X; ϕ(Y)) = I(X; Y). Condition (B) is valid in view of Corollary 1 of Bulinski and Kozhevin (2018). Consider (C). One has \(P_{X} \ll \mu\) and, consequently, for any \(x\in \mathbb{R}^{d}\) and y ∈ M,
Moreover,
Here we employed that \(0 \leq f_{Y|X}(y|x) \leq 1\) for any y ∈ M and μ-almost all \(x \in \mathbb{R}^{d}\), and that \(0 \leq f_{Y}(y) = {\textsf{P}}(Y = y) \leq 1\). Thus, (C) is valid.
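For orientation, the logistic model dealt with here has the following conditional law of Y given X = x; these expressions are our reconstruction from the formulas exploited below (see also the function \(g_{1}\) in the proof of Lemma 4):

$$ f_{Y|X}(1\mid x) = \frac{1}{1 + e^{-(w,x)-b}}, \qquad f_{Y|X}(0\mid x) = \frac{1}{1 + e^{(w,x)+b}}, \qquad g_{y}(x) = f_{Y|X}(y\mid x)\,f(x), $$

so the ratios \(g_{y}(x)/g_{y}(z)\) bounded below reduce to the logistic ratios treated in the next displays.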
Corollary 2.7 in Bulinski and Dimitrov (2019b) implies that, for each ν > 2, ε > 0, one has \(L_{f}(\nu )<\infty \) and \(G_{f}(\varepsilon )<\infty \). It remains to show that \(T_{g_{y}}(\varepsilon ) < \infty \) for y ∈ M.
According to formula (4.17) of Bulinski and Dimitrov (2019b)
where \(\lambda_{min} > 0\) is the minimal eigenvalue of Σ. Immediately one obtains
First we find a lower bound for \(\frac {1 + e^{-(w,x)-b}}{1 + e^{-(w,z)-b}}\) when ∥x − z∥ ≤ r. Set u = −(w, x) − b. Then
In a similar way \(\frac {1 + e^{(w,x)+b}}{1 + e^{(w,z)+b}} \geq e^{-\|w\|r}\) whenever ∥x − z∥≤ r. Hence, for y ∈ M, R > 0, r ∈ (0,R) and \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ r, one has
Consequently, for \(C = e^{- \|w\| R -\frac {R^{2}}{2 \lambda _{min} }}\), y ∈ M, \(x \in \mathbb {R}^{d}\) and 0 < r < R,
For each ε ∈ (0, 1) and y ∈ M,
hence, \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y)\, dx < \infty. \) Due to Lemma 2.5 of Bulinski and Dimitrov (2019b), we can claim that \(T_{g_{y}}(\varepsilon ):=T_{g_{y}}(\varepsilon ,\varepsilon )<\infty \) for all ε small enough. The proof of the Lemma is complete. □
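The logistic model treated in Lemma 3 is also the one used for the simulations mentioned in the Abstract. The following minimal sketch (not the authors' exact statistics; the parameter values, the choice Σ = I, and all helper names such as kl_entropy are ours) samples the mixed pair (X, Y) with Gaussian X and a logistic conditional law for Y, and plugs Kozachenko-Leonenko entropy estimates (Kozachenko and Leonenko 1987) into the representation I(X; Y) = H(X) − ∑_y P(Y = y) H(X | Y = y):

```python
# Illustrative plug-in estimate of I(X;Y) in the mixed model of Lemma 3.
# Assumptions: Gaussian X with Sigma = I, logistic link with our own (w, b);
# this is a sketch, not the estimator analyzed in the paper.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(sample, k=1):
    """Kozachenko-Leonenko estimate of differential entropy (in nats)."""
    n, d = sample.shape
    dist, _ = cKDTree(sample).query(sample, k=k + 1)       # k-th NN, self excluded
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(dist[:, k]))

rng = np.random.default_rng(0)
d, n = 3, 20000
w, b = rng.normal(size=d), 0.5
X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # density f
p1 = 1.0 / (1.0 + np.exp(-(X @ w) - b))                      # f_{Y|X}(1|x)
Y = (rng.random(n) < p1).astype(int)

# I(X;Y) = H(X) - sum_y P(Y=y) H(X | Y=y), all entropies estimated by kl_entropy
mi_hat = kl_entropy(X) - sum((Y == y).mean() * kl_entropy(X[Y == y]) for y in (0, 1))
print(f"plug-in estimate of I(X;Y): {mi_hat:.4f} nats")
```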
Proof of Lemma 4 As in the proof of Lemma 3, we infer that (C) is valid. Now we turn to (A). Obviously, \(f(x) \leq \frac {1}{(\pi \gamma )^{d}}\) for \(x \in \mathbb {R}^{d}\), therefore \(Q_{f}(\varepsilon ) < \infty \) for ε ∈ (0, 1). Let us prove that \(L_{f}(\nu ) < \infty \) for some ν > 2. We employ a linear change of variables in the integral representing \(L_{f}(\nu)\) and note that \(\left |\log \| y - x \| -\log \gamma \right |^{\nu } \leq 2^{\nu -1}(\left |\log \| y - x \|\right |^{\nu } + |\log \gamma |^{\nu })\). Then we can claim that \(L_{f}(\nu )<\infty \) if
Set \(u = \frac {y - x}{\sqrt {2}}, v = \frac {x + y}{\sqrt {2}}\). Then, for u = (u1,…,ud) and v = (v1,…,vd), we can study the convergence of the following integral:
since, for \(z \in \mathbb {R}\), \( I(z) = {\int \limits }_{\mathbb {R}} \frac {1}{\left (1 + \frac {(z - s)^{2}}{2} \right )} \frac {1}{\left (1 + \frac {(z + s)^{2}}{2} \right )} ds = \frac {\sqrt {2} \pi }{z^{2} + 2}. \) Introduce \(v_{i}={u_{i}^{2}}\), i = 1,…,d. Thus, it is enough to show that \(J(d)={\int \limits }_{(0,\infty )^{d}}h_{d}(v)dv<\infty \), where
It is easily seen that \(J(1)<\infty \). Consider d ≥ 2. One has J(d) = J1(d) + J2(d), where J1(d) and J2(d) are integrals of hd(v) taken over \(B_{1}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} <1\}\) and \(B_{2}:=(0,\infty )^{d} \cap \{v_{1}+\ldots +v_{d} \geq 1\}\), respectively. We write
where we use mathematical induction on d. In a similar way one can verify that \(J_{2}(d)<\infty \). The proof of the finiteness of J(d), for each \(d\in \mathbb {N}\) and ν > 2, is complete, so \(L_{f}(\nu ) < \infty \) for any ν > 2. Now we show that, for y ∈ M, \(x \in \mathbb {R}^{d}\), R > 0 and 0 ≤ r ≤ R, inequality (2) is valid, where C > 0 does not depend on x, y and r.
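The closed form of I(z) stated above can be checked by a standard residue computation (our verification; it is not needed for the argument). Rewriting the integrand as \(4\left(((s-z)^{2}+2)((s+z)^{2}+2)\right)^{-1}\), its poles in the upper half-plane are simple (for z ≠ 0) and located at \(s = \pm z + i\sqrt{2}\), whence

$$ I(z) \;=\; 2\pi i\left(\frac{1}{2i\sqrt{2}\,z\,(z+i\sqrt{2})} + \frac{1}{2i\sqrt{2}\,z\,(z-i\sqrt{2})}\right) \;=\; \frac{2\pi i}{i\sqrt{2}\,(z^{2}+2)} \;=\; \frac{\sqrt{2}\,\pi}{z^{2}+2}, $$

and the case z = 0 follows by continuity.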
Due to Eq. 44, for each y ∈ M and \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ r (r ≥ 0),
For \(d\in \mathbb {N}\) and \(x,z\in \mathbb {R}^{d}\), set \( F_{d}(x,z):= {\prod }_{i=1}^{d}\frac {1 + {x_{i}^{2}}}{1 + {z_{i}^{2}}}, \quad x=(x_{1},\ldots ,x_{d}),\ z=(z_{1},\ldots ,z_{d}). \) Now we will consider \(x,z\in \mathbb {R}^{d}\) such that ∥x − z∥≤ γr. Take R < 1/γ. One has
Evidently, \(\frac {1}{s+1}\geq 1-s\) for s > − 1. Thus we get \( F_{d}(x,z)\geq {\prod }_{i=1}^{d} (a_{i} - b_{i}(z_{i}-x_{i})), \) where \(a_{i} = 1-\frac {(\gamma r)^{2}}{1+{x_{i}^{2}}} \geq 1 - (\gamma r)^{2} \geq 1-(\gamma R)^{2}\), \(b_{i} = \frac {2 x_{i}}{1 + {x_{i}^{2}}}\), i = 1,…,d. Using induction one can prove that, for any r ≥ 0, \(A_{i},B_{i}\in \mathbb {R}\), i = 1,…,d,
Let us return to Eq. 3. If ∥x − z∥ ≤ r ≤ R, where 0 < R < 1/γ, then
Hence Eq. 2 is valid with \( C = \left (1-(\gamma R)^{2}\right )^{d} e^{-\|w\|R}\). Therefore, for each ε ∈ (0, 1) and y ∈ M,
and \( T_{g_{y}}(\varepsilon ,R) = {\int \limits }_{\mathbb {R}^{d}} m_{f_{X, Y}}^{-\varepsilon }(x, R) f_{X, Y}(x, y) dx < \infty \).
It remains to verify condition (B). The function \( g_{1}(x) = \frac {1}{1 + e^{-(w, x) - b}} f(x) \) is strictly positive. Moreover,
The function \(s(u) = \frac {u}{(1 + u^{2})^{2}}\) is continuous and \(\lim _{u \to \infty } s(u) = \lim _{u \to -\infty } s(u) = 0\), therefore \( \max \limits _{u \in \mathbb {R}} |u - \nu |\left (1 + ((u - \nu )/\gamma )^{2}\right )^{-1} < \infty . \) Thus \(\max \limits _{x \in \mathbb {R}^{d} }\|\nabla g_{1}(x)\| < \infty \) and \(g_{1}\) satisfies the Lipschitz condition with some constant C0 > 0 on \(\mathbb{R}^{d}\). According to Remark 1 in Bulinski and Kozhevin (2018) we conclude that \(g_{1}\) is C0-constricted. The same reasoning is valid for \(g_{0}\).
For i.i.d. random variables X1,…,Xd, the Minkowski inequality yields
Thus (B) is satisfied. The proof of the Lemma is complete. □