Efficient feature selection using shrinkage estimators

Information theoretic feature selection methods quantify the importance of each feature by estimating mutual information terms that capture three aspects: relevancy, redundancy and complementarity. These terms are commonly estimated by maximum likelihood, while how to use shrinkage estimators instead remains an under-explored area of research. Our work suggests a novel shrinkage method for data-efficient estimation of information theoretic terms. Its small-sample behaviour makes it particularly suitable for estimating discrete distributions with a large number of categories (bins). Using our novel estimators we derive a framework for generating feature selection criteria that capture any high-order feature interaction for redundancy and complementarity. We perform a thorough empirical study across datasets from diverse sources and using various evaluation measures. Our first finding is that our shrinkage-based methods achieve better results, while keeping the same computational cost as the simple maximum-likelihood-based methods. Furthermore, under our framework we derive efficient novel high-order criteria that outperform state-of-the-art methods in various tasks.

Following the approach of Hausser and Strimmer (2009), we can derive a simple estimate of λ* by replacing all variances, covariances and expectations with their empirical counterparts (Schäfer and Strimmer, 2005).

To derive the expressions of the five terms, we assume a random vector N, whose elements are the counts N_xy, distributed as a |X||Y|-dimensional Multinomial random variable. Its parameters are N, the total number of observations, and p, the vector of the true underlying probabilities p(xy). Under this assumption we derive estimates for the five terms of eq. (1).
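Before deriving the individual terms, it may help to see how λ̂* is used: the shrinkage estimate of the joint distribution mixes the ML estimate with the independence-based target p̂_Ind(xy) = p̂_ML(x) p̂_ML(y). The sketch below is our own minimal illustration, not the authors' code; its numerator uses the first two terms derived in this section, while its denominator replaces the expectation E[(p̂_Ind − p̂_ML)²] by the observed squared difference, a simplification of the remaining three terms of eq. (1).

```python
import numpy as np

def shrinkage_joint_estimate(counts):
    """James-Stein-type shrinkage of the ML joint distribution towards the
    independence target p_Ind(x, y) = p_ML(x) p_ML(y).

    Sketch only: the numerator uses plug-in estimates of Var[p_ML(xy)] and
    Cov[p_ML(xy), p_Ind(xy)]; the denominator approximates the remaining
    terms of eq. (1) by the observed squared difference.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_ml = counts / N                                     # ML estimate of p(x, y)
    px = p_ml.sum(axis=1, keepdims=True)                  # p_ML(x) as a column
    py = p_ml.sum(axis=0, keepdims=True)                  # p_ML(y) as a row
    p_ind = px @ py                                       # shrinkage target p(x) p(y)

    var_ml = p_ml * (1.0 - p_ml) / N                      # first term (variance)
    cov = (p_ml / N**2) * ((N - 1) * (px + py - 2 * px * py) + 1 - p_ml)  # second term (covariance)
    lam = (var_ml - cov).sum() / max(((p_ind - p_ml) ** 2).sum(), 1e-12)
    lam = float(np.clip(lam, 0.0, 1.0))                   # keep the shrinkage intensity in [0, 1]
    return lam * p_ind + (1.0 - lam) * p_ml, lam
```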
• First term: The first term can be written as follows:

    Var[p̂_ML(xy)] = Var[N_xy / N] = (1/N²) ( E[N_xy²] − E[N_xy]² ).    (2)

Under the Multinomial modelling, the first two moments are (Mosimann, 1962):

    E[N_xy] = N p(xy),    (3)
    E[N_xy²] = N(N−1) p(xy)² + N p(xy).    (4)

By substituting eqs (3) and (4) into eq. (2) we get:

    Var[p̂_ML(xy)] = (1/N²) [ N(N−1) p(xy)² + N p(xy) − N² p(xy)² ] = p(xy)(1 − p(xy)) / N.

The above term can be estimated as

    Var̂[p̂_ML(xy)] = p̂_ML(xy)(1 − p̂_ML(xy)) / N.

• Second term: The covariance term can be written as follows:

    Cov[p̂_ML(xy), p̂_Ind(xy)] = E[p̂_ML(xy) p̂_Ind(xy)] − E[p̂_ML(xy)] E[p̂_Ind(xy)]
                              = (1/N³) E[N_xy N_x N_y] − (1/N³) E[N_xy] E[N_x N_y].    (5)

The expected value E[N_x N_y] can be calculated by expanding N_x = Σ_{y′} N_{xy′} and N_y = Σ_{x′} N_{x′y}, so that

    E[N_x N_y] = Σ_{x′∈X} Σ_{y′∈Y} E[N_{xy′} N_{x′y}].    (6)

Under the Multinomial modelling, the second order moments of two distinct cells can be written as follows (Mosimann, 1962):

    E[N_{xy′} N_{x′y}] = N(N−1) p(xy′) p(x′y),   for (x, y′) ≠ (x′, y).    (7)–(8)

By substituting eqs (4), (7) and (8) into eq. (6) we get:

    E[N_x N_y] = N(N−1) [ Σ_{y′≠y} p(xy′) p(y) + Σ_{x′≠x} p(xy) p(x′y) + p(xy)² ] + N p(xy)
               = N(N−1) [ (p(x) − p(xy)) p(y) + p(xy)(p(y) − p(xy)) + p(xy)² ] + N p(xy).

By simple computation, we obtain

    E[N_x N_y] = N(N−1) p(x) p(y) + N p(xy).    (9)

Now we will calculate the expected value E[N_xy N_x N_y], which can be written as

    E[N_xy N_x N_y] = Σ_{x′∈X} Σ_{y′∈Y} E[N_xy N_{xy′} N_{x′y}].    (10)

Under the Multinomial modelling, the third order moments can be written as follows (Mosimann, 1962): for distinct cells a, b, c,

    E[N_a N_b N_c] = N^(3) p_a p_b p_c,
    E[N_a² N_b]    = N^(3) p_a² p_b + N^(2) p_a p_b,
    E[N_a³]        = N^(3) p_a³ + 3 N^(2) p_a² + N p_a,

where N^(2) = N(N−1) and N^(a) = N(N−1)···(N−a+1) for a > 2. By substituting these moments into eq. (10), and by simple computation, we obtain

    E[N_xy N_x N_y] = N(N−1)(N−2) p(xy) p(x) p(y) + N(N−1) p(xy) (p(x) + p(y) + p(xy)) + N p(xy).    (15)

Thus, by substituting eqs (3), (9) and (15) into eq. (5), and again by simple computation, we obtain

    Cov[p̂_ML(xy), p̂_Ind(xy)] = (p(xy)/N²) [ (N−1)(p(x) + p(y) − 2 p(x) p(y)) + 1 − p(xy) ].

The above term can be estimated as

    Cov̂[p̂_ML(xy), p̂_Ind(xy)] = (p̂_ML(xy)/N²) [ (N−1)(p̂_ML(x) + p̂_ML(y) − 2 p̂_ML(x) p̂_ML(y)) + 1 − p̂_ML(xy) ].

• Third term: This expectation term can be calculated as follows:

    E[p̂_ML(xy)²] = (1/N²) E[N_xy²].

Using eq. (4), the last expression can be written as

    E[p̂_ML(xy)²] = ((N−1)/N) p(xy)² + p(xy)/N.

Finally, the above term can be estimated as

    Ê[p̂_ML(xy)²] = ((N−1)/N) p̂_ML(xy)² + p̂_ML(xy)/N.

• Fourth term: This term can be written as follows:

    E[p̂_Ind(xy)²] = (1/N⁴) E[N_x² N_y²] = (1/N⁴) Σ_{y′,y″∈Y} Σ_{x′,x″∈X} E[N_{xy′} N_{xy″} N_{x′y} N_{x″y}].    (16)

According to the known fourth order moment formulas (Mosimann, 1962), we need to treat the terms in eq. (16) by splitting them into five categories: A. all four cells different; B. three different cells out of four; C. two different cells out of four (two equal pairs); D. two different cells out of four (a triplet and a single cell); E. a single cell raised to the fourth power.
The sum of fourth order moments in eq. (16) can thus be written as the sum of five different contributions, one for each of the categories A–E, by splitting the double sum over x′, x″ ∈ X and y′, y″ ∈ Y accordingly. Under the Multinomial modelling, the fourth order moments needed to calculate these five contributions follow from the corresponding moment formulas (Mosimann, 1962). Using these moments and some simple computation, we obtain the expressions for the five contributions and, by summing them, the second moment E[p̂_Ind(xy)²]. Since the parameters p(xy) are unknown, we substitute the ML estimates in order to estimate the second moment of p̂_Ind(xy).

• Fifth term: This expectation term can be calculated as follows:

    E[p̂_ML(xy) p̂_Ind(xy)] = (1/N³) E[N_xy N_x N_y].

Using eq. (15), the last expression can be written as

    E[p̂_ML(xy) p̂_Ind(xy)] = [(N−1)(N−2)/N²] p(xy) p(x) p(y) + [(N−1)/N²] p(xy) (p(x) + p(y) + p(xy)) + p(xy)/N².

Finally, the above term can be estimated by substituting the ML estimates p̂_ML(xy), p̂_ML(x) and p̂_ML(y) for the unknown probabilities.

A.2 Proof of Theorem 2

Let us re-express the CMI criterion, presented in the main text's eq. (10), as

    I(X_k; Y | X_θ) = I(X_k; Y) − I(X_k; X_θ) + I(X_k; X_θ | Y).    (29)

Now we will decompose the second and third terms, which capture redundancy and complementarity, using Assumptions 1 and 2 respectively.
Redundancy term – The term I(X_k; X_θ) can be expanded in terms of entropies. Using Assumption 1 we can re-write the rhs of this expansion, and then, using the identity H(A|B) = H(A) − I(B; A) for the second term and H(A|BC) = H(A|C) − I(B; A|C) for the third term, we obtain the decomposed redundancy term of eq. (30).

Complementarity term – Analogously, the term I(X_k; X_θ | Y) can be expanded in terms of conditional entropies. Using Assumption 2 we can re-write the rhs of this expansion, and then, using the identity H(A|BC) = H(A|C) − I(B; A|C) for the second term and H(A|BCD) = H(A|CD) − I(B; A|CD) for the third term, we obtain the decomposed complementarity term of eq. (31).

The derived redundancy and complementarity terms, eqs. (30) and (31), can be substituted into eq. (29), so that the CMI criterion can be written as a sum of low-order mutual information terms plus Const, where Const collects terms that are constant with respect to X_k. Removing these terms has no effect on the choice of feature, so dropping them yields an equivalent criterion.
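As a quick numerical sanity check of the re-expression in eq. (29), the identity I(X_k; Y | X_θ) = I(X_k; Y) − I(X_k; X_θ) + I(X_k; X_θ | Y) can be verified on any small discrete joint distribution using plug-in entropies. The sketch below is our own illustration, not part of the original derivation; all names are ours.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cmi_identity_holds(pxyz):
    """pxyz[i, j, k] = p(Xk = i, Y = j, Xtheta = k).  Check that
    I(Xk; Y | Xtheta) = I(Xk; Y) - I(Xk; Xtheta) + I(Xk; Xtheta | Y)."""
    p_x, p_y, p_z = pxyz.sum(axis=(1, 2)), pxyz.sum(axis=(0, 2)), pxyz.sum(axis=(0, 1))
    p_xy, p_xz, p_yz = pxyz.sum(axis=2), pxyz.sum(axis=1), pxyz.sum(axis=0)
    i_xy = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
    i_xz = entropy(p_x) + entropy(p_z) - entropy(p_xz.ravel())
    i_xz_g_y = entropy(p_xy.ravel()) + entropy(p_yz.ravel()) - entropy(p_y) - entropy(pxyz.ravel())
    i_xy_g_z = entropy(p_xz.ravel()) + entropy(p_yz.ravel()) - entropy(p_z) - entropy(pxyz.ravel())
    return np.isclose(i_xy_g_z, i_xy - i_xz + i_xz_g_y)

rng = np.random.default_rng(0)
pxyz = rng.dirichlet(np.ones(2 * 3 * 4)).reshape(2, 3, 4)  # a random joint distribution
assert cmi_identity_holds(pxyz)
```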

A.3 Proof of Theorem 3
The JMI-3 criterion can be written in the following way: for each pair of selected features X_i, X_j ∈ X_θ, the chain rule gives

    I(X_k X_i X_j; Y) = I(X_i X_j; Y) + I(X_k; Y | X_i X_j).

The first term is constant with respect to X_k, so removing it has no effect on the choice of feature. We will use the identity I(A; B|C) = I(A; B) − I(A; C) + I(A; C|B) to re-express the conditional mutual information term:

    I(X_k; Y | X_i X_j) = I(X_k; Y) − I(X_k; X_i X_j) + I(X_k; X_i X_j | Y).

The last two terms of the rhs can be written as follows: I(X_k; X_i X_j) = I(X_k; X_j) + I(X_k; X_i | X_j) and I(X_k; X_i X_j | Y) = I(X_k; X_j | Y) + I(X_k; X_i | X_j Y). Thus the JMI-3 criterion is equivalent (up to terms constant in X_k and a positive normalisation) to

    Σ_{X_i, X_j ∈ X_θ, i ≠ j} [ I(X_k; Y) − I(X_k; X_j) − I(X_k; X_i | X_j) + I(X_k; X_j | Y) + I(X_k; X_i | X_j Y) ].

The last expression shows that the JMI-3 criterion can be decomposed into the five terms of the main text's eq. (17), with coefficients β = γ = 1/|X_θ| for the pairwise terms and β = γ = 1/(|X_θ|(|X_θ| − 1)) for the third-order terms.
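For concreteness, the following is a minimal plug-in sketch of a JMI-3-style score (our own illustration; the function names and the exact normalisation are assumptions, not the authors' released code). It treats the criterion as the average of I(X_k X_i X_j; Y) over ordered pairs of distinct already-selected features, consistent with the pairwise and third-order coefficients above, and uses simple ML estimates of the discrete distributions.

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """Plug-in (ML) estimate of I(a; b) in nats from paired sequences of discrete values."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in joint.items())

def jmi3_score(xk, selected, y):
    """JMI-3-style score for candidate feature xk: average of I(Xk Xi Xj; Y) over
    ordered pairs of distinct selected features.  Each feature is a 1-D sequence of
    discrete values; a joint variable is represented by one tuple per sample."""
    if len(selected) < 2:                       # fall back to relevancy with fewer than two selected features
        return mutual_information(list(xk), list(y))
    scores = []
    for i, xi in enumerate(selected):
        for j, xj in enumerate(selected):
            if i != j:
                joint = list(zip(xk, xi, xj))   # (Xk, Xi, Xj) treated as a single variable
                scores.append(mutual_information(joint, list(y)))
    return float(np.mean(scores))
```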

A.4 Proof of Theorem 4
Using the identity I(A; B|C) = I(A; B) − I(A; C) + I(A; C|B), we can re-express the conditional mutual information term of the CMIM-3 criterion in the following way:

    I(X_k; Y | X_i X_j) = I(X_k; Y) − I(X_k; X_i X_j) + I(X_k; X_i X_j | Y).

The last two terms of the rhs can be written as follows: I(X_k; X_i X_j) = I(X_k; X_j) + I(X_k; X_i | X_j) and I(X_k; X_i X_j | Y) = I(X_k; X_j | Y) + I(X_k; X_i | X_j Y). Thus the CMIM-3 criterion can be decomposed, for each pair of selected features X_i, X_j ∈ X_θ, into the five terms

    I(X_k; Y) − I(X_k; X_j) − I(X_k; X_i | X_j) + I(X_k; X_j | Y) + I(X_k; X_i | X_j Y).

B Protocol for generating synthetic data

B.1 Generating data for Section 3.2

The data are generated as follows:

1. We specify the desired values for the population MI, i.e. I(X; Y) ∈ (0, 0.05].

2. We generate the population values of the probabilities p(xy) by sampling from a Dirichlet distribution. Using these probabilities we calculate the population value of the mutual information I(X; Y); if it lies inside the desired range (specified in Step 1) we keep the probabilities, otherwise we sample again. Every time we sample from a Dirichlet, Dir(α), the concentration parameter α is a number chosen randomly from [0.3, 3]. It is worth mentioning that the concentration parameter α controls the concentration of the prior: large α produces near-uniform distributions and, as a result, small MI values, while small values lead to more concentrated distributions, which generate higher mutual information values. Using the probabilities p(xy), we calculate p(x) = Σ_{y∈Y} p(xy) and p(y|x) = p(xy)/p(x).

3. We generate the values of X, {x_n}_{n=1}^N, by sampling N observations from the marginal distribution p(x). Then we generate the values of Y by sampling N observations from the conditional distributions {p(y|X = x_n)}_{n=1}^N.
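The following is a minimal sketch of the protocol above, assuming MI is measured in nats; all function and variable names are ours, not the released experimental code.

```python
import numpy as np

def population_mi(pxy):
    """Population mutual information (nats) of a joint probability table p(x, y)."""
    px = pxy.sum(axis=1, keepdims=True)     # p(x) as a column vector
    py = pxy.sum(axis=0, keepdims=True)     # p(y) as a row vector
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def generate_xy(n_samples, n_x, n_y, mi_range=(0.0, 0.05), seed=None):
    """Protocol of Section B.1: rejection-sample p(x, y) from a Dirichlet until the
    population MI falls in `mi_range`, then draw x_n ~ p(x) and y_n ~ p(y | x_n)."""
    rng = np.random.default_rng(seed)
    while True:
        alpha = rng.uniform(0.3, 3.0)       # concentration parameter chosen at random from [0.3, 3]
        pxy = rng.dirichlet(np.full(n_x * n_y, alpha)).reshape(n_x, n_y)
        if mi_range[0] < population_mi(pxy) <= mi_range[1]:
            break
    px = pxy.sum(axis=1)
    x = rng.choice(n_x, size=n_samples, p=px)                          # x_n ~ p(x)
    y = np.array([rng.choice(n_y, p=pxy[xn] / px[xn]) for xn in x])    # y_n ~ p(y | x_n)
    return x, y, pxy
```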
B.2 Generating data for Section 3.3

The data are generated as follows:

1. We specify the desired values for the population CMI, i.e. I(X; Y|Z) ∈ (0, 0.05].

2. We generate the population values of the probabilities p(x) and p(y) by sampling from a Dirichlet distribution, and for each value of X and Y we generate the values of p(z|xy) by sampling again from a Dirichlet. Using these probabilities we calculate the population value of the conditional mutual information I(X; Y|Z); if it lies inside the desired range (specified in Step 1) we keep the probabilities, otherwise we sample again. Every time we sample from a Dirichlet, Dir(α), the concentration parameter α is a number chosen randomly from [0.3, 3].

3. We generate the values of X and Y, {x_n, y_n}_{n=1}^N, by sampling N observations from the marginal distributions p(x) and p(y) respectively. Then we generate the values of Z by sampling N observations from the conditional distributions {p(z|X = x_n, Y = y_n)}_{n=1}^N.

Table 1 shows the characteristics of the 11 benchmark Bayesian networks (BN) that we used to simulate the data in Section 5, while Table 2 shows the details of the 20 UCI datasets used in Section 6.
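Relative to the protocol of Section B.1, the only new ingredient in Section B.2 is computing the population conditional mutual information from the factorisation p(x, y, z) = p(x) p(y) p(z | x, y). A minimal sketch (again with our own, illustrative names) is:

```python
import numpy as np

def population_cmi(px, py, pz_given_xy):
    """Population I(X; Y | Z) in nats for the generative model of Section B.2,
    where p(x, y, z) = p(x) p(y) p(z | x, y) and pz_given_xy has shape (|X|, |Y|, |Z|)."""
    pxyz = px[:, None, None] * py[None, :, None] * pz_given_xy   # full joint p(x, y, z)
    pz = pxyz.sum(axis=(0, 1))                                   # p(z)
    pxz = pxyz.sum(axis=1)                                       # p(x, z)
    pyz = pxyz.sum(axis=0)                                       # p(y, z)
    mask = pxyz > 0
    ratio = (pxyz * pz[None, None, :])[mask] / (pxz[:, None, :] * pyz[None, :, :])[mask]
    return float(np.sum(pxyz[mask] * np.log(ratio)))
```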

D Experimental Results
Tables 3 and 4 contain the complete results of Section 5 for each dataset, while Table 5 contains the results presented in Section 6.

Table 3: Comparison between our suggested high-order FS criteria in terms of their ability to identify the correct features (TPR) for the BN data with: (a) sample size = 500, and (b) sample size = 2500. The best method (i.e. the highest TPR) is highlighted in bold, and at the bottom of the table we report the average ranking score of each method across all datasets.