Abstract
A century ago, when Student’s t-statistic was introduced, no one imagined its range of applications in the modern era. It now appears in large-scale multiple hypothesis testing, feature selection and ranking, high-dimensional signal detection, and more. Student’s t-statistic is constructed from the empirical distribution function (EDF). An alternative to the EDF is the kernel density estimate (KDE), a smoothed version of the EDF. The novelty of this work consists of an alternative to Student’s t-test that uses the KDE technique, and an exploration of the usefulness of the KDE-based t-test in the context of large-scale simultaneous hypothesis testing. An optimal bandwidth parameter for the KDE approach is derived by minimizing the asymptotic error between the true p-value and its estimate based on the normal approximation. If the KDE-based approach is used for large-scale simultaneous testing, it is natural to ask when the method fails to manage the error rate. We show that the suggested KDE-based method can control the false discovery rate (FDR) provided the total number of tests diverges at a smaller order of magnitude than N^{3/2}, where N is the total sample size. We compare our method to several alternatives with respect to FDR, and show in simulations that it produces a lower proportion of false discoveries than its competitors, that is, it controls the false discovery rate better. These empirical studies demonstrate that the proposed method can be applied successfully in practice. Its usefulness is further illustrated through a gene expression data example.
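As a rough illustration (not the authors' code), the following is a minimal sketch of a KDE-smoothed two-sample t-statistic, assuming the bandwidth-inflated variance \(S^{2}_{x}/n + S^{2}_{y}/m + h^{2}(1/n + 1/m)\) that appears in the appendix, together with Benjamini–Hochberg screening of the resulting p-values; all function names are illustrative:

```python
import numpy as np

def smoothed_t_stat(x, y, h):
    """KDE-smoothed two-sample t-statistic.

    The bandwidth h inflates the variance estimate by h^2 * (1/n + 1/m),
    matching the variance added by convolving each sample with a kernel
    of scale h.  The bandwidth-selection rule itself is not reproduced.
    """
    n, m = len(x), len(y)
    num = x.mean() - y.mean()
    den = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m
                  + h ** 2 * (1.0 / n + 1.0 / m))
    return num / den

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask for the BH step-up procedure at level alpha."""
    p = np.asarray(pvals, dtype=float)
    d = len(p)
    order = np.argsort(p)
    # Compare sorted p-values to the BH thresholds alpha * j / d.
    below = p[order] <= alpha * np.arange(1, d + 1) / d
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(d, dtype=bool)
    reject[order[:k]] = True
    return reject
```

With h = 0 the statistic reduces to the usual Welch-type two-sample statistic; the paper's contribution is the data-driven choice of h minimizing the p-value approximation error, which is not reproduced here.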
References
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. Ser. B 57, 289–300.
Candes, E. and Barber, R.F. (2018). Stats 300C lecture notes, Stanford University. https://statweb.stanford.edu/~candes/stats300c/Lectures/Lecture7.pdf.
Fan, J., Hall, P. and Yao, Q. (2007). To how many simultaneous hypothesis tests can normal, Student’s t or bootstrap calibration be applied? J. Am. Statist. Assoc. 102, 1282–1288.
Ghosh, S. and Polansky, A.M. (2014). Smoothed and iterated bootstrap confidence regions for parameter vectors. J. Multivar. Anal. 132, 172–182.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
Hall, P., Jing, B.Y. and Lahiri, S.N. (1998). On the sampling window method for long-range dependent data. Statist. Sin. 8, 1189–1204.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802.
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75, 383–386.
Karimi, S. and Farrokhnia, M. (2014). Leukemia and small round blue-cell tumor cancer detection using microarray gene expression data set: combining data dimension reduction and variable selection technique. Chemom. Intell. Lab. Syst. 139, 6–14.
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R. and Peterson, C. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7, 673–679.
Liu, W. and Shao, Q. (2014). Phase transition and regularized bootstrap in large-scale t-tests with false discovery rate control. Ann. Statist. 42, 2003–2025.
Murie, C., Woody, O., Lee, A. and Nadon, R. (2009). Comparison of small n statistical tests of differential expression applied to microarrays. BMC Bioinformatics 10, 45.
Polansky, A.M. (2001). Bandwidth selection for the smoothed bootstrap percentile method. Comput. Stat. Data Anal. 36, 333–349.
Polansky, A.M. (2011). Introduction to Statistical Limit Theory. Chapman and Hall/CRC, Boca Raton.
Polansky, A.M. and Schucany, W.R. (1997). Kernel smoothing to improve bootstrap confidence intervals. J. R. Statist. Soc. Ser. B 59, 821–838.
Smyth, G. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statist. Appl. Genet. Mol. Biol. 3, Article 3.
Storey, J. (2002). A direct approach to false discovery rates. J. R. Statist. Soc. Ser. B 64, 479–498.
Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
Westfall, P.H. and Young, S.S. (1993). Resampling Based Multiple Testing: Examples and Methods for p-value Adjustments. Wiley, New York.
Acknowledgments
We would like to thank the Editor, Associate Editor and two anonymous referees for their careful reading and constructive suggestions which improved the readability of the paper.
Appendix A: Proofs
The proofs of Lemmas 1 and 5 are straightforward: they follow from the fact that a kernel density estimator is the convolution of the empirical distribution and the kernel function with respect to counting measure, and are therefore omitted.
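Concretely, with \(F_{n}\) the EDF of \(X_{1},\ldots,X_{n}\) and \(K_{h}(u)=h^{-1}K(u/h)\), the identity underlying Lemmas 1 and 5 can be written as

```latex
\hat{f}_{h}(x)
  = \frac{1}{n}\sum_{j=1}^{n} K_{h}(x - X_{j})
  = \int_{-\infty}^{\infty} K_{h}(x - t)\, \mathrm{d}F_{n}(t),
```

so the KDE is the convolution of the empirical measure \(\mathrm{d}F_{n}\) with the scaled kernel; equivalently, it is the density of \(X^{*}+h\varepsilon\), where \(X^{*}\sim F_{n}\) and \(\varepsilon\) has density \(K\).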
Proof of Lemma 2.
To prove Lemma 2 it suffices to show that
Suppose that
Then by the definition of \(\limsup \), we have
The above line implies that for all N ≥ 1,
and now, switching the order of suprema for a doubly indexed sequence, one obtains
Again, \(\sup _{i\geq 1}\sup _{j\geq 1} j^{k}h_{i}=\infty \) implies that there are infinitely many i such that
But Assumption (ii) implies that for each i, the sequence \(\{j^{k}h_{i}\}_{j=1}^{\infty }\) is bounded, and this contradicts the fact that
Thus we must have
and consequently Lemma 2 is established. □
The proofs of Theorems 3 and 6 depend on the Edgeworth expansion of the distribution of
\(i = 1,\ldots,d\). The following result gives the Edgeworth expansion of the distribution of \(T_{i}\).
Proposition 9.
Under assumptions (i), (iii), and (iv), the distribution of \(T_{i}\) admits an Edgeworth expansion that
holds uniformly in i, where the polynomials \(q_{1,i}(x)\) and \(q_{2,i}(x)\) are given by
where
and
In these expressions, \(\gamma _{x,i}\) and \(\gamma _{y,i}\) denote the skewness of the ith component of X and of Y, respectively, and \(\kappa _{x,i}\) and \(\kappa _{y,i}\) the corresponding kurtoses. The proof of Proposition 9 is given in the supplementary material.
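Schematically (the exact polynomials are those displayed above), the expansion in Proposition 9 has the standard two-term form of Hall (1992):

```latex
P\{T_{i}\leq x\}
  = \Phi(x) + N^{-1/2}q_{1,i}(x)\phi(x) + N^{-1}q_{2,i}(x)\phi(x) + o(N^{-1}),
```

uniformly in i, where \(\Phi\) and \(\phi\) are the standard normal distribution function and density, \(q_{1,i}\) is an even polynomial whose coefficients involve the skewnesses \(\gamma_{x,i},\gamma_{y,i}\), and \(q_{2,i}\) is an odd polynomial whose coefficients involve the kurtoses \(\kappa_{x,i},\kappa_{y,i}\) as well.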
Proof of Theorem 3.
Let \(\tilde {\sigma }_{i}=\sqrt {\frac {S^{2}_{x,i}}{r_{x}}+\frac {S^{2}_{y,i}}{r_{y}}+{h^{2}_{i}}(r^{-1}_{x}+r^{-1}_{y})}\), then we can express \(\tilde {T}_{i}\) as
where
and \(\hat {\sigma }_{i}=\sqrt {\frac {S^{2}_{x,i}}{r_{x}}+\frac {S^{2}_{y,i}}{r_{y}}}\). Let \(\tilde {t}_{i}\) denote the observed value of the two-sample t statistic, \(\tilde {T}_{i}\), based on the kernel density estimator technique. Then the p-value corresponding to H0i is given by
Let \(\beta _{i}=P(\tilde {T}_{i}\leq |\tilde {t}_{i}|)\). Then \(|\tilde {t}_{i}|\) is the \(\beta_{i}\)th quantile of the distribution of \(\tilde {T}_{i}\). Under assumptions (i) and (iii), \(P\{\tilde {T}_{i}\leq x\}\) has an Edgeworth expansion similar to that of \(P\{T_{i}\leq x\}\), with \(q_{1,i}(x)\) and \(q_{2,i}(x)\) in Proposition 9 replaced by \(\tilde {q}_{1,i}(x)\) and \(\tilde {q}_{2,i}(x)\). Using techniques similar to those of Polansky and Schucany (1997), we can show that
holds uniformly in i, since \(\sup _{i}h_{i}=O(N^{-k})\). Under the assumptions (i), (ii), and (iii), we can obtain the Cornish-Fisher expansion for \(|\tilde {t_{i}}|\) as
uniformly in i, since \(\sup _{i}h_{i}=O(N^{-k})\). An excellent review of Cornish-Fisher expansions can be found in Hall (1992). Here the function \(\tilde {q}_{21,i}(\cdot )\) is a function of \(\tilde {q}_{1,i}(\cdot )\) and \(\tilde {q}_{2,i}(\cdot )\). Equation A.2 implies that
holds uniformly in i. Considering \(P\{\tilde {T}_{i}\leq |\tilde {t_{i}}|\}\), we have that
holds uniformly in i, where the last line follows from an application of the delta method of Section 2.7 of Hall (1992). Combining Proposition 9 and Eq. A.5 with a Taylor series expansion, we obtain
holds uniformly in i. Equations A.5 and A.6 imply that
holds uniformly in i. Similarly we have
holds uniformly in i. Finally, based on Eqs. A.1, A.7, and A.8 we can conclude that
holds uniformly in i. □
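For reference, the Cornish-Fisher inversion used in Eq. A.2 has, schematically, the following one-term form (Hall, 1992, Ch. 2): if

```latex
P\{T \leq x\} = \Phi(x) + N^{-1/2}q_{1}(x)\phi(x) + O(N^{-1})
```

uniformly in x, then the \(\beta\)th quantile \(t_{\beta}\) of T satisfies

```latex
t_{\beta} = z_{\beta} - N^{-1/2}q_{1}(z_{\beta}) + O(N^{-1}),
```

where \(z_{\beta}=\Phi^{-1}(\beta)\); the higher-order function \(\tilde{q}_{21,i}\) arises from carrying this inversion to order \(N^{-1}\).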
Proof of Lemma 4.
To prove this lemma, we use the following result.
Proposition 10.
Let \(\mu _{xi,l}=\mathrm {E}(X_{i}-\mu _{xi})^{l}<\infty \) and \(\mu _{yi,l}=\mathrm {E}(Y_{i}-\mu _{yi})^{l}<\infty \). Let
Then
Proof.
Define
Then for every i and u ≥ n
Consequently, for every r and u ≥ n
which indicates that for every r and n
The above inequality implies that
Thus for every r,
and consequently
Thus, we have
Now, Eq. A.9 implies that
Thus, the last inequality and \(|m_{xi,l}-\mu _{xi,l}|=O_{p}(n^{-1/2})\) imply that
and since \(n/N=O(1)\), it follows that
Similarly we have
□
Proposition 10 implies that \(S^{2}_{x,i}=\sigma ^{2}_{x,i}+O_{p}(N^{-1/2})\), \(\hat {\gamma }_{x,i}=\gamma _{x,i}+O_{p}(N^{-1/2})\), \(\hat {\kappa }_{x,i}=\kappa _{x,i}+O_{p}(N^{-1/2})\) hold uniformly in i, and similarly, \(S^{2}_{y,i}=\sigma ^{2}_{y,i}+O_{p}(N^{-1/2})\), \(\hat {\gamma }_{y,i}=\gamma _{y,i}+O_{p}(N^{-1/2})\), \(\hat {\kappa }_{y,i}=\kappa _{y,i}+O_{p}(N^{-1/2})\) hold uniformly in i.
Again, it can be shown using the technique in Polansky (2011, p. 42) that for any \(\beta _{i}\in [\epsilon ,1-\epsilon ]\) for some \(\epsilon >0\), \(z_{\hat {\beta }_{i}}=z_{\beta _{i}}+O(N^{-1})\).
Using the above asymptotic relations, it can be checked easily that
hold uniformly in i. Now,
uniformly in i because \(\hat {h}^{2}_{i,opt}=h^{2}_{i,opt}+O_{p}(N^{-3/2})\) and \(\frac {S^{2}_{x,i}}{n}+\frac {S^{2}_{y,i}}{m}+\hat {h}^{2}_{i,opt}(\frac {1}{n}+\frac {1}{m})=\sigma ^{2}_{x,i}+\sigma ^{2}_{y,i}+O_{p}(N^{-1/2})\) are true uniformly in i. Equation A.10 implies that
uniformly in i, where the last two lines follow from an application of the delta method of Section 2.7 of Hall (1992).
Proof of Theorem 6.
Following the steps of the proof of Theorem 3, we can prove Theorem 6; hence the proof is omitted. □
Proof of Lemma 7.
Following the steps of the proof of Lemma 4, we can prove Lemma 7; hence the proof is omitted. □
Proof of Theorem 8.
Let \(P_{i}\) denote the true p-value of \(\tilde {T}_{i}\) or \(\tilde {T}^{R}_{i}\), depending on the sign of \(\hat {h}^{2}_{i,\text {opt}}\). Let \(\hat {P}_{i}\) denote \(2(1-{\Phi }(|\tilde {T}_{i}|))\) or \(2(1-{\Phi }(|\tilde {T}^{R}_{i}|))\), again depending on the sign of \(\hat {h}^{2}_{i,\text {opt}}\). For each true null hypothesis \(H_{0i}\), define \(V_{i}\) to be 1 if \(H_{0i}\) is rejected and 0 otherwise. Let \({\mathscr{H}}_{0}\) denote the set of all true null hypotheses. Then \(V={\sum }_{i\in {\mathscr{H}}_{0}} V_{i}\) is the number of wrongly rejected null hypotheses, and the FDR is defined as
Part of this argument rests on ideas from Candes and Barber (2018). We can write \(\frac {V_{i}}{\max \limits \{R,1\}}\) as follows, by summing over all possible values of the number of rejections:
Note that when there are j rejections, \(H_{0i}\) is rejected if and only if \(\hat {p}_{i} \leq \alpha j/d\), and thus
Suppose that \(H_{0i}\) is rejected, i.e. \(\hat {p}_{i} \leq \alpha j/d\). If we set the value of \(\hat {P}_{i}\) to 0, then the number of rejections is still exactly j, because we are only reordering the first j p-values, all of which remain below the threshold \(\alpha j/d\). We denote this new number of rejections by \(R(\hat {p}_{i}=0)\), and thus .
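In symbols, the decomposition described above is, schematically,

```latex
\frac{V_{i}}{\max\{R,1\}}
  = \sum_{j=1}^{d} \frac{1}{j}\,
    \mathbf{1}\{\hat{P}_{i} \leq \alpha j/d,\; R = j\}
  = \sum_{j=1}^{d} \frac{1}{j}\,
    \mathbf{1}\{\hat{P}_{i} \leq \alpha j/d\}\,
    \mathbf{1}\{R(\hat{P}_{i}=0) = j\},
```

where the second factor in each summand depends on \(\hat{P}_{(-i)}\) only; this is what permits conditioning on \(\hat{P}_{(-i)}\) in the next display.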
Set \(\hat {P}=(\hat {P}_{1},\ldots ,\hat {P}_{d})^{\top }\) and \(\hat {P}_{(-i)}=\hat {P}\setminus \{\hat {P}_{i}\}\), then from the above observation we have
where the last two lines follow from the facts that (i)  is non-random when conditioned on \(\hat {P}_{(-i)}\) and \(\hat {P}_{i}=0\), and (ii) the components of \(\hat {P}\) are independent. We now focus attention on . Lemmas 4 and 7 indicate that \(|P_{i}-\hat {P}_{i}|=O_{p}(N^{-3/2})\) uniformly in i, and using arguments similar to the delta method of Hall (1992, Section 2.7), we have
uniformly in i, where the expression in the last equality follows from the fact that \(P_{i}\sim \text {Uniform}(0,1)\), and
uniformly in i. The expressions in Eqs. A.11 and A.12 imply that
uniformly in i, where the last equality is obtained because
and
Equation A.13 implies that
where the last line holds because the \(O(N^{-3/2})\) term is uniform in i, and \(d_{0}={\sum }_{i\in {\mathscr{H}}_{0}}1\) is the number of true null hypotheses. Thus,
and \(\frac {\text {FDR}}{\alpha d_{0}/d}\to 1\) because \(d=o(N^{3/2})\). □
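As a quick numerical check of the limiting value \(\text{FDR} \approx \alpha d_{0}/d\), the following sketch assumes independent tests with exactly uniform null p-values; the Beta(0.1, 10) alternative distribution and the function names are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def bh_reject(p, alpha):
    """Benjamini-Hochberg rejection mask at level alpha."""
    d = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, d + 1) / d
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(d, dtype=bool)
    reject[order[:k]] = True
    return reject

def empirical_fdr(d=100, d0=80, alpha=0.2, reps=500):
    """Average false discovery proportion over `reps` replicates.

    The first d0 hypotheses are true nulls with Uniform(0,1) p-values;
    the remaining d - d0 get stochastically small Beta(0.1, 10) p-values.
    """
    fdp = np.empty(reps)
    for r in range(reps):
        p = np.concatenate([rng.uniform(size=d0),
                            rng.beta(0.1, 10.0, size=d - d0)])
        rej = bh_reject(p, alpha)
        v = rej[:d0].sum()          # false rejections among true nulls
        fdp[r] = v / max(rej.sum(), 1)
    return fdp.mean()
```

With independent uniform null p-values, the average false discovery proportion returned by `empirical_fdr` concentrates near \(\alpha d_{0}/d\), matching the limit in the theorem.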
Cite this article
Ghosh, S., Polansky, A.M. Large-Scale Simultaneous Testing Using Kernel Density Estimation. Sankhya A 84, 808–843 (2022). https://doi.org/10.1007/s13171-020-00220-5