Two-sample homogeneity tests based on divergence measures
The concept of f-divergences introduced by Ali and Silvey (J R Stat Soc (B) 28:131–142, 1966) provides a rich set of distance-like measures between pairs of distributions. Divergences do not focus on certain moments of random variables, but rather consider discrepancies between the corresponding probability density functions. Thus, two-sample tests based on these measures can detect arbitrary alternatives when testing the equality of the distributions. We treat the problem of divergence estimation as well as the subsequent testing for the homogeneity of two samples. In particular, we propose a nonparametric estimator for f-divergences in the case of continuous distributions, which is based on kernel density estimation and spline smoothing. As we show in extensive simulations, the new method is stable and performs quite well in comparison with several existing non- and semiparametric divergence estimators. Furthermore, we tackle the two-sample homogeneity problem using permutation tests based on various divergence estimators. In simulations, the methods are compared with an asymptotic divergence test as well as with several traditional parametric and nonparametric procedures under different distributional assumptions and alternatives. It turns out that divergence-based methods detect discrepancies between distributions more often than traditional methods if the distributions do not differ in location only. The findings are illustrated on ion mobility spectrometry data.
Keywords: Nonparametric two-sample test; Semiparametric two-sample test; Density ratio estimation; Kullback-Leibler divergence; Hellinger distance; Permutation test
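To make the idea of a divergence-based permutation test concrete, the following is a minimal sketch, not the estimator proposed in the paper. It assumes a plain plug-in estimate of the squared Hellinger distance, i.e. the f-divergence with f(t) = (sqrt(t) - 1)^2, computed from Gaussian kernel density estimates with a normal reference bandwidth and trapezoidal integration on a grid; the spline smoothing step of the proposed method is omitted. All function names (`gaussian_kde`, `hellinger_sq`, `permutation_test`) and parameter defaults are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def gaussian_kde(sample):
    """Return a Gaussian kernel density estimate of `sample` as a callable.

    The bandwidth uses the normal reference rule 1.06 * sd * n^(-1/5),
    which is an assumption of this sketch, not the paper's choice.
    """
    n = len(sample)
    h = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - xi) / h) ** 2) for xi in sample)

    return density


def hellinger_sq(x, y, grid_size=128):
    """Plug-in estimate of the integral of (sqrt(p) - sqrt(q))^2."""
    x, y = list(x), list(y)
    p, q = gaussian_kde(x), gaussian_kde(y)
    pooled = x + y
    pad = 3.0 * statistics.stdev(pooled)  # extend the grid beyond the data
    lo, hi = min(pooled) - pad, max(pooled) + pad
    step = (hi - lo) / (grid_size - 1)
    total = 0.0
    for i in range(grid_size):
        t = lo + i * step
        value = (math.sqrt(p(t)) - math.sqrt(q(t))) ** 2
        weight = 0.5 if i in (0, grid_size - 1) else 1.0  # trapezoidal rule
        total += weight * value
    return total * step


def permutation_test(x, y, n_perm=199, seed=0):
    """Return (observed statistic, permutation p-value) for H0: P = Q."""
    rng = random.Random(seed)
    x, y = list(x), list(y)
    observed = hellinger_sq(x, y)
    pooled = x + y
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        stat = hellinger_sq(pooled[:len(x)], pooled[len(x):])
        if stat >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Same location, different scale: a difference that is not in location only.
    x = [rng.gauss(0.0, 1.0) for _ in range(40)]
    y = [rng.gauss(0.0, 3.0) for _ in range(40)]
    stat, p_value = permutation_test(x, y)
    print(f"estimated squared Hellinger distance: {stat:.3f}, p-value: {p_value:.3f}")
```

The permutation p-value is valid for any test statistic under the null hypothesis of equal distributions, so the divergence estimator in `hellinger_sq` could be swapped for another one without changing `permutation_test`.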
We thank the anonymous referees for their valuable remarks, which helped us to improve this work. The authors were supported in part by the Collaborative Research Center 876, Project C3 "Multi-level statistical analysis of high-frequency spatio-temporal process data", and the Collaborative Research Center 823, Project C3 "Analysis of structural change in dynamic processes", of the German Research Foundation. Furthermore, we thank Marianna D’Addario and Dominik Kopczynski, both members of the Bioinformatics group of Prof. Dr. Sven Rahmann in the Collaborative Research Center 876, Project B1, for providing interesting real-world data for our analysis.