Abstract
A key problem in many research investigations is deciding whether two samples have the same distribution. Numerous statistical methods have been devoted to this issue, but only a few have considered a Bayesian nonparametric approach. In this paper, we propose a novel nonparametric Bayesian index (WIKS) for quantifying the difference between two populations \(P_1\) and \(P_2\), defined as a weighted posterior expectation of the Kolmogorov–Smirnov distance between \(P_1\) and \(P_2\). We present a Bayesian decision-theoretic argument to support the use of the WIKS index and a simple algorithm to compute it. Furthermore, we prove that WIKS is a statistically consistent procedure and that it controls the significance level uniformly over the null hypothesis, a feature that simplifies the choice of cutoff values for making decisions. We present a real data analysis and an extensive simulation study showing that WIKS is more powerful than competing approaches under several settings.
Notes
Common choices for this metric are the Kolmogorov–Smirnov metric, the \(L_1\) and \(L_2\) metrics, the Lévy metric and the symmetrized Kullback–Leibler metric. For a survey of metrics between probability measures, see Rachev et al. (2013).
This approach was suggested by, e.g., Swartz (1999) in a Bayesian nonparametric goodness-of-fit context.
See Proposition 3 of the Supplementary Material.
See Proposition 1 of the Supplementary Material.
In general, as K (the concentration parameter) decreases, the role of G will be less important; in fact, as K gets closer to zero, the test statistic gets closer to the Kolmogorov–Smirnov test statistic.
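To make the quantities in the abstract and in the note above concrete, the following is a minimal sketch of a Monte Carlo estimate of an unweighted posterior expectation of the Kolmogorov–Smirnov distance between two populations. It is not the WIKS algorithm itself: the weighting used by WIKS is not described in this section, so it is omitted. The sketch assumes independent Dirichlet process priors on \(P_1\) and \(P_2\), takes G to be a standard normal base measure, and evaluates the distance only on the grid of pooled observations. The function names `normal_cdf`, `posterior_cdf_draws` and `posterior_expected_ks` are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def normal_cdf(t):
    """Standard normal CDF, used here as the assumed base measure G."""
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in t])


def posterior_cdf_draws(x, grid, K, n_draws, rng):
    """Draw posterior CDFs at the points of `grid` under a DP(K, G) prior.

    Given data x, the posterior is a Dirichlet process with measure
    K*G + sum of point masses at the observations, so the masses of the
    intervals (-inf, t_1], (t_1, t_2], ..., (t_m, inf) are jointly Dirichlet.
    """
    x = np.asarray(x, dtype=float)
    G = normal_cdf(grid)
    base_mass = K * np.diff(np.concatenate(([0.0], G, [1.0])))
    # Number of observations falling in each interval of the partition.
    idx = np.searchsorted(grid, x, side="left")
    counts = np.bincount(idx, minlength=len(grid) + 1)
    alpha = base_mass + counts
    # Normalised gamma variables give a Dirichlet draw and tolerate
    # intervals whose parameter is numerically zero.
    g = rng.standard_gamma(alpha, size=(n_draws, len(alpha)))
    probs = g / g.sum(axis=1, keepdims=True)
    # Cumulative sums (dropping the last cell) are the CDF values on the grid.
    return np.cumsum(probs, axis=1)[:, :-1]


def posterior_expected_ks(x1, x2, K=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[ max_t |F1(t) - F2(t)| given the data ]."""
    rng = np.random.default_rng(seed)
    grid = np.unique(np.concatenate((x1, x2)))
    F1 = posterior_cdf_draws(x1, grid, K, n_draws, rng)
    F2 = posterior_cdf_draws(x2, grid, K, n_draws, rng)
    return float(np.mean(np.max(np.abs(F1 - F2), axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    a = rng.normal(size=100)
    b = rng.normal(size=100)            # same distribution as a
    c = rng.normal(loc=1.0, size=100)   # shifted distribution
    print("same distribution:   ", posterior_expected_ks(a, b))
    print("shifted distribution:", posterior_expected_ks(a, c))
```

Because the maximum is taken over the pooled observations only, the value can slightly understate the supremum over the whole real line. Lowering `K` reduces the influence of the assumed base measure on each posterior draw, in line with the note above.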
References
Al Labadi L, Zarepour M (2014) Goodness-of-fit tests based on the distance between the Dirichlet process and its base measure. J Nonparametric Stat 26(2):341–357
Basu S, Chib S (2003) Marginal likelihood and Bayes factors for Dirichlet process mixture models. J Am Stat Assoc 98(461):224–235
Berger JO, Guglielmi A (2001) Bayesian and conditional frequentist testing of a parametric model versus nonparametric alternatives. J Am Stat Assoc 96(453):174–184
Cecato JF, Martinelli JE, Izbicki R, Yassuda MS, Aprahamian I (2016) A subtest analysis of the Montreal Cognitive Assessment (MoCA): which subtests can best discriminate between healthy controls, mild cognitive impairment and Alzheimer’s disease? Int Psychogeriatrics 28(5):825–832
Chen Y, Hanson TE (2014) Bayesian nonparametric k-sample tests for censored and uncensored data. Comput Stat Data Anal 71:335–346
Cuevas A, Febrero M, Fraiman R (2004) An ANOVA test for functional data. Comput Stat Data Anal 47(1):111–122
DeGroot MH (1970) Optimal statistical decisions. McGraw-Hill, New York
Duong T, Goud B, Schauer K (2012) Closed-form density-based framework for automatic detection of cellular morphology changes. Proc Natl Acad Sci 109(22):8382–8387
Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1(4):209–230
Ferguson TS (1974) Prior distributions on spaces of probability measures. Ann Stat 2(4):615–629
Florens JP, Richard JF, Rolin JM (1996) Bayesian encompassing specification tests of a parametric model against a non parametric alternative. Working Papers 96.08, Catholique de Louvain - Institut de statistique
Good IJ (1992) The Bayes/non-Bayes compromise: a brief review. J Am Stat Assoc 87:597–606
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(Mar):723–773
Hjort NL et al (1990) Nonparametric Bayes estimators based on beta processes in models for life history data. Ann Stat 18(3):1259–1294
Holmes CC, Caron F, Griffin JE, Stephens DA (2015) Two-sample Bayesian nonparametric hypothesis testing. Bayesian Anal 10(2):297–320
Jeffreys H (1961) The theory of probability. Oxford University Press, Oxford
Kolmogorov AN (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn Ist Ital Attuar 4:83–91
Komárek A (2014) mixAk: Multivariate normal mixture models and mixtures of generalized linear mixed models including model based clustering. R package version 3
Lavine M et al (1992) Some aspects of Pólya tree distributions for statistical modelling. Ann Stat 20(3):1222–1235
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
Pfister N, Bühlmann P, Schölkopf B, Peters J (2018) Kernel-based tests for joint independence. J R Stat Soc Ser B (Stat Methodol) 80(1):5–31
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rachev ST, Klebanov L, Stoyanov SV, Fabozzi F (2013) The methods of distances in the theory of probability and statistics. Springer, New York
Ramdas A, Trillos NG, Cuturi M (2017) On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2):47
Sethuraman J (1994) A constructive definition of Dirichlet priors. Stat Sin 4(2):639–650
Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions. Ann Math Stat 19(2):279–281
Srivastava R, Li P, Ruppert D (2016) RAPTT: an exact two-sample test in high dimensions using random projections. J Comput Graph Stat 25(3):954–970
Swartz T (1999) Nonparametric goodness-of-fit. Commun Stat Theory Methods 28(12):2821–2841
Székely GJ, Rizzo ML et al (2004) Testing for equal distributions in high dimension. InterStat 5(16.10):1249–1272
Wei S, Lee C, Wichers L, Marron JS (2016) Direction-projection-permutation for high-dimensional hypothesis tests. J Comput Graph Stat 25(2):549–569
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bull 1(6):80–83
Acknowledgements
The authors are also grateful for the suggestions given by Danilo Lourenço Lopes, José Galvão Leite, the anonymous referees and the editors.
This work was partially supported by FAPESP – Fundação de Amparo à Pesquisa do Estado de São Paulo, Grants 2017/03363-8 and 2019/11321-9 and CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico, Grant PQ 306943/2017-4.
Cite this article
de Carvalho Ceregatti, R., Izbicki, R. & Bueno Salasar, L.E. WIKS: a general Bayesian nonparametric index for quantifying differences between two populations. TEST 30, 274–291 (2021). https://doi.org/10.1007/s11749-020-00718-y
DOI: https://doi.org/10.1007/s11749-020-00718-y