WIKS: a general Bayesian nonparametric index for quantifying differences between two populations

  • Original Paper
  • Journal: TEST

Abstract

A key problem in many research investigations is to decide whether two samples have the same distribution. Numerous statistical methods have been devoted to this issue, but only a few have considered a Bayesian nonparametric approach. In this paper, we propose a novel nonparametric Bayesian index (WIKS) for quantifying the difference between two populations \(P_1\) and \(P_2\), which is defined by a weighted posterior expectation of the Kolmogorov–Smirnov distance between \(P_1\) and \(P_2\). We present a Bayesian decision-theoretic argument to support the use of the WIKS index and a simple algorithm to compute it. Furthermore, we prove that WIKS is a statistically consistent procedure and that it controls the significance level uniformly over the null hypothesis, a feature that simplifies the choice of cutoff values for making decisions. We present a real data analysis and an extensive simulation study showing that WIKS is more powerful than competing approaches under several settings.
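As a rough illustration of the distance on which the index is built, the Kolmogorov–Smirnov distance between two distributions is the largest absolute difference between their cumulative distribution functions. The sketch below computes its empirical two-sample version using only the Python standard library. The function name `ks_distance` and the sample values are illustrative assumptions; this is not the WIKS algorithm of the paper, since neither the weighting nor the posterior expectation is reproduced here.

```python
from bisect import bisect_right


def ks_distance(x, y):
    """Empirical two-sample Kolmogorov-Smirnov distance: max_t |F_x(t) - F_y(t)|.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    xs, ys = sorted(x), sorted(y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    best = 0.0
    # Both empirical CDFs are right-continuous step functions, so the largest
    # gap between them is attained at one of the observed points.
    for t in xs + ys:
        fx = bisect_right(xs, t) / len(xs)  # share of x-values <= t
        fy = bisect_right(ys, t) / len(ys)  # share of y-values <= t
        best = max(best, abs(fx - fy))
    return best


if __name__ == "__main__":
    # Made-up samples, used only to show the call.
    print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value of 0 means the two empirical CDFs coincide, and a value of 1 means the samples do not overlap at all.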

Notes

  1. Common choices for this metric are the Kolmogorov–Smirnov metric, the \(L_1\) and \(L_2\) metrics, the Lévy metric and the symmetrized Kullback–Leibler metric. For a survey of metrics between probability measures, see Rachev et al. (2013).

  2. This approach was suggested by, e.g., Swartz (1999) in a Bayesian nonparametric goodness-of-fit context.

  3. See Proposition 3 of the Supplementary Material.

  4. See Proposition 1 of the Supplementary Material.

  5. In general, as K (the concentration parameter) decreases, the role of G will be less important; in fact, as K gets closer to zero, the test statistic gets closer to the Kolmogorov–Smirnov test statistic.
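The effect described in Note 5 can be sketched under one explicit assumption: that the prior is a Dirichlet process with concentration parameter K and base distribution G, in the sense of Ferguson (1973). In that case the posterior mean CDF given n observations is the mixture K/(K+n) G + n/(K+n) F_n, where F_n is the empirical CDF, so the weight on G vanishes as K approaches zero. The function names and the standard normal choice of G below are illustrative assumptions, and the code shows only this posterior mean, not the WIKS statistic itself.

```python
from bisect import bisect_right
from math import erf, sqrt


def std_normal_cdf(t):
    """CDF of the standard normal, used here as an assumed base distribution G."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def dp_posterior_mean_cdf(t, data, K, base_cdf=std_normal_cdf):
    """Posterior mean CDF at t under an assumed DP(K, G) prior.

    Hypothetical helper: mixes the base CDF and the empirical CDF with
    weights K / (K + n) and n / (K + n).
    """
    n = len(data)
    if n == 0:
        return base_cdf(t)  # no data: the posterior mean equals the prior mean
    ecdf = bisect_right(sorted(data), t) / n
    w = K / (K + n)  # weight given to the base distribution G
    return w * base_cdf(t) + (1.0 - w) * ecdf


if __name__ == "__main__":
    sample = [0.2, 0.5, 1.1, 1.7]  # made-up data for illustration
    for K in (10.0, 1.0, 0.01):
        print(K, dp_posterior_mean_cdf(1.0, sample, K))
```

As K shrinks, the printed values move toward the empirical CDF at the evaluation point, which is consistent with G playing a smaller role.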

References

  • Al Labadi L, Zarepour M (2014) Goodness-of-fit tests based on the distance between the Dirichlet process and its base measure. J Nonparametric Stat 26(2):341–357

  • Basu S, Chib S (2003) Marginal likelihood and Bayes factors for Dirichlet process mixture models. J Am Stat Assoc 98(461):224–235

  • Berger JO, Guglielmi A (2001) Bayesian and conditional frequentist testing of a parametric model versus nonparametric alternatives. J Am Stat Assoc 96(453):174–184

  • Cecato JF, Martinelli JE, Izbicki R, Yassuda MS, Aprahamian I (2016) A subtest analysis of the Montreal Cognitive Assessment (MoCA): which subtests can best discriminate between healthy controls, mild cognitive impairment and Alzheimer’s disease? Int Psychogeriatrics 28(5):825–832

  • Chen Y, Hanson TE (2014) Bayesian nonparametric k-sample tests for censored and uncensored data. Comput Stat Data Anal 71:335–346

  • Cuevas A, Febrero M, Fraiman R (2004) An ANOVA test for functional data. Comput Stat Data Anal 47(1):111–122

  • DeGroot MH (1970) Optimal statistical decisions. McGraw-Hill, New York

  • Duong T, Goud B, Schauer K (2012) Closed-form density-based framework for automatic detection of cellular morphology changes. Proc Natl Acad Sci 109(22):8382–8387

  • Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1(4):209–230

  • Ferguson TS (1974) Prior distributions on spaces of probability measures. Ann Stat 2(4):615–629

  • Florens JP, Richard JF, Rolin JM (1996) Bayesian encompassing specification tests of a parametric model against a non parametric alternative. Working Papers 96.08, Catholique de Louvain - Institut de statistique

  • Good IJ (1992) The Bayes/non-Bayes compromise: a brief review. J Am Stat Assoc 87:597–606

  • Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(Mar):723–773

  • Hjort NL et al (1990) Nonparametric Bayes estimators based on beta processes in models for life history data. Ann Stat 18(3):1259–1294

  • Holmes CC, Caron F, Griffin JE, Stephens DA (2015) Two-sample Bayesian nonparametric hypothesis testing. Bayesian Anal 10(2):297–320

  • Jeffreys H (1961) The theory of probability. Oxford University Press, Oxford

  • Kolmogorov AN (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn Ist Ital Attuar 4:83–91

  • Komárek A (2014) mixAk: Multivariate normal mixture models and mixtures of generalized linear mixed models including model based clustering. R package version 3

  • Lavine M et al (1992) Some aspects of Polya tree distributions for statistical modelling. Ann Stat 20(3):1222–1235

  • Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60

  • Pfister N, Bühlmann P, Schölkopf B, Peters J (2018) Kernel-based tests for joint independence. J R Stat Soc Ser B (Stat Methodol) 80(1):5–31

  • R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Rachev ST, Klebanov L, Stoyanov SV, Fabozzi F (2013) The methods of distances in the theory of probability and statistics. Springer, New York

  • Ramdas A, Trillos NG, Cuturi M (2017) On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2):47

  • Sethuraman J (1994) A constructive definition of Dirichlet priors. Stat Sin 4(2):639–650

  • Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions. Ann Math Stat 19(2):279–281

  • Srivastava R, Li P, Ruppert D (2016) RAPTT: an exact two-sample test in high dimensions using random projections. J Comput Graph Stat 25(3):954–970

  • Swartz T (1999) Nonparametric goodness-of-fit. Commun Stat Theory Methods 28(12):2821–2841

  • Székely GJ, Rizzo ML et al (2004) Testing for equal distributions in high dimension. InterStat 5(16.10):1249–1272

  • Wei S, Lee C, Wichers L, Marron JS (2016) Direction-projection-permutation for high-dimensional hypothesis tests. J Comput Graph Stat 25(2):549–569

  • Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bull 1(6):80–83

Acknowledgements

The authors are grateful for the suggestions given by Danilo Lourenço Lopes, José Galvão Leite, the anonymous referees and the editors.

Author information

Corresponding author

Correspondence to Luis Ernesto Bueno Salasar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by FAPESP – Fundação de Amparo à Pesquisa do Estado de São Paulo, Grants 2017/03363-8 and 2019/11321-9 and CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico, Grant PQ 306943/2017-4.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3236 KB)

About this article

Cite this article

de Carvalho Ceregatti, R., Izbicki, R. & Bueno Salasar, L.E. WIKS: a general Bayesian nonparametric index for quantifying differences between two populations. TEST 30, 274–291 (2021). https://doi.org/10.1007/s11749-020-00718-y

  • DOI: https://doi.org/10.1007/s11749-020-00718-y
