On some graph-based two-sample tests for high dimension, low sample size data

  • Soham Sarkar
  • Rahul Biswas
  • Anil K. Ghosh

Abstract

Testing for equality of two high-dimensional distributions is a challenging problem, and this becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, which can be used for data of arbitrary dimensions. Most of these test statistics are computed using pairwise Euclidean distances among the observations. But, due to concentration of pairwise Euclidean distances, these tests have poor performance in many high-dimensional problems. Some of them can have powers even below the nominal level when the scale-difference between two distributions dominates the location-difference. To overcome these limitations, we introduce some new dissimilarity indices and use them to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage, and as a result, they outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated as well as real data sets, we demonstrate their usefulness in high dimension, low sample size situations.
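To make the idea of a graph-based two-sample test concrete, the following is a minimal sketch of the classical minimum-spanning-tree test computed from pairwise Euclidean distances, calibrated by permutation. It illustrates the kind of baseline test the abstract refers to, not the modified tests proposed in the article, whose dissimilarity indices are not specified here. The function names (`mst_edges`, `cross_edge_count`, `mst_permutation_test`) and the simulated data are assumptions made for illustration only.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each vertex
    best_from = [-1] * n         # tree vertex providing that cheapest link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Number of tree edges joining observations from different samples."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def mst_permutation_test(sample_x, sample_y, n_perm=500, seed=0):
    """Hypothetical helper: returns (observed cross-edge count, permutation p-value).

    Few cross-sample edges suggest the samples are separated, so the
    p-value is the share of label permutations with a count this small or smaller.
    """
    rng = random.Random(seed)
    pooled = list(sample_x) + list(sample_y)
    labels = [0] * len(sample_x) + [1] * len(sample_y)
    edges = mst_edges(pooled)    # the tree depends on the pooled data only, not on labels
    observed = cross_edge_count(edges, labels)
    hits = 0
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)
        if cross_edge_count(edges, permuted) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Simulated high dimension, low sample size data (assumed setting):
    # 10 observations per sample in dimension 200, differing in location.
    rng = random.Random(42)
    dim, size = 200, 10
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
    y = [[rng.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(size)]
    print(mst_permutation_test(x, y))
```

Because the tree is built once from the pooled sample, each permutation only recounts cross-sample edges, which keeps the permutation step cheap. Replacing `euclidean` with another dissimilarity would change the tree and hence the test, which is the general direction the abstract describes.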

Keywords

Distance concentration; High-dimensional consistency; Minimum spanning tree; Nearest neighbor; Non-bipartite matching; Permutation test; Shortest Hamiltonian path

Notes

Acknowledgements

This research was partially supported by Keysight Technologies, Inc., USA.

Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Institute of Mathematics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  2. Department of Statistics, University of Washington, Seattle, USA
  3. Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata, India
