Hybrid schemes for exact conditional inference in discrete exponential families


Abstract

Exact conditional goodness-of-fit tests for discrete exponential family models can be conducted via Monte Carlo estimation of p values by sampling from the conditional distribution of multiway contingency tables. The two most popular methods for such sampling are Markov chain Monte Carlo (MCMC) and sequential importance sampling (SIS). In this work we consider various ways to hybridize the two schemes and propose one standout strategy as a good general purpose method for conducting inference. The proposed method runs many parallel chains initialized at SIS samples across the fiber. When a Markov basis is unavailable, the proposed scheme uses a lattice basis with intermittent SIS proposals to guarantee irreducibility and asymptotic unbiasedness. The scheme alleviates many of the challenges faced by the MCMC and SIS schemes individually while largely retaining their strengths. It also provides diagnostics that guide and lend credibility to the procedure. Simulations demonstrate the viability of the approach.
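The multi-chain idea can be illustrated in the simplest setting, a two-way table under the independence model, where the basic +1/-1 moves are known to connect the fiber. The sketch below is not the authors' implementation: the function names are hypothetical, margins are assumed positive, and the dispersed starting tables are drawn by shuffling column labels as a simple stand-in for SIS samples. It runs several Metropolis chains from those starts and pools them to estimate the p value of Pearson's chi-square statistic.

```python
import math
import random


def chi_square(table):
    """Pearson chi-square statistic for independence (margins assumed positive)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def log_weight(table):
    """Log of the conditional probability given the margins, up to a constant."""
    return -sum(math.lgamma(x + 1) for row in table for x in row)


def random_start(table, rng):
    """Table with the same margins, built by shuffling column labels.

    Stand-in for an SIS draw; used here only to spread starts over the fiber.
    """
    rows = [i for i, r in enumerate(table) for _ in range(sum(r))]
    cols = [j for j, c in enumerate(zip(*table)) for _ in range(sum(c))]
    rng.shuffle(cols)
    start = [[0] * len(table[0]) for _ in table]
    for i, j in zip(rows, cols):
        start[i][j] += 1
    return start


def metropolis_step(table, rng):
    """One Metropolis step using a random basic move on two rows and two columns."""
    i1, i2 = rng.sample(range(len(table)), 2)
    j1, j2 = rng.sample(range(len(table[0])), 2)
    proposal = [row[:] for row in table]
    proposal[i1][j1] += 1
    proposal[i2][j2] += 1
    proposal[i1][j2] -= 1
    proposal[i2][j1] -= 1
    if proposal[i1][j2] < 0 or proposal[i2][j1] < 0:
        return table  # the move leaves the fiber, so stay put
    log_ratio = log_weight(proposal) - log_weight(table)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return table


def estimate_p_value(observed, n_chains=20, n_steps=2000, burn_in=200, seed=0):
    """Pool several chains to estimate P(statistic >= observed statistic)."""
    rng = random.Random(seed)
    threshold = chi_square(observed) - 1e-9  # tolerance for floating-point ties
    hits = total = 0
    for _ in range(n_chains):
        table = random_start(observed, rng)
        for step in range(n_steps):
            table = metropolis_step(table, rng)
            if step >= burn_in:
                hits += chi_square(table) >= threshold
                total += 1
    return hits / total
```

For example, `estimate_p_value([[3, 1, 2], [0, 4, 1]])` returns a Monte Carlo estimate of the exact conditional p value for that table. Running independent chains from dispersed starts also makes it possible to compare chains against one another as a rough convergence check, which is the kind of diagnostic the abstract alludes to.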

Keywords

Contingency tables; Exact inference; Markov chain Monte Carlo; Sequential importance sampling; Algebraic statistics


Copyright information

© The Institute of Statistical Mathematics, Tokyo 2017

Authors and Affiliations

  • David Kahle (1)
  • Ruriko Yoshida (2)
  • Luis Garcia-Puente (3)

  1. Department of Statistical Science, Baylor University, Waco, USA
  2. Department of Operations Research, Naval Postgraduate School, Monterey, USA
  3. Department of Mathematics and Statistics, Sam Houston State University, Huntsville, USA
