Abstract
Exact conditional goodness-of-fit tests for discrete exponential family models can be conducted via Monte Carlo estimation of p values by sampling from the conditional distribution of multiway contingency tables. The two most popular methods for such sampling are Markov chain Monte Carlo (MCMC) and sequential importance sampling (SIS). In this work we consider various ways to hybridize the two schemes and propose one standout strategy as a good general purpose method for conducting inference. The proposed method runs many parallel chains initialized at SIS samples across the fiber. When a Markov basis is unavailable, the proposed scheme uses a lattice basis with intermittent SIS proposals to guarantee irreducibility and asymptotic unbiasedness. The scheme alleviates many of the challenges faced by the MCMC and SIS schemes individually while largely retaining their strengths. It also provides diagnostics that guide and lend credibility to the procedure. Simulations demonstrate the viability of the approach.
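To make the Monte Carlo p value idea concrete, the following is a minimal sketch of the MCMC ingredient alone, not the hybrid scheme proposed in the article. It assumes the simplest setting, the independence model for a two-way table, where the fiber is the set of nonnegative integer tables with the observed row and column sums and the basic moves on 2×2 minors are known to connect it. The function names `chi_square` and `mcmc_p_value`, the burn-in length, and the choice of Pearson's statistic are illustrative assumptions.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a two-way table under independence."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def mcmc_p_value(observed, n_steps=20000, burn_in=1000, seed=0):
    """Estimate the exact conditional p value with a Metropolis walk.

    Assumption: two-way independence model, so the target on the fiber
    is proportional to 1 / prod(n_ij!) and basic moves suffice.
    """
    rng = random.Random(seed)
    table = [list(r) for r in observed]
    n_rows, n_cols = len(table), len(table[0])
    obs_stat = chi_square(observed)
    hits = 0
    for step in range(burn_in + n_steps):
        # Pick two distinct rows and two distinct columns (ordered),
        # which gives a symmetric proposal over signed basic moves.
        i, k = rng.sample(range(n_rows), 2)
        j, l = rng.sample(range(n_cols), 2)
        # Proposed move: +1 at (i, j) and (k, l); -1 at (i, l) and (k, j).
        n_il, n_kj = table[i][l], table[k][j]
        if n_il > 0 and n_kj > 0:  # otherwise the move leaves the fiber
            # Metropolis ratio of target probabilities, new over current.
            ratio = (n_il * n_kj) / ((table[i][j] + 1) * (table[k][l] + 1))
            if rng.random() < ratio:
                table[i][j] += 1
                table[k][l] += 1
                table[i][l] -= 1
                table[k][j] -= 1
        # Count states at least as extreme as the observed table.
        if step >= burn_in and chi_square(table) >= obs_stat - 1e-9:
            hits += 1
    return hits / n_steps


if __name__ == "__main__":
    print(mcmc_p_value([[3, 1], [1, 3]]))
```

For the toy table `[[3, 1], [1, 3]]` the fiber has only five tables, and direct enumeration gives an exact conditional p value of 34/70, so the estimate can be checked by hand. In models where no Markov basis is available, a walk of this kind may fail to reach parts of the fiber, which is the difficulty the intermittent SIS proposals described above are meant to address.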
Notes
This is the corrected value; the value printed in that article is generally known to have been a typographical error.
Acknowledgements
D. K. and R. Y. are supported by the National Science Foundation under Grant Nos. 1622449 and 1622369, respectively. The authors would like to thank an anonymous referee for suggesting that the validity of the scheme should be considered through the lens of unbiasedness.
Cite this article
Kahle, D., Yoshida, R. & Garcia-Puente, L. Hybrid schemes for exact conditional inference in discrete exponential families. Ann Inst Stat Math 70, 983–1011 (2018). https://doi.org/10.1007/s10463-017-0615-z
DOI: https://doi.org/10.1007/s10463-017-0615-z