Abstract
We describe multistage Markov chain Monte Carlo (MSMCMC) procedures which, in addition to estimating the total number of contingency tables with given positive row and column sums, estimate the number, \(Q\), and the proportion, \(P\), of those tables that satisfy an additional, possibly nonlinear, constraint. Three options, A, B, and C, are studied. Options A and B exploit locally optimal statistical properties, whereas judicious assignment of a particular parameter of Option C allows estimation with approximately minimal standard error. Ten examples of varying dimensions and total entries illustrate and compare the procedures, where \(Q\) and \(P\) denote the number and proportion of chi-squared statistics less than a given value. For both small- and large-dimensional tables, the comparisons favor Options A and B for moderate \(P\) and Option C for small \(P\). Additional comparison with sequential importance sampling (SIS) estimates favors SIS for small-dimensional tables and moderate \(P\) but favors Option C for large-dimensional tables for both small and moderate \(P\). The proposed options extend an earlier MSMCMC technique for estimating the total count and, in principle, can be further extended to incorporate additional constraints.
Notes
See Diaconis and Gangolli (1995) for a review of other approximating methods.
Minor adjustments are necessary at the boundaries.
See Liu and Chen (1998) for the theoretical framework for SIS in a Monte Carlo setting.
For more on how the inverse-temperature schedule affects \(\xi _1,\ldots ,\xi _r\), see Štefankovič et al. (2009).
Note that \(g(\mathbf{x})\le g_{\max }(\mathcal{A})\) for some \(\mathbf{x}\in \mathcal{X}\) does not imply \(\mathbf{x}\in \mathcal{A}\).
The C code, ct.chisq.c, is accessible at http://www.unc.edu~gfish/.
The estimates were computed by using an augmented version of the C code used in Chen et al. (2005) for solely estimating \(|\mathcal{A}|\). The augmentation added negligible CPU time. For input, all tables were ordered from smallest to largest row sums and from largest to smallest column sums.
References
Andrews DF, Herzberg AM (1985) Data: a collection of problems from many fields for the student and research worker. Springer, New York
Aoki S, Hara H, Takemura A (2012) Markov bases in algebraic statistics. Springer, New York
Bezáková I (2008) Sampling binary contingency tables. Comput Sci Eng 10:26–31
Bezáková I, Sinclair A, Štefankovič D, Vigoda E (2012) Negative examples for sequential importance sampling of binary contingency tables. Algorithmica 54:606–620
Blanchet J (2009) Efficient importance sampling for binary contingency tables. Ann Appl Prob 19:949–982
Bunea F, Besag J (2000) MCMC in \(I \times J \times K\) contingency tables, Fields Institute Communications, 26. American Mathematical Society, Providence
Chen Y, Diaconis P, Holmes SP, Liu JS (2005) Sequential Monte Carlo methods for statistical analysis of tables. J Am Stat Assoc 100:109–120
Cox LH (2007) Contingency tables of network type: models, Markov basis and applications. Statistica Sinica 17:1371–1393
De Loera JA, Haws D, Hemmecke R, Huggins P, Tauzer J, Yoshida R (2003) A User’s Guide for LattE, 1, software package LattE is available at http://www.math.ucdavis.edu/~latte/
Diaconis P, Efron B (1985) Testing for independence in a two-way table: new interpretation of the chi-square statistic. Ann Stat 13:845–874
Diaconis P, Gangolli A (1995) Rectangular arrays with fixed marginals. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, vol 72. Springer, New York, pp 15–42
Diaconis P, Holmes S (1995) Three examples of Monte–Carlo chains: at the interface of statistical computing, computer science, and statistical mechanics. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, vol 72. Springer, New York, pp 43–56
Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Ann Stat 26:363–397
Dyer M, Kannan R, Mount J (1997) Sampling contingency tables. Rand Struct Algorithms 10:487–506
Fienberg SE (2007) The analysis of cross-classified categorical data, 2nd edn. Springer, New York
Fishman GS (2012) Counting contingency tables via multistage Markov Chain Monte Carlo. J Comput Graph Stat 21:713–738
Goodman LA (1979) Simple models for the analysis of association in cross-classifications having ordered categories. J Am Stat Assoc 74:537–552
Jerrum M, Valiant L, Vazirani V (1986) Random generation of combinatorial structures from a uniform distribution. Theor Comput Sci 43:169–188
Kalantari B, Lari I, Rizzi A, Simeone B (1993) Sharp bounds for the maximum of the chi-square index in a class of contingency tables with given marginals. Comput Stat Data Anal 16:19–34
Koch G, Amara J, Atkinson S, Stanish W (1983) Overview of categorical data analysis methods. SAS-SUGI 8:785–795
Liu JS, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. J Am Stat Assoc 93:1032–1044
Mango A (2003) On the normalization of \(\chi ^2\) base contingency tables. Dev Appl Stat 19:11–123
Rapallo F, Yoshida R (2010) Markov bases and subbases for bounded contingency tables. Ann Inst Stat Math 62:785–805
Sabatti C (2002) Measuring dependency with volume tests. Am Stat 56:1–5
SAS, Example 3.6 Output Data Set of Chi-Square Statistics, http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
Štefankovič D, Vempala S, Vigoda E (2009) Adaptive simulated annealing: a near-optimal connection between sampling and counting. J Assoc Comput Mach 56(3):1–36
Zipunnikov V, Booth JG, Yoshida R (2009) Counting tables using the double-saddlepoint approximation. J Comput Graph Stat 18:915–929
Additional information
Professor emeritus of operations research. The author is grateful to David Rubin for helpful discussions on computing a lower bound for \(g_{\max }(\mathcal{A})\) in Sect. 2.4, to Vadim Zipunnikov for his guidance in executing the double-saddlepoint approximation, and to Yuguo Chen for kindly providing the R code for sequential importance sampling.
Appendix
Algorithm \(\sim \)GMAX first generates a greedy solution \(\mathbf{x}\), satisfying row and column constraints, by sequentially maximizing the contribution of each of the \(a\times b\) cells in the table. It then computes
Using the definitions
it then iteratively augments \(\chi ^2_{\mathrm{old}}\) by sequentially maximizing the contributions made by every allowable four-cell combination \((i,j),~(i,l),~(k,j),~(k,l)\) where \(i\ne k\) and \(j\ne l\). If \(B<0\), \(A=A_{\min }\) maximizes the increment; if \(B>0\), \(A=A_{\max }\) maximizes the increment.
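The four-cell augmentation step can be illustrated with a minimal Python sketch. It is not the author's C implementation, and it rests on stated assumptions: a shift `t` is added to cells \((i,j)\) and \((k,l)\) and subtracted from cells \((i,l)\) and \((k,j)\), which leaves all row and column sums unchanged, and the chi-squared statistic is the usual Pearson form with expected counts \(r_i c_j/n\). Under these assumptions the change in chi-squared is a convex quadratic in `t`, so its maximum over the feasible range lies at an endpoint; the sketch simply evaluates both endpoints rather than using the sign rule on \(B\). The names `chi_square`, `four_cell_gain`, and `improve_by_four_cell_moves` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def chi_square(x, rows, cols):
    """Pearson chi-squared of table x with expected counts rows[i]*cols[j]/n."""
    n = sum(rows)
    total = Fraction(0)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            e = Fraction(r * c, n)
            total += (x[i][j] - e) ** 2 / e
    return total


def four_cell_gain(x, rows, cols, i, k, j, l, t):
    """Change in chi-squared when t is added to (i,j),(k,l) and removed from (i,l),(k,j)."""
    n = sum(rows)
    gain = Fraction(0)
    for p, q, sign in ((i, j, 1), (k, l, 1), (i, l, -1), (k, j, -1)):
        e = Fraction(rows[p] * cols[q], n)
        v = x[p][q]
        gain += ((v + sign * t - e) ** 2 - (v - e) ** 2) / e
    return gain


def improve_by_four_cell_moves(x, rows, cols):
    """Apply endpoint four-cell moves until none strictly increases chi-squared."""
    x = [list(row) for row in x]  # work on a copy
    improved = True
    while improved:
        improved = False
        for i, k in combinations(range(len(rows)), 2):
            for j, l in combinations(range(len(cols)), 2):
                # Feasible shifts keep all four cells nonnegative.
                t_lo = -min(x[i][j], x[k][l])
                t_hi = min(x[i][l], x[k][j])
                # Convex quadratic in t: the best shift is one of the endpoints.
                best = max(
                    (t_lo, t_hi),
                    key=lambda t: four_cell_gain(x, rows, cols, i, k, j, l, t),
                )
                if four_cell_gain(x, rows, cols, i, k, j, l, best) > 0:
                    x[i][j] += best
                    x[k][l] += best
                    x[i][l] -= best
                    x[k][j] -= best
                    improved = True
    return x
```

Exact rational arithmetic is used so that every accepted move strictly increases chi-squared, which guarantees termination on the finite set of tables with the given margins. The result is a local maximum and hence only a lower bound on the largest attainable chi-squared.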
Cite this article
Fishman, G.S. Counting subsets of contingency tables. Comput Stat 29, 159–187 (2014). https://doi.org/10.1007/s00180-013-0442-5
DOI: https://doi.org/10.1007/s00180-013-0442-5