Counting subsets of contingency tables

Fishman, George S.

doi:10.1007/s00180-013-0442-5

Original Paper
Published: 20 August 2013

Volume 29, pages 159–187, (2014)

Computational Statistics

George S. Fishman¹


Abstract

We describe multistage Markov chain Monte Carlo (MSMCMC) procedures which, in addition to estimating the total number of contingency tables with given positive row and column sums, estimate the number, $Q$, and the proportion, $P$, of those tables that satisfy an additional, possibly nonlinear, constraint. Three options, A, B, and C, are studied. Options A and B exploit locally optimal statistical properties, whereas judicious assignment of a particular parameter of Option C allows estimation with approximately minimal standard error. Ten examples of varying dimensions and total entries illustrate and compare the procedures, where $Q$ and $P$ denote the number and proportion of tables whose chi-squared statistic is less than a given value. For both small and large dimensional tables, the comparisons favor Options A and B for moderate $P$ and Option C for small $P$. Additional comparison with sequential importance sampling (SIS) estimates favors SIS for small dimensional tables and moderate $P$ but favors Option C for large dimensional tables for both small and moderate $P$. The proposed options extend an earlier MSMCMC technique for estimating total count and, in principle, can be further extended to incorporate additional constraints.
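To make the quantities $Q$ and $P$ concrete, the following minimal sketch counts them exactly by brute-force enumeration for a very small table. It is not the MSMCMC procedure of the paper, which targets tables far too large to enumerate; the function names (`enumerate_tables`, `count_subset`) and the example margins are hypothetical and chosen only for illustration.

```python
def compositions(total, caps):
    """Yield tuples of non-negative ints, one per cap, summing to `total`
    with each entry no larger than its cap."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest


def enumerate_tables(row_sums, col_sums):
    """Yield every non-negative integer table with the given margins."""
    def fill(i, remaining):
        if i == len(row_sums):
            if all(c == 0 for c in remaining):
                yield ()
            return
        for row in compositions(row_sums[i], remaining):
            left = tuple(c - x for c, x in zip(remaining, row))
            for rest in fill(i + 1, left):
                yield (row,) + rest
    yield from fill(0, tuple(col_sums))


def chi_squared(table, row_sums, col_sums):
    """Chi-squared statistic m * (sum x_ij^2 / (m_i. m_.j) - 1)."""
    m = sum(row_sums)
    s = sum(x * x / (r * c)
            for row, r in zip(table, row_sums)
            for x, c in zip(row, col_sums))
    return m * (s - 1.0)


def count_subset(row_sums, col_sums, threshold):
    """Return (total count, Q, P) for the constraint chi-squared < threshold."""
    total = q = 0
    for table in enumerate_tables(row_sums, col_sums):
        total += 1
        if chi_squared(table, row_sums, col_sums) < threshold:
            q += 1
    return total, q, q / total


if __name__ == "__main__":
    print(count_subset((2, 2), (2, 2), 1.0))
```

Enumeration of this kind is feasible only for tiny margins, which is the reason sampling-based estimators such as those in the abstract are needed.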


Notes

See Diaconis and Gangolli (1995) for a review of other approximating methods.
For additional discussion, see Bunea and Besag (2000) and Sabatti (2002).
For discussions of the known limitations of SIS when applied to binary contingency tables, see Bezáková (2008), Bezáková et al. (2012), and Blanchet (2009).
Minor adjustments are necessary at the boundaries.
See Liu and Chen (1998) for the theoretical framework for SIS in a Monte Carlo setting.
It is known that SIS has limitations for some binary contingency tables as the size of the table grows. See Bezáková (2008), Bezáková et al. (2012), and Blanchet (2009).
For more on how the inverse-temperature schedule affects $\xi _1,\ldots ,\xi _r$, see Štefankovič et al. (2009).
Note that $g(\mathbf{x})\le g_{\max }(\mathcal{A})$ for some $\mathbf{x}\in \mathcal{X}$ does not imply $\mathbf{x}\in \mathcal{A}$.
For other heuristics, see Kalantari et al. (1993) and Mango (2003).
Those for Examples A.1 through A.8 originally appeared in Fishman (2012). LattE macchiato software computed the exact counts for Examples A.1 through A.7 (De Loera et al. 2003), and the double-saddlepoint method computed the approximations for Examples A.8, A.9, and A.10 (Zipunnikov et al. 2009).
Table 3 MSMCMC estimated and exact number of contingency tables
The C code, ct.chisq.c, is accessible at http://www.unc.edu~gfish/.
The estimates were computed by using an augmented version of the C code used in Chen et al. (2005) for solely estimating $|\mathcal{A}|$. The augmentation added negligible CPU time. For input, all tables were ordered from smallest to largest row sums and from largest to smallest column sums.

References

Andrews DF, Herzberg AM (1985) Data: a collection of problems from many fields for the student and research worker. Springer, New York
Aoki S, Hara H, Takemura A (2012) Markov bases in algebraic statistics. Springer, New York
Bezáková I (2008) Sampling binary contingency tables. Comput Sci Eng 10:26–31
Bezáková I, Sinclair A, Štefankovič A, Vigoda E (2012) Negative examples for sequential importance sampling of binary contingency tables. Algorithmica 54:606–620
Blanchet J (2009) Efficient importance sampling for binary contingency tables. Ann Appl Prob 19:949–982
Bunea F, Besag J (2000) MCMC in $I \times J \times K$ contingency tables, Fields Institute Communications, 26. American Mathematical Society, Providence
Chen Y, Diaconis P, Holmes SP, Liu JS (2005) Sequential Monte Carlo methods for statistical analysis of tables. J Am Stat Assoc 100:109–120
Cox LH (2007) Contingency tables of network type: models, Markov basis and applications. Statistica Sinica 17:1371–1393
De Loera JA, Haws D, Hemmecke R, Huggins P, Tauzer J, Yoshida R (2003) A User’s Guide for LattE, 1, software package LattE is available at http://www.math.ucdavis.edu/~latte/
Diaconis P, Efron B (1985) Testing for independence in a two-way table: new interpretation of the chi-square statistic. Ann Stat 13:845–874
Diaconis P, Gangolli A (1995) Rectangular arrays with fixed marginals. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, vol 72. Springer, New York, pp 15–42
Diaconis P, Holmes S (1995) Three examples of Monte–Carlo chains: at the interface of statistical computing, computer science, and statistical mechanics. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, vol 72. Springer, New York, pp 43–56
Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Ann Stat 26:363–397
Dyer M, Kannan R, Mount J (1997) Sampling contingency tables. Rand Struct Algorithms 10:487–506
Fienberg SE (2007) The analysis of cross-classified categorical data, 2nd edn. Springer, New York
Fishman GS (2012) Counting contingency tables via multistage Markov Chain Monte Carlo. J Comput Graph Stat 21:713–738
Goodman LA (1979) Simple models for the analysis of association in cross-classifications having ordered categories. J Am Stat Assoc 74:537–552
Jerrum M, Valiant L, Vazirani V (1986) Random generation of combinatorial structures from a uniform distribution. Theor Comput Sci 43:169–188
Kalantari B, Lari I, Rizzi A, Simeone B (1993) Sharp bounds for the maximum of the chi-square index in a class of contingency tables with given marginals. Comput Stat Data Anal 16:19–34
Koch G, Amara J, Atkinson S, Stanish W (1983) Overview of categorical data analysis methods. SAS-SUGI 8:785–795
Liu JS, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. J Am Stat Assoc 93:1032–1044
Mango A (2003) On the normalization of $\chi ^2$ base contingency tables. Dev Appl Stat 19:11–123
Rapallo F, Yoshida R (2010) Markov bases and subbases for bounded contingency tables. Ann Inst Stat Math 62:785–805
Sabatti C (2002) Measuring dependency with volume tests. Am Stat 56:1–5
SAS, Example 3.6 Output Data Set of Chi-Square Statistics, http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm
Štefankovič D, Vempala S, Vigoda E (2009) Adaptive simulated annealing: a near-optimal connection between sampling and counting. J Assoc Comput Mach 56(3):1–36
Zipunnikov V, Booth JG, Yoshida R (2009) Counting tables using the double-saddlepoint approximation. J Comput Graph Stat 18:915–929


Author information

Authors and Affiliations

Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, 27599, USA
George S. Fishman


Corresponding author

Correspondence to George S. Fishman.

Additional information

Professor emeritus of operations research. The author is grateful to David Rubin for helpful discussions on computing a lower bound for $g_{\max }(\mathcal{A})$ in Sect. 2.4, to Vadim Zipunnikov for his guidance in executing the double-saddlepoint approximation, and to Yuguo Chen for kindly providing the R code for sequential importance sampling.

Appendix

Algorithm GMAX first generates a greedy solution $\mathbf{x}$, satisfying row and column constraints, by sequentially maximizing the contribution of each of the $a\times b$ cells in the table. It then computes

$$\begin{aligned} \chi ^2_{\mathrm{old}}:= m\left( \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-1\right) . \end{aligned}$$
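
For readers more used to the Pearson form of the statistic, the short derivation below shows why the expression above agrees with it. It assumes the usual expected counts $e_{ij}:=m_{i.}m_{.j}/m$ under independence, a symbol introduced here only for illustration, and uses $\sum _{i,j}x_{ij}=\sum _{i,j}e_{ij}=m$.

$$\begin{aligned} \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{(x_{ij}-e_{ij})^2}{e_{ij}}&= \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{e_{ij}}-2\sum \limits _{i=1}^a\sum \limits _{j=1}^b x_{ij}+\sum \limits _{i=1}^a\sum \limits _{j=1}^b e_{ij}\\&= m\sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-2m+m\\&= m\left( \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-1\right) . \end{aligned}$$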

Using the definitions

$$\begin{aligned} p(A)&:= mA\left( \frac{2x_{ij}+A}{m_{i.}m_{.j}}-\frac{2x_{il}-A}{m_{i.}m_{.l}} -\frac{2x_{kj}-A}{m_{k.}m_{.j}}+\frac{2x_{kl}+A}{m_{k.}m_{.l}}\right) ,\qquad A_{\min }\le A\le A_{\max },\\ B&:= -2\left( \frac{\frac{x_{ij}}{m_{i.}m_{.j}}-\frac{x_{il}}{m_{i.}m_{.l}}-\frac{x_{kj}}{m_{k.}m_{.j}}+\frac{x_{kl}}{m_{k.}m_{.l}}}{\frac{1}{m_{i.}m_{.j}}+\frac{1}{m_{i.}m_{.l}}+\frac{1}{m_{k.}m_{.j}}+\frac{1}{m_{k.}m_{.l}}}\right) ,\\ A_{\min }&:= -\min [x_{ij},x_{kl},\min (m_{i.},m_{.l})-x_{il},\min (m_{k.},m_{.j})-x_{kj}],\\ A_{\max }&:= \min [x_{il},x_{kj},\min (m_{i.},m_{.j})-x_{ij},\min (m_{k.},m_{.l})-x_{kl}], \end{aligned}$$

it then iteratively augments $\chi ^2_{\mathrm{old}}$ by sequentially maximizing the contributions made by every allowable four-cell combination $(i,j),~(i,l),~(k,j),~(k,l)$ where $i\ne k$ and $j\ne l$. If $B<0$, $A=A_{\min }$ maximizes the increment; if $B>0$, $A=A_{\max }$ maximizes the increment.
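
A single four-cell step of the kind described above can be sketched as follows. This is an illustrative reading of the definitions, not the author's C code: the function names are hypothetical, and instead of applying the sign test on $B$ the sketch evaluates $p(A)$ at both endpoints and keeps the larger value, which is safe because $p(A)$ is a convex quadratic in $A$ and so attains its maximum over $[A_{\min },A_{\max }]$ at an endpoint.

```python
def chi_squared(x, row_sums, col_sums):
    """Chi-squared statistic m * (sum x_ij^2 / (m_i. m_.j) - 1)."""
    m = sum(row_sums)
    s = sum(x[i][j] ** 2 / (row_sums[i] * col_sums[j])
            for i in range(len(row_sums))
            for j in range(len(col_sums)))
    return m * (s - 1.0)


def best_four_cell_move(x, row_sums, col_sums, i, j, k, l):
    """Return (A, p(A)) for the endpoint of [A_min, A_max] with the larger
    chi-squared increment on cells (i,j), (i,l), (k,j), (k,l)."""
    m = sum(row_sums)
    w_ij = 1.0 / (row_sums[i] * col_sums[j])
    w_il = 1.0 / (row_sums[i] * col_sums[l])
    w_kj = 1.0 / (row_sums[k] * col_sums[j])
    w_kl = 1.0 / (row_sums[k] * col_sums[l])

    def p(a):
        # Increment in chi-squared when x_ij and x_kl grow by a
        # while x_il and x_kj shrink by a.
        return m * a * ((2 * x[i][j] + a) * w_ij - (2 * x[i][l] - a) * w_il
                        - (2 * x[k][j] - a) * w_kj + (2 * x[k][l] + a) * w_kl)

    a_min = -min(x[i][j], x[k][l],
                 min(row_sums[i], col_sums[l]) - x[i][l],
                 min(row_sums[k], col_sums[j]) - x[k][j])
    a_max = min(x[i][l], x[k][j],
                min(row_sums[i], col_sums[j]) - x[i][j],
                min(row_sums[k], col_sums[l]) - x[k][l])
    best = max((a_min, a_max), key=p)
    return best, p(best)


def apply_move(x, i, j, k, l, a):
    """Return a new table with the four-cell move of size a applied."""
    y = [list(row) for row in x]
    y[i][j] += a
    y[k][l] += a
    y[i][l] -= a
    y[k][j] -= a
    return y


if __name__ == "__main__":
    rows, cols = (2, 2), (2, 2)
    table = [[1, 1], [1, 1]]
    a, gain = best_four_cell_move(table, rows, cols, 0, 0, 1, 1)
    moved = apply_move(table, 0, 0, 1, 1, a)
    print(a, gain, moved, chi_squared(moved, rows, cols))
```

Because the move adds and subtracts the same amount within each affected row and column, the margins are unchanged, and the new statistic equals the old one plus $p(A)$.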


About this article

Cite this article

Fishman, G.S. Counting subsets of contingency tables. Comput Stat 29, 159–187 (2014). https://doi.org/10.1007/s00180-013-0442-5


Received: 16 April 2012
Accepted: 22 July 2013
Published: 20 August 2013
Issue Date: February 2014
DOI: https://doi.org/10.1007/s00180-013-0442-5
