Counting subsets of contingency tables

  • Original Paper
  • Computational Statistics

Abstract

We describe multistage Markov chain Monte Carlo (MSMCMC) procedures which, in addition to estimating the total number of contingency tables with given positive row and column sums, estimate the number, \(Q\), and the proportion, \(P\), of those tables that satisfy an additional, possibly nonlinear, constraint. Three options, A, B, and C, are studied. Options A and B exploit locally optimal statistical properties, whereas judicious assignment of a particular parameter of Option C allows estimation with approximately minimal standard error. Ten examples of varying dimensions and total entries illustrate and compare the procedures, where \(Q\) and \(P\) denote the number and proportion of chi-squared statistics less than a given value. For both small and large dimensional tables, the comparisons favor Options A and B for moderate \(P\) and Option C for small \(P\). Additional comparison with sequential importance sampling estimates favors the latter for small dimensional tables and moderate \(P\) but favors Option C for large dimensional tables for both small and moderate \(P\). The proposed options extend an earlier MSMCMC technique for estimating total count and, in principle, can be further extended to incorporate additional constraints.
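
For intuition about the quantities being estimated, the following minimal sketch enumerates every table with given margins by brute force and counts those whose chi-squared statistic falls below a threshold. It is not one of the MSMCMC options and is feasible only for very small tables. The function names (`compositions`, `tables`, `chi_squared`, `count_subset`) and the example margins are illustrative assumptions, not taken from the paper.

```python
def compositions(total, caps):
    """Yield tuples v with 0 <= v[j] <= caps[j] and sum(v) == total."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest


def tables(row, col):
    """Yield every nonnegative integer table with row sums `row` and column sums `col`."""
    if len(row) == 1:
        # The last row is forced: it must equal the remaining column sums.
        if row[0] == sum(col):
            yield (tuple(col),)
        return
    for first_row in compositions(row[0], col):
        remaining = tuple(c - v for c, v in zip(col, first_row))
        for rest in tables(row[1:], remaining):
            yield (first_row,) + rest


def chi_squared(x, row, col):
    """Chi-squared statistic m * (sum x_ij^2 / (m_i. m_.j) - 1); margins must be positive."""
    m = sum(row)
    s = sum(x[i][j] ** 2 / (row[i] * col[j])
            for i in range(len(row)) for j in range(len(col)))
    return m * (s - 1)


def count_subset(row, col, threshold):
    """Return (total count, Q, P) where Q counts tables with chi-squared < threshold."""
    total = q = 0
    for x in tables(tuple(row), tuple(col)):
        total += 1
        if chi_squared(x, row, col) < threshold:
            q += 1
    return total, q, q / total
```

With row sums (2, 2), column sums (2, 2), and threshold 1.0, `count_subset` finds three tables, of which only the table with every entry equal to 1 has a statistic below the threshold, so \(Q=1\) and \(P=1/3\).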

Notes

  1. See Diaconis and Gangolli (1995) for a review of other approximating methods.

  2. For additional discussion, see Bunea and Besag (2000) and Sabatti (2002).

  3. For discussions of the known limitations of SIS when applied to binary contingency tables, see Bezáková (2008), Bezáková et al. (2012), and Blanchet (2009).

  4. Minor adjustments are necessary at the boundaries.

  5. See Liu and Chen (1998) for the theoretical framework for SIS in a Monte Carlo setting.

  6. It is known that SIS has limitations for some binary contingency tables as the size of the table grows. See Bezáková (2008), Bezáková et al. (2012), and Blanchet (2009).

  7. For more on how the inverse-temperature schedule affects \(\xi _1,\ldots ,\xi _r\), see Štefankovič et al. (2009).

  8. Note that \(g(\mathbf{x})\le g_{\max }(\mathcal{A})\) for some \(\mathbf{x}\in \mathcal{X}\) does not imply \(\mathbf{x}\in \mathcal{A}\).

  9. For other heuristics, see Kalantari et al. (1993) and Mango (2003).

  10. Those for Examples A.1 through A.8 originally appeared in Fishman (2012). LattE macchiato software computed the exact counts for Examples A.1 through A.7 (De Loera et al. 2003), and the double saddlepoint method computed the approximations for Examples A.8, A.9, and A.10 (Zipunnikov et al. 2009).

    Table 3 MSMCMC estimated and exact number of contingency tables
  11. The C code, ct.chisq.c, is accessible at http://www.unc.edu~gfish/.

  12. The estimates were computed by using an augmented version of the C code used in Chen et al. (2005) for solely estimating \(|\mathcal{A}|\). The augmentation added negligible CPU time. For input, all tables were ordered from smallest to largest row sums and from largest to smallest column sums.

References

  • Andrews DF, Herzberg AM (1985) Data: a collection of problems from many fields for the student and research worker. Springer, New York

  • Aoki S, Hara H, Takemura A (2012) Markov bases in algebraic statistics. Springer, New York

  • Bezáková I (2008) Sampling binary contingency tables. Comput Sci Eng 10:26–31

  • Bezáková I, Sinclair A, Štefankovič A, Vigoda E (2012) Negative examples for sequential importance sampling of binary contingency tables. Algorithmica 54:606–620

  • Blanchet J (2009) Efficient importance sampling for binary contingency tables. Ann Appl Prob 19:949–982

  • Bunea F, Besag J (2000) MCMC in \(I \times J \times K\) contingency tables, Fields Institute Communications, 26. American Mathematical Society, Providence

  • Chen Y, Diaconis P, Holmes SP, Liu JS (2005) Sequential Monte Carlo methods for statistical analysis of tables. J Am Stat Assoc 100:109–120

  • Cox LH (2007) Contingency tables of network type: models, Markov basis and applications. Statistica Sinica 17:1371–1393

  • De Loera JA, Haws D, Hemmecke R, Huggins P, Tauzer J, Yoshida R (2003) A User’s Guide for LattE, 1, software package LattE is available at http://www.math.ucdavis.edu/~latte/

  • Diaconis P, Efron B (1985) Testing for independence in a two-way table: new interpretation of the chi-square statistic. Ann Stat 13:845–874

  • Diaconis P, Gangolli A (1995) Rectangular arrays with fixed marginals. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, Vol. no 72. Springer, New York, pp 15–42

  • Diaconis P, Holmes S (1995) Three examples of Monte–Carlo chains: at the interface of statistical computing, computer science, and statistical mechanics. In: Aldous D, Diaconis P, Spencer J, Steele JM (eds) Discrete probability and algorithms, IMA volumes in mathematics and its applications, Vol. no 72. Springer, New York, pp 43–56

  • Diaconis P, Sturmfels B (1998) Algebraic algorithms for sampling from conditional distributions. Ann Stat 26:363–397

  • Dyer M, Kannan R, Mount J (1997) Sampling contingency tables. Rand Struct Algorithms 10:487–506

  • Fienberg SE (2007) The analysis of cross-classified categorical data, 2nd edn. Springer, New York

  • Fishman GS (2012) Counting contingency tables via multistage Markov Chain Monte Carlo. J Comput Graph Stat 21:713–738

  • Goodman LA (1979) Simple models for the analysis of association in cross-classifications having ordered categories. J Am Stat Assoc 74:537–552

  • Jerrum M, Valiant L, Vazirani V (1986) Random generation of combinatorial structures from a uniform distribution. Theor Comput Sci 43:169–188

  • Kalantari B, Lari I, Rizzi A, Simeone B (1993) Sharp bounds for the maximum of the chi-square index in a class of contingency tables with given marginals. Comput Stat Data Anal 16:19–34

  • Koch G, Amara J, Atkinson S, Stanish W (1983) Overview of categorical data analysis methods. SAS-SUGI 8:785–795

  • Liu JS, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. J Am Stat Assoc 93:1032–1044

  • Mango A (2003) On the normalization of \(\chi ^2\) base contingency tables. Dev Appl Stat 19:11–123

  • Rapallo F, Yoshida R (2010) Markov bases and subbases for bounded contingency tables. Ann Inst Stat Math 62:785–805

  • Sabatti C (2002) Measuring dependency with volume tests. Am Stat 56:1–5

  • SAS, Example 3.6 Output Data Set of Chi-Square Statistics, http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/procstat_freq_sect030.htm

  • Štefankovič D, Vempala S, Vigoda E (2009) Adaptive simulated annealing: a near-optimal connection between sampling and counting. J Assoc Comput Mach 56(3):1–36

  • Zipunnikov V, Booth JG, Yoshida R (2009) Counting tables using the double-saddlepoint approximation. J Comput Graph Stat 18:915–929

Author information

Correspondence to George S. Fishman.

Additional information

Professor emeritus of operations research. The author is grateful to David Rubin for helpful discussions on computing a lower bound for \(g_{\max }(\mathcal{A})\) in Sect. 2.4, to Vadim Zipunnikov for his guidance in executing the double-saddlepoint approximation, and to Yuguo Chen for kindly providing the R code for sequential importance sampling.

Appendix

Algorithm GMAX first generates a greedy solution \(\mathbf{x}\), satisfying the row and column constraints, by sequentially maximizing the contribution of each of the \(a\times b\) cells in the table. It then computes

$$\begin{aligned} \chi ^2_{\footnotesize {\mathrm{old}}}:= m\left( \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-1\right) . \end{aligned}$$
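
As a check on this expression, the short derivation below shows that it agrees with the familiar Pearson form of the chi-squared statistic. The notation \(e_{ij}\) for the expected cell count is introduced here for illustration, and the derivation assumes that \(m\) denotes the table total, so that \(\sum _{i}\sum _{j}x_{ij}=\sum _{i}\sum _{j}e_{ij}=m\).

$$\begin{aligned} e_{ij}&:= \frac{m_{i.}m_{.j}}{m}\\ \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{(x_{ij}-e_{ij})^2}{e_{ij}}&= \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{e_{ij}}-2\sum \limits _{i=1}^a\sum \limits _{j=1}^b x_{ij}+\sum \limits _{i=1}^a\sum \limits _{j=1}^b e_{ij}\\&= m\sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-2m+m\\&= m\left( \sum \limits _{i=1}^a\sum \limits _{j=1}^b\frac{x_{ij}^2}{m_{i.}m_{.j}}-1\right) . \end{aligned}$$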

Using the definitions

$$\begin{aligned} p(A)&:= mA\left( \frac{2x_{ij}+A}{m_{i.}m_{.j}}-\frac{2x_{il}-A}{m_{i.}m_{.l}} -\frac{2x_{kj}-A}{m_{k.}m_{.j}}+\frac{2x_{kl}+A}{m_{k.}m_{.l}}\right) \\&\quad A_{\min }\le A\le A_{\max } \\ B&:= -2\left( \frac{\frac{x_{ij}}{m_{i.}m_{.j}}-\frac{x_{il}}{m_{i.}m_{.l}}-\frac{x_{kj}}{m_{k.}m_{.j}}+\frac{x_{kl}}{m_{k.}m_{.l}}}{\frac{1}{m_{i.}m_{.j}}+\frac{1}{m_{i.}m_{.l}}+\frac{1}{m_{k.}m_{.j}}+\frac{1}{m_{k.}m_{.l}}}\right) \\ A_{\min }&:= -\min [x_{ij},x_{kl},\min (m_{i.},m_{.l})-x_{il},\min (m_{k.},m_{.j})-x_{kj}]\\ A_{\max }&:= ~~\min [x_{il},x_{kj},\min (m_{i.},m_{.j})-x_{ij},\min (m_{k.},m_{.l})-x_{kl}], \end{aligned}$$

it then iteratively augments \(\chi ^2_{\footnotesize {\mathrm{old}}}\) by sequentially maximizing the contributions made by every allowable four-cell combination \((i,j),~(i,l),~(k,j),~(k,l)\), where \(i\ne k\) and \(j\ne l\). If \(B<0\), \(A=A_{\min }\) maximizes the increment; if \(B>0\), \(A=A_{\max }\) maximizes the increment.

Cite this article

Fishman, G.S. Counting subsets of contingency tables. Comput Stat 29, 159–187 (2014). https://doi.org/10.1007/s00180-013-0442-5
