Estimating the number of zero-one multi-way tables via sequential importance sampling

Annals of the Institute of Statistical Mathematics

Abstract

In 2005, Chen et al. introduced a sequential importance sampling (SIS) procedure to analyze zero-one two-way tables with given fixed marginal sums (row and column sums) via the conditional Poisson (CP) distribution. They showed that, compared with Markov chain Monte Carlo (MCMC)-based approaches, their importance sampling method is more efficient in terms of running time and also provides an easy and accurate estimate of the total number of contingency tables with fixed marginal sums. In this paper, we extend their result to zero-one multi-way (\(d\)-way, \(d \ge 2\)) contingency tables under the no \(d\)-way interaction model, i.e., with fixed \(d-1\) marginal sums. We also show by simulations that the SIS procedure with the CP distribution works very well for estimating the number of zero-one three-way tables with given marginal sums under the no three-way interaction model, even with some rejections. Finally, we apply our method to Sampson’s monks data set.

References

  • Blitzstein, J., Diaconis, P. (2010). A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Internet Mathematics, 6(4), 489–522.

  • Breiger, R., Boorman, S., Arabie, P. (1975). An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328–383.

  • Chen, Y. (2007). Conditional inference on tables with structural zeros. Journal of Computational and Graphical Statistics, 16(2), 445–467.

  • Chen, Y., Diaconis, P., Holmes, S., Liu, J. S. (2005). Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association, 100, 109–120.

  • Chen, Y., Dinwoodie, I., Sullivant, S. (2006). Sequential importance sampling for multiway tables. The Annals of Statistics, 34(1), 523–545.

  • De Loera, J., Haws, D., Hemmecke, R., Huggins, P., Tauzer, J., Yoshida, R. (2005). LattE, version 1.2. http://www.math.ucdavis.edu/~latte/.

  • De Loera, J., Onn, S. (2006). All linear and integer programs are slim 3-way transportation programs. SIAM Journal on Optimization, 17, 806–821.

  • Dinwoodie, I. H. (2008). Polynomials for classification trees and applications. Statistical and Applied Mathematical Sciences Institute Technical Report 2008-7.

  • Dinwoodie, I. H., Chen, Y. (2011). Sampling large tables with constraints. Statistica Sinica, 21, 1591–1609.

  • Garey, M. R., Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. San Francisco: Freeman & Co.

  • Huber, M. (2006). Fast perfect sampling from linear extensions. Discrete Mathematics, 306, 420–428.

  • R-Project-Team. (2011). R project. GNU software. http://www.r-project.org/.

  • Sampson, S. (1969). Crisis in a cloister. Doctoral dissertation (unpublished).

  • Snijders, T. A. B. (1991). Enumeration and simulation methods for \(0-1\) matrices with given marginals. Psychometrika, 56, 397–417.

Acknowledgments

The authors would like to thank Drs. Stephen Fienberg and Yuguo Chen for useful conversations.

Author information

Corresponding author

Correspondence to Ruriko Yoshida.

Appendix: Nonparametric bootstrap method

In this section, we explain how to use a nonparametric bootstrap method to obtain a \((1-\alpha )100\,\%\) confidence interval for \(|\varSigma |\). Note that the number of bootstrap replications is fixed at \(B\), and the notation here is consistent with Sect. 2.

(1) Drawing a pseudo data set

Concept In an SIS procedure with sample size \(N\), we obtain a sequence of random tables \(\mathbf{X_1}, \ldots , \mathbf{X_N}\). Define \(\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N\), where \(q(\cdot )\) is the trial distribution. Then \(\mathbf{Y_1}, \ldots , \mathbf{Y_N}\) is a sequence of iid random variables, so it makes sense to consider the empirical distribution of the \(\mathbf{Y_i}\), which is the nonparametric maximum likelihood estimator of the true distribution of \(\mathbf{Y_i}\) (in fact, as \(\mathbf{Y_i}\) can take only finitely many values, the empirical distribution is the maximum likelihood estimator of the true distribution). Draw a pseudo sample \(\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N}\) from the empirical distribution.

Algorithm Use the SIS procedure to compute \(\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N\), which is simply a sequence of numbers. Draw \(N\) elements from this sequence with replacement.
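
The resampling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights in `y` are made-up numbers standing in for the \(\mathbf{Y_i}\) (zeros represent rejected tables, where the indicator is 0), and the function name `draw_pseudo_sample` is hypothetical, not taken from the paper.

```python
import random

def draw_pseudo_sample(y, rng):
    """Draw len(y) values from y with replacement (one pseudo sample Y*)."""
    return rng.choices(y, k=len(y))

# Hypothetical importance weights Y_i = 1{X_i in Sigma} / q(X_i).
# These numbers are invented for illustration; 0.0 marks a rejected table.
y = [0.0, 120.0, 95.5, 0.0, 130.25, 110.0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
y_star = draw_pseudo_sample(y, rng)
print(y_star)
```

Sampling with replacement from the observed values is the same as drawing iid values from their empirical distribution, which is why the pseudo sample has the same length \(N\) as the original.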

(2) One bootstrap replication

Concept Consider the pseudo sample \(\mathbf{Y^{*}_{1}}, \ldots , \mathbf{Y^*_N}\) as a "new" sample from the empirical distribution. Then the cumulative distribution function (CDF) of \(\widehat{\theta }^*=T(\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N})\) is a consistent estimator of the CDF of \(\widehat{\theta }=T(\mathbf{Y_1}, \ldots , \mathbf{Y_N})\). Here, we consider our estimator of \(|\varSigma |\):

$$\begin{aligned} \widehat{|\varSigma |}=\widehat{\theta _1}=T_1(\mathbf{Y_1}, \ldots , \mathbf{Y_N})=\frac{1}{N} \sum _{i = 1}^N \mathbf{Y_i} \end{aligned}$$

and our estimator of \(cv^2\):

$$\begin{aligned} \widehat{cv^{2}}=\widehat{\theta _{2}}=T_2(\mathbf{Y_1}, \ldots , \mathbf{Y_N})=\frac{\sum _{i = 1}^N \left\{ \mathbf{Y_{i}} - \left[\sum _{j =1}^N \mathbf{Y_{j}} \right] / N\right\} ^2 /(N - 1)}{\left\{ \left[ \sum _{j = 1}^N \mathbf{Y_{j}} \right] / N\right\} ^{2} } \end{aligned}$$

Algorithm Treat the pseudo sample as a sample from the SIS procedure and compute the statistics based on it. That is, one bootstrap replication is obtained by:

$$\begin{aligned} \widehat{|\varSigma |}^{*1} =\frac{1}{N} \sum _{i = 1}^N \mathbf{Y^*_i};\quad \widehat{cv^2}^{*1} =cv^2\, \text{ of}\, (\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N}) \end{aligned}$$
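
As a minimal sketch of the two statistics \(T_1\) and \(T_2\), the following Python functions compute the sample mean and the squared coefficient of variation with the \(N-1\) denominator used in the formula for \(\widehat{cv^2}\). The function names `estimate_count` and `cv2` and the sample values are assumptions made for illustration. Applying both functions to a pseudo sample gives one bootstrap replication.

```python
import statistics

def estimate_count(y):
    """T_1: sample mean of the weights, the SIS estimate of |Sigma|."""
    return sum(y) / len(y)

def cv2(y):
    """T_2: sample variance (N - 1 denominator) divided by the squared mean.

    Undefined (division by zero) if every weight is 0, i.e. if every
    sampled table was rejected; that case is not handled in this sketch.
    """
    mean = estimate_count(y)
    return statistics.variance(y) / mean ** 2

# Hypothetical pseudo sample, invented for illustration.
y_star = [1.0, 2.0, 3.0]
print(estimate_count(y_star))  # mean = 2.0
print(cv2(y_star))             # variance 1.0 / mean^2 4.0 = 0.25
```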

(3) Bootstrap-t confidence interval

Concept Repeat the previous two steps until we have \(B\) bootstrap replications: \(\widehat{\theta _i}^{*1},\ldots ,\widehat{\theta _i}^{*B},\ i=1,2\). The empirical distribution of these replications is the nonparametric maximum likelihood estimator of the CDF of \(\widehat{\theta _i}^*\), and the latter is a consistent estimator of the CDF of \(\widehat{\theta _i}\). So, we can use the \(100(\alpha /2)\)th and \(100(1-\alpha /2)\)th percentiles of the empirical distribution as the endpoints of our confidence interval.

Algorithm Repeat the previous two steps \(B\) times. For \(\{ \widehat{|\varSigma |}^{*1}, \ldots , \widehat{|\varSigma |}^{*B} \}\), define \(\widehat{|\varSigma |}_{(a)}^*\) as the \(100a\)th percentile of the list of values. Then the bootstrap-t \((1-\alpha )100\,\%\) confidence interval for \(|\varSigma |\) is \([\widehat{|\varSigma |}_{(\alpha /2)}^*,\widehat{|\varSigma |}_{(1- \alpha /2)}^*]\). Similarly, we can obtain a confidence interval for \(cv^2\).
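
Putting the three steps together, the interval can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `bootstrap_interval`, the default values of `B` and `alpha`, the sample weights, and the nearest-rank convention for percentiles are all assumptions (the text does not specify how percentiles are interpolated).

```python
import math
import random

def bootstrap_interval(y, statistic, B=1000, alpha=0.05, seed=0):
    """Percentile interval from B bootstrap replications of statistic(y)."""
    rng = random.Random(seed)
    n = len(y)
    # Steps (1) and (2), repeated B times: resample, then recompute.
    reps = sorted(statistic(rng.choices(y, k=n)) for _ in range(B))

    def percentile(a):
        # Nearest-rank percentile (an assumed convention), clamped to range.
        k = min(B - 1, max(0, math.ceil(a * B) - 1))
        return reps[k]

    # Step (3): take the alpha/2 and 1 - alpha/2 percentiles as endpoints.
    return percentile(alpha / 2), percentile(1 - alpha / 2)

def mean(y):
    return sum(y) / len(y)

# Hypothetical importance weights, invented for illustration.
y = [0.0, 120.0, 95.5, 0.0, 130.25, 110.0, 101.0, 98.0]
lo, hi = bootstrap_interval(y, mean, B=2000, alpha=0.05)
print(lo, hi)
```

Passing a different `statistic`, such as a function computing \(\widehat{cv^2}\), yields the corresponding interval for \(cv^2\) with the same procedure.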

About this article

Cite this article

Xi, J., Yoshida, R. & Haws, D. Estimating the number of zero-one multi-way tables via sequential importance sampling. Ann Inst Stat Math 65, 763–783 (2013). https://doi.org/10.1007/s10463-012-0392-7

  • DOI: https://doi.org/10.1007/s10463-012-0392-7
