Abstract
In 2005, Chen et al. introduced a sequential importance sampling (SIS) procedure to analyze zero-one two-way tables with fixed marginal sums (row and column sums) via the conditional Poisson (CP) distribution. They showed that, compared with Markov chain Monte Carlo (MCMC)-based approaches, their importance sampling method is more efficient in terms of running time and also provides an easy and accurate estimate of the total number of contingency tables with fixed marginal sums. In this paper, we extend their result to zero-one multi-way (\(d\)-way, \(d \ge 2\)) contingency tables under the no \(d\)-way interaction model, i.e., with fixed \((d-1)\)-way marginal sums. We also show by simulations that the SIS procedure with the CP distribution works very well for estimating the number of zero-one three-way tables with given marginal sums under the no three-way interaction model, even with some rejections. Finally, we apply our method to Sampson's monks data set.
References
Blitzstein, J., Diaconis, P. (2010). A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Internet Mathematics, 6(4), 489–522.
Breiger, R., Boorman, S., Arabie, P. (1975). An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328–383.
Chen, Y. (2007). Conditional inference on tables with structural zeros. Journal of Computational and Graphical Statistics, 16(2), 445–467.
Chen, Y., Diaconis, P., Holmes, S., Liu, J. S. (2005). Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association, 100, 109–120.
Chen, Y., Dinwoodie, I., Sullivant, S. (2006). Sequential importance sampling for multiway tables. The Annals of Statistics, 34(1), 523–545.
De Loera, J., Haws, D., Hemmecke, R., Huggins, P., Tauzer, J., Yoshida, R. (2005). LattE, version 1.2. http://www.math.ucdavis.edu/~latte/.
De Loera, J., Onn, S. (2006). All linear and integer programs are slim 3-way transportation programs. SIAM Journal on Optimization, 17, 806–821.
Dinwoodie, I. H. (2008). Polynomials for classification trees and applications. Statistical and Applied Mathematical Sciences Institute Technical Report 2008-7.
Dinwoodie, I. H., Chen, Y. (2011). Sampling large tables with constraints. Statistica Sinica, 21, 1591–1609.
Garey, M. R., Johnson, D. S. (1979). Computers and intractability, a guide to the theory of NP-completeness. San Francisco: Freeman & Co.
Huber, M. (2006). Fast perfect sampling from linear extensions. Discrete Mathematics, 306, 420–428.
R-Project-Team. (2011). R project. GNU software. http://www.r-project.org/.
Sampson, S. (1969). Crisis in a cloister. Doctoral dissertation (unpublished).
Snijders, T. A. B. (1991). Enumeration and simulation methods for \(0-1\) matrices with given marginals. Psychometrika, 56, 397–417.
Acknowledgments
The authors would like to thank Drs. Stephen Fienberg and Yuguo Chen for useful conversations.
Appendix: Nonparametric bootstrap method
In this appendix, we explain how to use a nonparametric bootstrap method to obtain a \((1-\alpha )100\,\%\) confidence interval for \(|\varSigma |\). The number of bootstrap replications is fixed at \(B\), and the notation here is consistent with Sect. 2.
(1) Drawing pseudo data set
Concept In an SIS procedure with sample size \(N\), we get a sequence of random tables \(\mathbf{X_1}, \ldots , \mathbf{X_N}\). Define \(\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N\), where \(q(\mathbf{X})\) is the trial distribution. Then \(\mathbf{Y_1}, \ldots , \mathbf{Y_N}\) is a sequence of iid random variables, so it makes sense to consider the empirical distribution of \(\mathbf{Y_i}\), which is the nonparametric maximum likelihood estimator of the true distribution of \(\mathbf{Y_i}\) (in fact, since \(\mathbf{Y_i}\) can take only finitely many values, the empirical distribution is the maximum likelihood estimator of the true distribution). Draw a pseudo sample \(\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N}\) from the empirical distribution.
Algorithm Use the SIS procedure to get \(\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N\), which is simply a sequence of numbers. Draw \(N\) elements from this sequence with replacement.
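The resampling step can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the weight values and the function name `draw_pseudo_sample` are hypothetical, and a weight of 0 stands for a table rejected by the SIS procedure (\(\mathbf{X_i} \notin \varSigma\)).

```python
import random


def draw_pseudo_sample(weights, rng):
    """Draw len(weights) values from `weights` with replacement.

    This is sampling from the empirical distribution of the weights.
    """
    return rng.choices(weights, k=len(weights))


# Hypothetical SIS weights Y_i = 1{X_i in Sigma} / q(X_i);
# the zeros represent rejected tables.
weights = [120.0, 0.0, 95.5, 130.2, 0.0, 110.7]

rng = random.Random(0)  # seeded only to make the sketch reproducible
pseudo_sample = draw_pseudo_sample(weights, rng)
print(pseudo_sample)
```

The pseudo sample has the same length \(N\) as the original sequence, and every value in it is one of the original weights.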
(2) One Bootstrap replication
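Concept Consider the pseudo sample \(\mathbf{Y^{*}_{1}}, \ldots , \mathbf{Y^*_N}\) as a "new" sample from the empirical distribution. Then the cumulative distribution function (CDF) of \(\widehat{\theta }^*=T(\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N})\) is a consistent estimator of the CDF of \(\widehat{\theta }=T(\mathbf{Y_1}, \ldots , \mathbf{Y_N})\). Here, we can consider our estimator of \(|\varSigma |\):
\[\widehat{|\varSigma |}=\frac{1}{N}\sum _{i=1}^{N}\mathbf{Y_i}.\]
The \(cv^2\):
\[\widehat{cv^2}=\frac{\frac{1}{N-1}\sum _{i=1}^{N}\left( \mathbf{Y_i}-\widehat{|\varSigma |}\right) ^2}{\widehat{|\varSigma |}^2}.\]
Algorithm Treat the pseudo sample as a sample from the SIS and compute the statistics based on it. That is, this bootstrap replication can be obtained by:
\[\widehat{|\varSigma |}^{*}=\frac{1}{N}\sum _{i=1}^{N}\mathbf{Y^*_i},\qquad \widehat{cv^2}^{*}=\frac{\frac{1}{N-1}\sum _{i=1}^{N}\left( \mathbf{Y^*_i}-\widehat{|\varSigma |}^{*}\right) ^2}{\left( \widehat{|\varSigma |}^{*}\right) ^2}.\]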
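One bootstrap replication can be sketched in Python as follows. The sketch rests on two assumptions that are not spelled out in this appendix: the estimator of \(|\varSigma |\) is taken to be the sample mean of the weights \(\mathbf{Y_i}\), and \(cv^2\) is estimated by the sample variance (with an \(N-1\) denominator) divided by the squared sample mean. The function names and the weight values are hypothetical.

```python
import math
import random
import statistics


def sis_statistics(y):
    """Return (estimate of |Sigma|, estimate of cv^2) from SIS weights y.

    Assumption: |Sigma| is estimated by the sample mean of y, and cv^2 by
    the sample variance (N - 1 denominator) over the squared sample mean.
    """
    mean = statistics.fmean(y)
    if mean == 0.0:
        # Every table was rejected: cv^2 is undefined.
        return 0.0, math.nan
    return mean, statistics.variance(y) / mean ** 2


def bootstrap_replication(y, rng):
    """Resample y with replacement and recompute both statistics."""
    pseudo_sample = rng.choices(y, k=len(y))
    return sis_statistics(pseudo_sample)


# Hypothetical SIS weights; the zeros represent rejected tables.
weights = [120.0, 0.0, 95.5, 130.2, 0.0, 110.7]
sigma_star, cv2_star = bootstrap_replication(weights, random.Random(0))
print(sigma_star, cv2_star)
```

Computing the statistics through a single function for both the original sample and the pseudo sample keeps \(T\) identical in the two cases, which is what the consistency argument relies on.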
(3) Bootstrap t confidence interval
Concept Repeat the previous two steps until we get \(B\) bootstrap replications: \(\widehat{\theta _i}^{*1},\ldots ,\widehat{\theta _i}^{*B},\ i=1,2\). The empirical distribution of \(\widehat{\theta _i}^*\) is the nonparametric maximum likelihood estimator of the CDF of \(\widehat{\theta _i}^*\), and the latter is a consistent estimator of the CDF of \(\widehat{\theta _i}\). So, we can use the \(100(\alpha /2)\)th and \(100(1-\alpha /2)\)th percentiles of the empirical distribution as the endpoints of our confidence interval.
Algorithm Repeat the previous two steps \(B\) times. For \(\{ \widehat{|\varSigma |}^{*1}, \ldots , \widehat{|\varSigma |}^{*B} \}\), define \(\widehat{|\varSigma |}_{(a)}^*\) as the \(100a\)th percentile of the list of values. Then the bootstrap-t \((1-\alpha )100\,\%\) confidence interval for \(|\varSigma |\) is \([\widehat{|\varSigma |}_{(\alpha /2)}^*,\widehat{|\varSigma |}_{(1- \alpha /2)}^*]\). Similarly, we can get a confidence interval for \(cv^2\).
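The full interval construction can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: the estimator of \(|\varSigma |\) is taken to be the sample mean of the weights, percentiles use the nearest-rank rule (one of several common conventions), and the function names, weight values, and choice of \(B = 1000\) are hypothetical.

```python
import math
import random
import statistics


def percentile(sorted_values, a):
    """Nearest-rank 100a-th percentile of an ascending list, 0 < a <= 1."""
    n = len(sorted_values)
    rank = min(n, max(1, math.ceil(a * n)))
    return sorted_values[rank - 1]


def bootstrap_interval(y, num_replications, alpha, rng):
    """Interval for |Sigma| from percentiles of the bootstrap estimates.

    Assumption: |Sigma| is estimated by the sample mean of the weights y.
    """
    replications = sorted(
        statistics.fmean(rng.choices(y, k=len(y)))
        for _ in range(num_replications)
    )
    lower = percentile(replications, alpha / 2)
    upper = percentile(replications, 1 - alpha / 2)
    return lower, upper


# Hypothetical SIS weights; the zeros represent rejected tables.
weights = [120.0, 0.0, 95.5, 130.2, 0.0, 110.7]
lower, upper = bootstrap_interval(weights, 1000, 0.05, random.Random(0))
print(lower, upper)
```

An interval for \(cv^2\) follows the same pattern, with the mean replaced by the \(cv^2\) statistic inside the loop.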
About this article
Cite this article
Xi, J., Yoshida, R. & Haws, D. Estimating the number of zero-one multi-way tables via sequential importance sampling. Ann Inst Stat Math 65, 763–783 (2013). https://doi.org/10.1007/s10463-012-0392-7
DOI: https://doi.org/10.1007/s10463-012-0392-7