Estimating the number of zero-one multi-way tables via sequential importance sampling

Xi, Jing; Yoshida, Ruriko; Haws, David

doi:10.1007/s10463-012-0392-7

Published: 30 December 2012

Annals of the Institute of Statistical Mathematics, Volume 65, pages 763–783 (2013)

Jing Xi¹,
Ruriko Yoshida² &
David Haws³

Abstract

In 2005, Chen et al. introduced a sequential importance sampling (SIS) procedure to analyze zero-one two-way tables with given fixed marginal sums (row and column sums) via the conditional Poisson (CP) distribution. They showed that, compared with Markov chain Monte Carlo (MCMC)-based approaches, their importance sampling method is more efficient in terms of running time and also provides an easy and accurate estimate of the total number of contingency tables with fixed marginal sums. In this paper, we extend their result to zero-one multi-way ($d$-way, $d \ge 2$) contingency tables under the no $d$-way interaction model, i.e., with fixed $d-1$ marginal sums. We also show by simulations that the SIS procedure with the CP distribution, used to estimate the number of zero-one three-way tables with given marginal sums under the no three-way interaction model, works very well even with some rejections. Finally, we apply our method to Sampson’s monks data set.

References

Blitzstein, J., Diaconis, P. (2010). A sequential importance sampling algorithm for generating random graphs with prescribed degrees. Internet Mathematics, 6(4), 489–522.
Breiger, R., Boorman, S., Arabie, P. (1975). An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328–383.
Chen, Y. (2007). Conditional inference on tables with structural zeros. Journal of Computational and Graphical Statistics, 16(2), 445–467.
Chen, Y., Diaconis, P., Holmes, S., Liu, J. S. (2005). Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association, 100, 109–120.
Chen, Y., Dinwoodie, I., Sullivant, S. (2006). Sequential importance sampling for multiway tables. The Annals of Statistics, 34(1), 523–545.
De Loera, J., Haws, D., Hemmecke, R., Huggins, P., Tauzer, J., Yoshida, R. (2005). LattE, version 1.2. http://www.math.ucdavis.edu/~latte/.
De Loera, J., Onn, S. (2006). All linear and integer programs are slim 3-way transportation programs. SIAM Journal on Optimization, 17, 806–821.
Dinwoodie, I. H. (2008). Polynomials for classification trees and applications. Statistical and Applied Mathematical Sciences Institute Technical Report 2008-7.
Dinwoodie, I. H., Chen, Y. (2011). Sampling large tables with constraints. Statistica Sinica, 21, 1591–1609.
Garey, M. R., Johnson, D. S. (1979). Computers and intractability, a guide to the theory of NP-completeness. San Francisco: Freeman & Co.
Huber, M. (2006). Fast perfect sampling from linear extensions. Discrete Mathematics, 306, 420–428.
R-Project-Team. (2011). R project. GNU software. http://www.r-project.org/.
Sampson, S. (1969). Crisis in a cloister. Doctoral dissertation (unpublished).
Snijders, T. A. B. (1991). Enumeration and simulation methods for $0-1$ matrices with given marginals. Psychometrika, 56, 397–417.

Acknowledgments

The authors would like to thank Drs. Stephen Fienberg and Yuguo Chen for useful conversations.

Author information

Authors and Affiliations

Statistics Department, University of Kentucky, 325 Multidisciplinary Science Building, Lexington, KY, 40506-0082, USA
Jing Xi
Statistics Department, University of Kentucky, 325D Multidisciplinary Science Building, Lexington, KY, 40506-0082, USA
Ruriko Yoshida
Computational Genetics, IBM, Thomas J. Watson Research Center, Yorktown Heights, NY, 10598, USA
David Haws

Corresponding author

Correspondence to Ruriko Yoshida.

Appendix: Nonparametric bootstrap method

In this section, we explain how to use a nonparametric bootstrap method to get a $(1-\alpha )100\,\%$ confidence interval for $|\varSigma |$. The number of bootstrap replications is fixed at $B$, and the notation here is consistent with Sect. 2.

(1) Drawing a pseudo data set

Concept In an SIS procedure with sample size $N$, we get a sequence of random tables $\mathbf{X_1}, \ldots , \mathbf{X_N}$. Define $\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N$, where $q(\mathbf{X})$ is the trial distribution; then $\mathbf{Y_1}, \ldots , \mathbf{Y_N}$ is a sequence of iid random variables. It therefore makes sense to consider the empirical distribution of the $\mathbf{Y_i}$, which is the nonparametric maximum likelihood estimator of the true distribution of $\mathbf{Y_i}$ (in fact, because $\mathbf{Y_i}$ can take only finitely many values, the empirical distribution is the maximum likelihood estimator of the true distribution). Draw a pseudo sample $\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N}$ from the empirical distribution.
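The reason the sample mean of the $\mathbf{Y_i}$ targets $|\varSigma |$ can be made explicit with a one-line calculation. It assumes, as is standard for importance sampling but not restated here, that $q(\mathbf{X})>0$ for every $\mathbf{X} \in \varSigma $; the sum runs over all tables the trial distribution can produce:

$$\begin{aligned} \mathbb{E}_q\left[ \mathbf{Y_i} \right] =\sum _{\mathbf{X}:\, q(\mathbf{X})>0} \frac{ \mathbb{I }_{\mathbf{X} \in \varSigma }}{q(\mathbf{X})}\, q(\mathbf{X}) =\sum _{\mathbf{X} \in \varSigma } 1 =|\varSigma | \end{aligned}$$

Tables outside $\varSigma $ (rejections) contribute zero to the sum, which is why they can stay in the sample as zeros without biasing the mean.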

Algorithm Use the SIS procedure to get $\mathbf{Y_i}=\frac{ \mathbb{I }_{\mathbf{X_i} \in \varSigma }}{q(\mathbf{X_i})},\ i=1,\ldots ,N$, which is simply a sequence of numbers. Draw $N$ elements from this sequence with replacement.
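The resampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `draw_pseudo_sample` and the weight values are made up, and the weights stand in for the numbers $\mathbf{Y_i}$ that an SIS run would produce.

```python
import random

def draw_pseudo_sample(y, rng):
    """Return len(y) values drawn from y with replacement."""
    return rng.choices(y, k=len(y))

# Made-up weights Y_i = 1{X_i in Sigma} / q(X_i); a 0.0 marks a rejected table.
y = [120.0, 0.0, 95.5, 130.25, 0.0, 110.0]
rng = random.Random(0)  # seeded only to make the sketch reproducible
y_star = draw_pseudo_sample(y, rng)
```

Drawing with replacement from the observed values is the same as sampling from their empirical distribution, so `y_star` has the same length as `y` and contains only values that occur in `y`.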

(2) One Bootstrap replication

Concept Treat the pseudo sample $\mathbf{Y^{*}_{1}}, \ldots , \mathbf{Y^*_N}$ as a "new" sample from the empirical distribution. Then the cumulative distribution function (CDF) of $\widehat{\theta }^*=T(\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N})$ is a consistent estimator of the CDF of $\widehat{\theta }=T(\mathbf{Y_1}, \ldots , \mathbf{Y_N})$. Here, we consider our estimator of $|\varSigma |$:

$$\begin{aligned} \widehat{|\varSigma |}=\widehat{\theta _1}=T_1(\mathbf{Y_1}, \ldots , \mathbf{Y_N})=\frac{1}{N} \sum _{i = 1}^N \mathbf{Y_i} \end{aligned}$$

and our estimator of $cv^2$:

$$\begin{aligned} \widehat{cv^{2}}=\widehat{\theta _{2}}=T_2(\mathbf{Y_1}, \ldots , \mathbf{Y_N})=\frac{\sum _{i = 1}^N \left\{ \mathbf{Y_{i}} - \left[\sum _{j =1}^N \mathbf{Y_{j}} \right] / N\right\} ^2 /(N - 1)}{\left\{ \left[ \sum _{j = 1}^N \mathbf{Y_{j}} \right] / N\right\} ^{2} } \end{aligned}$$

Algorithm Treat the pseudo sample as a sample from the SIS procedure and compute the statistics from it. That is, the first bootstrap replication is obtained by:

$$\begin{aligned} \widehat{|\varSigma |}^{*1} =\frac{1}{N} \sum _{i = 1}^N \mathbf{Y^*_i};\quad \widehat{cv^2}^{*1} =cv^2\, \text{ of}\, (\mathbf{Y^*_1}, \ldots , \mathbf{Y^*_N}) \end{aligned}$$
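The two statistics $T_1$ and $T_2$ translate directly into code. The sketch below is illustrative only: the function names `estimate_count` and `estimate_cv2` and the four-element pseudo sample are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import statistics

def estimate_count(y):
    """T_1: the sample mean of the weights, the SIS estimate of |Sigma|."""
    return statistics.fmean(y)

def estimate_cv2(y):
    """T_2: sample variance (divisor N - 1) divided by the squared sample mean."""
    mean = statistics.fmean(y)
    return statistics.variance(y) / mean ** 2

# Hypothetical pseudo sample; real values would come from the resampling step.
y_star = [2.0, 4.0, 4.0, 6.0]
count_star = estimate_count(y_star)  # mean is 4.0
cv2_star = estimate_cv2(y_star)      # variance 8/3 over mean^2 16, i.e. 1/6
```

`statistics.variance` uses the $N-1$ divisor, matching the formula for $\widehat{cv^2}$. Note that `estimate_cv2` divides by the squared mean, so it fails on a pseudo sample in which every table was rejected.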

(3) Bootstrap t confidence interval

Concept Repeat the previous two steps until we get $B$ bootstrap replications: $\widehat{\theta _i}^{*1},\ldots ,\widehat{\theta _i}^{*B},\ i=1,2$. The empirical distribution of the replications is the nonparametric maximum likelihood estimator of the CDF of $\widehat{\theta _i}^*$, and the latter is a consistent estimator of the CDF of $\widehat{\theta _i}$. So, we can use the $100(\alpha /2)$th and $100(1-\alpha /2)$th percentiles of the empirical distribution as the endpoints of our confidence interval.

Algorithm Repeat the previous two steps $B$ times. For $\{ \widehat{|\varSigma |}^{*1}, \ldots , \widehat{|\varSigma |}^{*B} \}$, define $\widehat{|\varSigma |}_{(a)}^*$ as the $100a$th percentile of the list of values. Then the bootstrap-t $(1-\alpha )100\,\%$ confidence interval for $|\varSigma |$ is $[\widehat{|\varSigma |}_{(\alpha /2)}^*,\widehat{|\varSigma |}_{(1- \alpha /2)}^*]$. Similarly, we can get a confidence interval for $cv^2$.
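Putting the three steps together gives a short routine. The following is a sketch under stated assumptions rather than the authors' implementation: the names `percentile` and `bootstrap_interval` are hypothetical, the weights are made up, and the nearest-rank percentile rule is one common convention that the text does not specify. The interval it returns is built from percentiles of the replications, exactly as in the algorithm above; this construction is usually called the percentile bootstrap interval.

```python
import math
import random
import statistics

def percentile(sorted_values, a):
    """Nearest-rank 100a-th percentile of an ascending list (assumed convention)."""
    rank = max(1, math.ceil(a * len(sorted_values)))
    return sorted_values[rank - 1]

def bootstrap_interval(y, statistic, B=1000, alpha=0.05, seed=None):
    """Interval from the alpha/2 and 1 - alpha/2 percentiles of B replications."""
    rng = random.Random(seed)
    # Each replication: resample N weights with replacement, then apply the statistic.
    reps = sorted(statistic(rng.choices(y, k=len(y))) for _ in range(B))
    return percentile(reps, alpha / 2), percentile(reps, 1 - alpha / 2)

# Made-up weights for illustration; real values would come from the SIS run.
y = [120.0, 0.0, 95.5, 130.25, 0.0, 110.0, 101.0, 98.75]
low, high = bootstrap_interval(y, statistics.fmean, B=2000, alpha=0.05, seed=1)
```

Passing a function that computes $cv^2$ in place of `statistics.fmean` gives the corresponding interval for $cv^2$, provided no pseudo sample has mean zero.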

About this article

Cite this article

Xi, J., Yoshida, R. & Haws, D. Estimating the number of zero-one multi-way tables via sequential importance sampling. Ann Inst Stat Math 65, 763–783 (2013). https://doi.org/10.1007/s10463-012-0392-7

Received: 01 May 2012
Revised: 22 October 2012
Published: 30 December 2012
Issue Date: August 2013
DOI: https://doi.org/10.1007/s10463-012-0392-7
