Abstract
Statistical methods for clustered multinomial data with no specified underlying distribution are usually derived from quasi-likelihood methods. Although there is an extensive literature on this kind of data, few studies deal with unequal cluster sizes. This paper helps to fill this gap by proposing new estimators for the intracluster correlation coefficient.
References
Ahn, H., James, J.C.: Generation of over-dispersed and under-dispersed binomial variates. J. Comput. Graph. Stat. 4, 55–64 (1995)
Altham, P.M.E.: Discrete variable analysis for individuals grouped into families. Biometrika 63, 263–269 (1976)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Brier, S.S.: Analysis of contingency tables under cluster sampling. Biometrika 67, 591–596 (1980)
Budowle, B., Moretti, T.R.: Genotype profiles for six population groups at the 13 CODIS short tandem repeat core loci and other PCR-based loci. Forensic Science Communications 1999 (1999). http://www.fbi.gov/about-us/lab/forensic-science-communications/fsc/july1999/budowle.htm
Cohen, J.E.: The distribution of the chi-squared statistic under clustered sampling from contingency tables. J. Am. Stat. Assoc. 71, 665–670 (1976)
Cressie, N., Pardo, L.: Minimum \(\phi \)-divergence estimator and hierarchical testing in loglinear models. Statistica Sinica 10, 867–884 (2000)
Cressie, N., Pardo, L.: Model checking in loglinear models using \(\phi \)-divergences and MLEs. J. Stat. Plan. Inference 103, 437–453 (2002)
Cressie, N., Pardo, L., Pardo, M.C.: Size and power considerations for testing loglinear models using \(\phi \)-divergence test statistics. Statistica Sinica 13, 555–570 (2003)
Fienberg, S.E., Rinaldo, A.: Maximum likelihood estimation in log-linear models. Ann. Stat. 40, 996–1023 (2012)
Grizzle, J.E., Starmer, C.F., Koch, G.G.: Analysis of categorical data by linear models. Biometrics 25, 489–504 (1969)
Haberman, S.J.: The Analysis of Frequency Data. University of Chicago Press, Chicago (1974)
Hall, D.B.: Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56, 1030–1039 (2000)
Martín, N., Pardo, L.: New families of estimators and test statistics in log-linear models. J. Multivar. Anal. 99, 1590–1609 (2008a)
Martín, N., Pardo, L.: Minimum phi-divergence estimators for loglinear models with linear constraints and multinomial sampling. Stat. Pap. 49, 15–36 (2008b)
Martín, N., Pardo, L.: A new measure of leverage cells in multinomial loglinear models. Commun. Stat. 39, 517–530 (2010)
Martín, N., Pardo, L.: Fitting DNA sequences through log-linear modelling with linear constraints. Statistics 45, 605–621 (2011)
Martín, N., Pardo, L.: Poisson loglinear modeling with linear constraints on the expected cell frequencies. Sankhya 74B, 238–267 (2012)
Menéndez, M.L., Morales, D., Pardo, L., Vajda, I.: Divergence-based estimation and testing of statistical models of classification. J. Multivar. Anal. 54, 329–354 (1995)
Menéndez, M.L., Morales, D., Pardo, L., Vajda, I.: About divergence-based goodness-of-fit tests in the Dirichlet-multinomial model. Commun. Stat. 25, 1119–1133 (1996)
Morel, J.G., Nagaraj, N.K.: A finite mixture distribution for modelling multinomial extra variation. Biometrika 80, 363–371 (1993)
Morel, J.G., Neerchal, N.K.: Overdispersion Models in SAS. SAS Press, Cary (2012)
Mosimann, J.E.: On the compound multinomial distributions, the multivariate \(\beta \)-distribution and correlation among proportions. Biometrika 49, 65–82 (1962)
Neerchal, N.K., Morel, J.G.: Large cluster results for two parametric multinomial extra variation models. J. Am. Stat. Assoc. 93, 1078–1087 (1998)
Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
Raim, A.M.: Computational Methods for Finite Mixtures Using Approximate Information and Regression Linked to the Mixture Mean. PhD Thesis, University of Maryland (2014)
Raim, A.M., Neerchal, N.K., Morel, J.G.: Modeling overdispersion in \(R\). Technical Report HPCI-2015-1 UMBCH High Performance Computing Facility, University of Maryland (2015)
Vos, P.W.: Minimum f-divergence estimators and quasi-likelihood functions. Ann. Inst. Stat. Math. 44, 261–279 (1992)
Wedderburn, R.W.M.: Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61, 439–447 (1974)
Weir, B.S., Hill, W.G.: Estimating F-statistics. Annu. Rev. Genet. 36, 721–750 (2002)
Acknowledgments
We would like to thank the referees for their helpful comments and suggestions. This research is supported by the Spanish Grant MTM2012-33740 from Ministerio de Economia y Competitividad.
Appendix
1.1 Zero-inflated binomial distribution
The binomial distribution with zero inflation in the first cell, i.e., n-inflation in the second cell, is given by
Its first order moment vector is given by
The second-order moment matrix is derived as follows:
and hence
where
This result matches the one given in Morel and Neerchal (2012, p. 83). Let
be the multinomial distribution with zero inflation in the first \(M-1\) cells, i.e., n-inflation in the M-th cell.
For \(M\ge 3\), a single homogeneous intracluster correlation coefficient, \(\rho ^{2}\), does not seem to be an appropriate measure of the variability of this distribution, since the intracluster correlation appears to be heterogeneous across the cells. The reason is that, for \(M\ge 3\), the variance-covariance matrix of this multinomial distribution cannot be expressed as a matrix not depending on the parameters multiplied by a scalar that carries all the information about the parameters of the distribution.
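As a numerical check of the binomial case (\(M=2\)) of this section, the following minimal Python sketch computes the first two moments of a zero-inflated binomial count by exact enumeration and compares the variance with the overdispersed form \(np(1-p)\{1+\rho ^{2}(n-1)\}\). The parametrisation is an assumption, not necessarily the notation of the paper: `w` (inflation probability) and `pi` (success probability of the binomial component) are hypothetical names, with \(p=(1-w)\pi \) and \(\rho ^{2}=w\pi /(1-p)\) obtained by matching moments under that assumption.

```python
from math import comb


def zib_pmf(y, n, w, pi):
    """P(Y1 = y): with probability w the count is forced to 0, else Bin(n, pi).

    Assumed parametrisation (w and pi are illustrative names).
    """
    binom = comb(n, y) * pi**y * (1 - pi) ** (n - y)
    return (1 - w) * binom + (w if y == 0 else 0.0)


def zib_mean_var(n, w, pi):
    """Mean and variance of Y1 by exact enumeration over 0..n."""
    mean = sum(y * zib_pmf(y, n, w, pi) for y in range(n + 1))
    second = sum(y * y * zib_pmf(y, n, w, pi) for y in range(n + 1))
    return mean, second - mean**2


def overdispersed_var(n, w, pi):
    """n p (1 - p) {1 + rho^2 (n - 1)} with p and rho^2 matched to (w, pi)."""
    p = (1 - w) * pi            # marginal probability of the first cell
    rho2 = w * pi / (1 - p)     # implied intracluster correlation (assumption)
    return n * p * (1 - p) * (1 + rho2 * (n - 1))
```

For instance, `zib_mean_var(10, 0.3, 0.4)` and `overdispersed_var(10, 0.3, 0.4)` can be compared directly; the two variances should agree up to floating-point error.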
1.2 Proof of Theorem 3.2
Let
the matrix of quasi-variances and quasi-covariances of the simple random sample \(\varvec{Y}^{(1)},\ldots ,\varvec{Y}^{(N)}\) and
It is well-known that each diagonal element of \(\overline{\varvec{S} }_{\varvec{Y}}\) is a consistent estimator of each diagonal element of \(\vartheta _{n}n\varvec{\Sigma }_{\varvec{p}(\varvec{\theta })}\), i.e.,
and
It is not difficult to establish that
which is consistent for \(\mathrm {trace}(\vartheta _{n}n\varvec{\Sigma }_{\varvec{p}(\varvec{\theta })})=\vartheta _{n}n\sum _{r=1}^{M} p_{r}(\varvec{\theta })\left( 1-p_{r}(\varvec{\theta })\right) \). We know that the chi-square test-statistic \(X^{2}(\widetilde{\varvec{Y}})\), given in (3.3), has an asymptotic \(\mathcal {\chi }_{(N-1)(M-1)}^{2}\) distribution for a fixed number of clusters, N, and increasing cluster size, n, under the assumption of inter-cluster level homogeneity. However, this distribution is not useful for the proof. Instead, we establish (3.4) from the expression of the chi-square test-statistic, \(X^{2}(\widetilde{\varvec{Y} })\), in terms of the variance-covariance matrix, following the same steps used to obtain the expression and consistency of (8.2). We have
and
Hence,
and taking into account that \(\widehat{\varvec{p}}\) is a consistent estimator of \(\varvec{p}(\varvec{\theta })\), as \(N\rightarrow \infty \), as well as (8.1),
tends in probability to \(\vartheta _{n}\), as \(N\rightarrow \infty \). In other words,
In addition, taking into account (1.9), the right-hand side of (3.4) follows. Finally, we would like to mention that even though \(X^{2}(\widetilde{\varvec{Y}})\) and \(\vartheta _{n}(N-1)(M-1)\) have the same expectation for a fixed value of N, this proof is not trivial, since both \(\vartheta _{n}(N-1)(M-1)\) and \(X^{2}(\widetilde{\varvec{Y}})\) tend to infinity as \(N\rightarrow \infty \).
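The consistency statement can be illustrated by simulation. The sketch below is only an illustration under stated assumptions, not the estimator of the paper as such: it takes equal cluster sizes, a saturated model with pooled proportions playing the role of \(\widehat{\varvec{p}}\), the Pearson form \(\sum _{\ell }\sum _{r}(Y_{r}^{(\ell )}-n\widehat{p}_{r})^{2}/(n\widehat{p}_{r})\) for the chi-square statistic, and \(\vartheta _{n}=1+\rho ^{2}(n-1)\). Clusters are drawn from an n-inflated multinomial purely for convenience, and all function names are hypothetical.

```python
import random


def rmultinom(rng, n, p):
    """One multinomial draw M(n, p) as a list of cell counts."""
    counts = [0] * len(p)
    for cell in rng.choices(range(len(p)), weights=p, k=n):
        counts[cell] += 1
    return counts


def r_n_inflated(rng, n, rho2, p):
    """n-inflated multinomial: n * M(1, p) with probability rho2, else M(n, p)."""
    if rng.random() < rho2:
        return [n * y for y in rmultinom(rng, 1, p)]
    return rmultinom(rng, n, p)


def overdispersion_estimate(clusters, n):
    """X^2 / ((N - 1)(M - 1)) with pooled proportions (assumed Pearson form)."""
    N, M = len(clusters), len(clusters[0])
    p_hat = [sum(y[r] for y in clusters) / (N * n) for r in range(M)]
    x2 = sum(
        (y[r] - n * p_hat[r]) ** 2 / (n * p_hat[r])
        for y in clusters
        for r in range(M)
    )
    return x2 / ((N - 1) * (M - 1))


def simulate(seed, N, n, rho2, p):
    """Return (estimate, target) where target = 1 + rho2 * (n - 1)."""
    rng = random.Random(seed)
    clusters = [r_n_inflated(rng, n, rho2, p) for _ in range(N)]
    return overdispersion_estimate(clusters, n), 1 + rho2 * (n - 1)
```

Calling `simulate(1, 5000, 10, 0.25, [0.2, 0.3, 0.5])` returns the estimate together with the target value \(1+\rho ^{2}(n-1)\); as N grows, the two should get closer.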
1.3 Proof of Theorem 2.2
By the Central Limit Theorem, (3.1) holds. Hence, from Pardo (2006, formula (7.10)), the minimum phi-divergence estimator of \(\varvec{\theta }\) in a log-linear model satisfies
and the variance-covariance matrix of \(\sqrt{N}(\widehat{\varvec{\theta } }_{\phi }-\varvec{\theta }_{0})\) is
The last equality comes from
From the Taylor expansion of \(\varvec{p}(\widehat{\varvec{\theta } }_{\phi })\) around \(\varvec{p}(\varvec{\theta }_{0})\) we obtain
and the variance-covariance matrix of \(\sqrt{N}(\varvec{p} (\widehat{\varvec{\theta }}_{\phi })-\varvec{p}(\varvec{\theta } _{0}))\) is
Since \(\sqrt{N}\left( \widehat{\varvec{p}}-\varvec{p}\left( \varvec{\theta }_{0}\right) \right) \) is normal and centred, from (8.3) and (8.4), (2.8) is obtained. Similarly, since \(\sqrt{N}(\widehat{\varvec{\theta }}_{\phi }-\varvec{\theta }_{0})\) is normal and centred, from (8.5) and (8.6), (2.9) is obtained.
1.4 Derivation of Formula (4.4)
Multiplying (4.3) by \(\sqrt{N_{g}}n_{g}\big / \sum \limits _{h=1}^{G}n_{h}N_{h}\)
and hence, summing from \(g=1\) to G and using the independence of the clusters,
Finally multiplying the previous expression by \(\sum \nolimits _{h=1} ^{G}n_{h}N_{h}\big / \sqrt{\sum \nolimits _{g=1}^{G}n_{g}N_{g}\vartheta _{n_{g}}}\), the desired expression is obtained.
1.5 Algorithms for Dirichlet-multinomial, n-inflated and random-clumped distributions
The usual parameters of the M-dimensional random variable \(\varvec{Y} =(Y_{1},\ldots ,Y_{M})^{T}\) with Dirichlet-multinomial distribution are \(\varvec{\alpha }=\left( \alpha _{11},\ldots ,\alpha _{M1}\right) ^{T}\), where \(\alpha _{r1}=\frac{1-\rho ^{2}}{\rho ^{2}}p_{r}\left( \varvec{\theta }\right) , r=1,\ldots ,M\). For convenience, the distribution is parametrized here by \(\varvec{\beta }= {\begin{pmatrix} \rho ^{2}\\ \varvec{p}(\varvec{\theta }) \end{pmatrix}} , \varvec{p}\left( \varvec{\theta }\right) =\left( p_{1}\left( \varvec{\theta }\right) ,\ldots ,p_{M}\left( \varvec{\theta }\right) \right) ^{T}\), and \(\varvec{Y}\) is generated as follows:
- STEP 1. Generate \( B_{1}\sim Beta(\alpha _{11},\alpha _{12})\), with \(\alpha _{11}=\frac{1-\rho ^{2}}{\rho ^{2}}p_{1}\left( \varvec{\theta }\right) , \alpha _{12}=\frac{1-\rho ^{2}}{\rho ^{2}}(1-p_{1}\left( \varvec{\theta }\right) )\).
- STEP 2. Generate \(\left( Y_{1} |B_{1}=b_{1}\right) \sim Bin(n,b_{1})\).
- STEP 3. For \(r=2,\ldots ,M-1\) do:
  - Generate \(B_{r}\sim Beta(\alpha _{r1},\alpha _{r2})\), with \(\alpha _{r1} =\frac{1-\rho ^{2}}{\rho ^{2}}p_{r}\left( \varvec{\theta }\right) , \alpha _{r2}=\frac{1-\rho ^{2}}{\rho ^{2}}\left( 1-\sum _{h=1}^{r} p_{h}\left( \varvec{\theta }\right) \right) \).
  - Generate \(( Y_{r}|Y_{1}=y_{1},\ldots ,Y_{r-1} =y_{r-1},B_{r}=b_{r}) \sim Bin\left( n-\sum _{h=1}^{r-1} y_{h},b_{r}\right) \).
- STEP 4. Set \(\left( Y_{M}|Y_{1}=y_{1},\ldots ,Y_{M-1}=y_{M-1}\right) =n-\sum _{h=1}^{M-1}y_{h} \).
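The four steps above can be sketched in Python with the standard library only. This is a minimal illustration, assuming \(0<\rho ^{2}<1\) and a probability vector `p` with strictly positive entries summing to one; the function names and the Bernoulli-sum binomial sampler are choices made here, not part of the paper.

```python
import random


def rbinom(rng, trials, prob):
    """Draw Bin(trials, prob) as a sum of Bernoulli draws (stdlib only)."""
    return sum(rng.random() < prob for _ in range(trials))


def r_dirichlet_multinomial(rng, n, rho2, p):
    """One draw of Y via sequential beta-binomial sampling (STEPS 1-4)."""
    c = (1.0 - rho2) / rho2          # common factor (1 - rho^2) / rho^2
    y = []
    remaining = n                    # n minus the counts already assigned
    for r in range(len(p) - 1):      # cells 1, ..., M-1 (STEPS 1-3)
        # sum of later probabilities equals 1 - sum_{h<=r} p_h when p sums to 1
        tail = sum(p[r + 1:])
        b = rng.betavariate(c * p[r], c * tail)
        y_r = rbinom(rng, remaining, b)
        y.append(y_r)
        remaining -= y_r
    y.append(remaining)              # STEP 4: last cell takes what is left
    return y
```

A draw is obtained, for example, with `r_dirichlet_multinomial(random.Random(0), 10, 0.25, [0.2, 0.3, 0.5])`; the returned counts are non-negative and sum to `n` by construction.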
The random variable \(\varvec{Y}=(Y_{1},\ldots ,Y_{M})^{T}\) of the n-inflated multinomial distribution with parameters \(\varvec{\beta }, \varvec{p}\left( \varvec{\theta }\right) \), is generated as follows:
- STEP 1. Generate \(V\sim Ber(\rho ^{2})\).
- STEP 2. Generate
$$\begin{aligned} \left( \varvec{Y}|V=v\right) \sim \left\{ \begin{array}{ll} \mathcal {M}(n,\varvec{p}\left( \varvec{\theta }\right) ), & \text {if }v=0\\ n\mathcal {M}(1,\varvec{p}\left( \varvec{\theta }\right) ), & \text {if }v=1 \end{array} \right. . \end{aligned}$$
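The two steps for the n-inflated multinomial translate almost literally into code. The following Python sketch uses only the standard library; the function names are hypothetical and the multinomial sampler based on `random.choices` is an implementation choice made here.

```python
import random


def rmultinom(rng, n, p):
    """One multinomial draw M(n, p) as a list of cell counts."""
    counts = [0] * len(p)
    for cell in rng.choices(range(len(p)), weights=p, k=n):
        counts[cell] += 1
    return counts


def r_n_inflated(rng, n, rho2, p):
    """One draw of the n-inflated multinomial (STEPS 1-2)."""
    v = rng.random() < rho2                       # STEP 1: V ~ Ber(rho^2)
    if v:
        # STEP 2, v = 1: all n units fall in one cell chosen with probabilities p
        return [n * y for y in rmultinom(rng, 1, p)]
    # STEP 2, v = 0: ordinary multinomial draw
    return rmultinom(rng, n, p)
```

With `rho2` close to zero the draws behave like ordinary multinomial counts, whereas with `rho2` close to one almost every cluster is concentrated in a single cell.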
The random variable \(\varvec{Y}=(Y_{1},\ldots ,Y_{M})^{T}\) of the random clumped distribution with parameters \(\varvec{\beta }, \varvec{p} \left( \varvec{\theta }\right) \), is generated as follows:
- STEP 1. Generate \(\varvec{Y}_{0}=(Y_{01} ,\ldots ,Y_{0M})^{T}\sim \mathcal {M}(1,\varvec{p}\left( \varvec{\theta }\right) )\).
- STEP 2. Generate \(K_{1}\sim Bin(n,\rho )\).
- STEP 3. Generate \(\left( \varvec{Y}_{1}|K_{1}=k_{1}\right) =\big ( (Y_{11},\ldots ,Y_{1M})^{T} | K_{1}=k_{1}\big ) \sim \mathcal {M}(n-k_{1},\varvec{p}\left( \varvec{\theta }\right) )\).
- STEP 4. Set \(\left( \varvec{Y}|K_{1}=k_{1}\right) =\varvec{Y}_{0}k_{1}+\left( \varvec{Y}_{1}|K_{1}=k_{1}\right) \).
For the details about the equivalence of this algorithm and (1.12), see Morel and Nagaraj (1993).
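The random-clumped steps can be sketched in the same way. Note that STEP 2 uses \(\rho \), not \(\rho ^{2}\), as the binomial probability. The Python code below is a minimal illustration with hypothetical function names, using only the standard library.

```python
import random


def rbinom(rng, trials, prob):
    """Draw Bin(trials, prob) as a sum of Bernoulli draws (stdlib only)."""
    return sum(rng.random() < prob for _ in range(trials))


def rmultinom(rng, n, p):
    """One multinomial draw M(n, p) as a list of cell counts."""
    counts = [0] * len(p)
    for cell in rng.choices(range(len(p)), weights=p, k=n):
        counts[cell] += 1
    return counts


def r_random_clumped(rng, n, rho, p):
    """One draw of the random-clumped multinomial (STEPS 1-4)."""
    y0 = rmultinom(rng, 1, p)            # STEP 1: cell receiving the clump
    k1 = rbinom(rng, n, rho)             # STEP 2: clump size, Bin(n, rho)
    y1 = rmultinom(rng, n - k1, p)       # STEP 3: remaining units, M(n - k1, p)
    return [k1 * a + b for a, b in zip(y0, y1)]   # STEP 4: Y = Y0 * k1 + Y1
```

For example, `r_random_clumped(random.Random(0), 10, 0.5, [0.2, 0.3, 0.5])` returns one vector of counts summing to 10; here `rho = 0.5` corresponds to \(\rho ^{2}=0.25\).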
Note that the package “Modeling overdispersion in R” can be used to generate the distributions considered in this Appendix. For more details, see Raim et al. (2015).
Cite this article
Alonso-Revenga, J.M., Martín, N. & Pardo, L. New improved estimators for overdispersion in models with clustered multinomial data and unequal cluster sizes. Stat Comput 27, 193–217 (2017). https://doi.org/10.1007/s11222-015-9616-z
DOI: https://doi.org/10.1007/s11222-015-9616-z