Abstract
The Poisson distribution is a fundamental tool in categorical data analysis. This paper reviews conditional inference for the independent Poisson model. It is noted that the conditioning variable is not an ancillary statistic in the exact sense except under the product multinomial sampling scheme, whereas two versions of the ancillary property hold in general. These ancillary properties justify the use of conditional inference, as first proposed by R. A. Fisher and subsequently discussed by many researchers. The mixed coordinate system developed in information geometry is emphasized as an effective language for describing these facts.
1 Introduction
Conditional distributions of the independent Poisson model play a central role in categorical data analysis. For example, the multinomial distribution is obtained from the Poisson model by conditioning on the total count. The hypergeometric distribution is likewise obtained from the Poisson model by conditioning on the marginal counts. These conditioning variables are written as Ax, where x is the Poisson random vector and A is an integer matrix. The conditional distribution of x given Ax is called the conditional Poisson model [1] or the A-hypergeometric distribution [2]. See Sect. 2 for a more precise definition. Inference on the Poisson mean parameters is usually performed on the basis of the conditional distribution. This type of inference is referred to as conditional inference.
The best-known example is Fisher’s exact test of independence for \(2\times 2\) contingency tables, where the p-value of a test statistic is computed under the conditional probability given marginal counts. Computation of the p-value for more complicated hypotheses is one of the central topics in algebraic statistics [3,4,5]. For parameter estimation, the conditional maximum likelihood estimator is defined as the maximizer of the conditional likelihood [6, 7]. A confidence interval of the parameter is constructed from the conditional distribution of the estimator. Exact computation of the estimator is difficult in most cases and has been recently investigated in an algebraic manner [2, 8,9,10].
One might ask, why is conditioning so important?
The main reason is that the conditional likelihood does not depend on the nuisance parameter, as we will see in Sect. 2. Here, the nuisance parameter in the current problem refers to the marginal distribution of Ax. The existence of a nuisance parameter is problematic if its dimension is high. This is known as the Neyman–Scott problem [11, 12]. Conditional inference is effective in such cases.
Another reason is that the conditional distribution removes the effect of the data sampling scheme. In \(2\times 2\) contingency tables, the data are collected under various sampling schemes: no constraints, given the total, given the row (or column) marginals or given all the marginals. The underlying distribution of x changes from the independent Poisson distribution according to the scheme. However, the conditional distribution given all the marginals is common to all cases.
As noted in the first reason above, the conditional likelihood carries no information about the nuisance parameter. This leads to a natural question: how much information on the parameter of interest remains in the marginal distribution of Ax? The goal of this paper is to review the answers to this question from the viewpoint of ancillarity.
In general, a statistic is said to be ancillary if its marginal distribution has no information about the parameter of interest. It would be desirable if our conditioning variable Ax were ancillary. Unfortunately, this is not true in general [13], as investigated for \(2\times 2\) tables by [14,15,16,17]. We recall this fact in Sect. 4. On the other hand, Ax is shown to be asymptotically ancillary in the sense of Liang [18], where a conditional limit theorem established by [1, 2] is essential. We also note the ancillary criterion proposed by Godambe [19], which focuses on the space of estimating functions. Usefully, the mixed coordinate system developed in information geometry is quite effective for describing these results [13, 20, 21]. This is a particular example of parameter orthogonality [22], under which conditional inference works well in general.
The remainder of the paper is organized as follows. The Poisson model, together with the mixed coordinate system, is introduced in Sect. 2. Section 3 reviews the ancillary properties of general statistical models. The ancillary properties of the Poisson model are summarized in Sect. 4. A self-contained description of the mixed coordinate system and theorem proofs are provided in Appendices A and B, respectively.
2 Conditional inference of Poisson models
2.1 Definition
Let \({\mathbb {N}}\) and \({\mathbb {R}}_+\) be the sets of non-negative integers and positive real numbers, respectively. Consider an independent Poisson model
$$f(x;p)=\frac{p^x}{x!}e^{-1_n^\top p},\qquad x\in {\mathbb {N}}^n,$$
with the mean parameter \(p=(p_i)_{i=1}^n\in {\mathbb {R}}_+^n\), where the multi-index notation is adopted as \(p^x=\prod _i p_i^{x_i}\), \(x!=\prod _i x_i!\), and \(1_n=(1,\ldots ,1)^\top \in {\mathbb {N}}^n\). This model is an exponential family with the natural parameter \(\log p=(\log p_i)\in {\mathbb {R}}^n\) and the expectation parameter \(p\in {\mathbb {R}}_+^n\). We can read p as a “point” in the geometric sense. The maximum likelihood estimator of p, which maximizes f(x; p) with respect to p, is \({\hat{p}}=x\) whenever x is positive.
Many statistical models for categorical data have the form (e.g., [8])
$$\log p=A^\top \alpha +g(\theta ),\qquad (1)$$
where \(A\in {\mathbb {N}}^{d\times n}\) is a given matrix, \(g:{\mathbb {R}}^q\rightarrow {\mathbb {R}}^n\) is a given smooth function, \(\alpha \in {\mathbb {R}}^d\) is a nuisance parameter, and \(\theta \in {\mathbb {R}}^q\) is a parameter of interest. The model is called a log-affine model if g is affine, and a log-linear model if g is linear. The model is said to be saturated if the map \((\alpha ,\theta )\mapsto p\) is surjective.
Example 1
(Poisson regression) Let \(A=(1,\ldots ,1)\in {\mathbb {N}}^{1\times n}\) and \(g(\theta )=D^\top \theta \), where \(D^\top \in {\mathbb {R}}^{n\times q}\) is a design matrix. Then \(\alpha \) is a baseline and \(\theta \) represents regression coefficients. This model is not saturated if \(n>q+1\).
We next show an “unconventional” use of the log-linear model.
Example 2
(Fisher’s iris data) Table 1 gives the number \(x_{ij}\) of cases of Iris setosa that have sepal length \(L_i\) and sepal width \(W_j\), where the length scales are \(\{L_i\}_{i=1}^{100}=\{W_j\}_{j=1}^{100}=\{0.1,0.2,\ldots ,10.0\}\) in centimeters. The contingency table has 100 rows and 100 columns; the total number of cases is \(N=50\). The table is very sparse, as only 39 of 10,000 cells have a non-zero entry. Consider the statistical model
where \(\alpha _i,\beta _j\) are nuisance parameters and \(\theta \in {\mathbb {R}}\) is the parameter of interest. We set \(\beta _1=0\) without loss of generality. Then the dimension of the nuisance parameter is 199. This model is written in the form of (1) with an integer matrix \(A\in {\mathbb {N}}^{199\times 10000}\) and \(g_{ij}(\theta )=\theta L_iW_j\). We will see that the conditional maximum likelihood estimate of \(\theta \) exists; see Example 6. This is an example of the minimum information dependence modeling recently proposed by [23].
The conditional distribution of x given \(t=Ax\) is
$$f(x\mid t;p)=\frac{1}{Z(t;p)}\frac{p^x}{x!},\qquad Ax=t,\qquad (2)$$
where \(Z(t;p) = \sum _{Ay=t} p^y/y!\) is the normalizing constant. Here, by abuse of notation, we use the same symbol f for the joint and conditional distributions.
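When the counts are small, the fiber \(\{y\mid Ay=t\}\) and the normalizing constant Z(t; p) can be computed by brute-force enumeration. The following sketch is our own illustrative code, not from the paper; the configuration A and the vector t are hypothetical choices (they fix both row sums and the first column sum of a \(2\times 2\) table, matching the setting of later examples).

```python
import math
from itertools import product

# Illustrative configuration: rows of A fix (x1+, x2+, x+1) of a 2x2 table.
A = [(1, 1, 0, 0),
     (0, 0, 1, 1),
     (1, 0, 1, 0)]

def fiber(t, cap=10):
    """Non-negative integer solutions y of A y = t (brute force)."""
    return [y for y in product(range(cap + 1), repeat=4)
            if all(sum(a * yi for a, yi in zip(row, y)) == tk
                   for row, tk in zip(A, t))]

def Z(t, p):
    """Normalizing constant Z(t; p) = sum_{Ay=t} p^y / y!."""
    return sum(math.prod(pi ** yi / math.factorial(yi)
                         for pi, yi in zip(p, y)) for y in fiber(t))

def cond_prob(x, p):
    """Conditional Poisson probability f(x | t = Ax; p) from (2)."""
    t = tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)
    num = math.prod(pi ** xi / math.factorial(xi) for pi, xi in zip(p, x))
    return num / Z(t, p)
```

For \(t=(3,2,3)\) the fiber contains exactly three tables, and the conditional probabilities sum to one by construction.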
Definition 1
The conditional distribution (2) is called the conditional Poisson model or the A-hypergeometric distribution. Matrix A is called a configuration. We assume that A is of full row rank (i.e., of rank d) and that the row space of A contains \(1_n\) unless otherwise stated.
Lemma 1
(Chapter 1 of [1]) Assume the model (1). Then the conditional distribution (2) does not depend on \(\alpha \). In other words, \(t=Ax\) is a sufficient statistic for \(\alpha \).
Proof
Under the model (1), we have \(p^x= e^{\alpha ^\top t + g(\theta )^\top x}\), where \(t=Ax\). The factor depending only on t is canceled out in (2). \(\square \)
The lemma states that the conditional distribution does not depend on the nuisance parameter. This is one reason that the conditional distribution is important, as stated in Sect. 1. We sometimes use \(f(x\mid t;\theta )\) rather than \(f(x\mid t;p)\).
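Lemma 1 can be checked numerically: shifting the nuisance parameter, i.e., replacing p by \(p\,e^{A^\top \delta }\) componentwise, leaves the conditional distribution unchanged. The sketch below is our own illustration (the configuration, t, and all numbers are hypothetical).

```python
import math
from itertools import product

# Check of Lemma 1: p -> p * exp(A^T delta) leaves f(x | t; p) unchanged.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
t = (3, 2, 3)

def conditional(p):
    """Conditional distribution over the fiber {y : Ay = t}."""
    fib = [y for y in product(range(4), repeat=4)
           if all(sum(a * yi for a, yi in zip(row, y)) == tk
                  for row, tk in zip(A, t))]
    w = [math.prod(pi ** yi / math.factorial(yi) for pi, yi in zip(p, y))
         for y in fib]
    total = sum(w)
    return {y: wy / total for y, wy in zip(fib, w)}

p = (0.7, 1.3, 2.1, 0.4)
delta = (0.5, -1.0, 2.0)
# i-th component of A^T delta is sum_k delta_k * A[k][i]
p_shift = tuple(pi * math.exp(sum(dk * row[i] for dk, row in zip(delta, A)))
                for i, pi in enumerate(p))
```

On the fiber, the factor \(e^{\delta ^\top Ax}=e^{\delta ^\top t}\) is constant and cancels in the normalization, so the two conditional distributions agree exactly.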
Example 3
(Continuation) The conditional distribution for the iris data in Example 2 is
$$f(x\mid t;\theta )=\frac{e^{\theta \sum _{ij}L_iW_jx_{ij}}/x!}{\sum _y e^{\theta \sum _{ij}L_iW_jy_{ij}}/y!},\qquad (3)$$
where y ranges over the tables that have the same marginals as x. All the nuisance parameters are canceled out.
2.2 Mixed coordinate system
For a given configuration \(A\in {\mathbb {N}}^{d\times n}\) with \(n>d\), choose a matrix \(B\in {\mathbb {Z}}^{(n-d)\times n}\) of rank \(n-d\) such that \(AB^\top =0\). Matrix B is called the Gale transform of A in the theory of convex polytopes (see [2]). We will use the following coordinate system of p:
$$\psi (p)=B\log p\in {\mathbb {R}}^{n-d},\qquad \lambda (p)=Ap\in {\mathbb {R}}^{d}.$$
The map \((\psi ,\lambda ):{\mathbb {R}}_+^n\rightarrow {\mathbb {R}}^{n-d}\times {\mathbb {R}}^d\) actually defines a coordinate system, which is called the mixed coordinate system in information geometry (see [13, 20]). It is known that \(\psi (p)\) and \(\lambda (p)\) are orthogonal with respect to the Fisher information metric. Moreover, the range of \((\psi ,\lambda )\) is written as \(\Omega _\psi \times \Omega _\lambda ={\mathbb {R}}^{n-d}\times A{\mathbb {R}}_+^n\). See Appendix A for details regarding these facts. The symbols \(\psi \) and \(\lambda \) are used in accordance with [16].
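The one-to-one correspondence between p and \((\psi (p),\lambda (p))\) can be demonstrated numerically in the \(2\times 2\) case, where \(\psi \) is the log-odds ratio and \(\lambda \) consists of margins. The code below is our own sketch (the recovery routine by bisection is an assumption of convenience, exploiting monotonicity of the odds ratio in \(p_{11}\) for fixed margins).

```python
import math

# Mixed coordinates for a 2x2 table: psi = B log p, lambda = (p1+, p2+, p+1).
B = (1.0, -1.0, -1.0, 1.0)

def to_mixed(p):
    psi = sum(b * math.log(pi) for b, pi in zip(B, p))  # log-odds ratio
    lam = (p[0] + p[1], p[2] + p[3], p[0] + p[2])       # margins A p
    return psi, lam

def from_mixed(psi, lam):
    """Recover p from (psi, lambda) by bisection on p11.

    For fixed margins, log(p11*p22/(p12*p21)) is increasing in p11."""
    r1, r2, c1 = lam
    lo, hi = max(0.0, c1 - r2), min(r1, c1)
    for _ in range(200):
        u = (lo + hi) / 2
        val = math.log(u * (r2 - c1 + u)) - math.log((r1 - u) * (c1 - u))
        lo, hi = (lo, u) if val > psi else (u, hi)
    u = (lo + hi) / 2
    return (u, r1 - u, c1 - u, r2 - c1 + u)
```

The round trip p → (ψ, λ) → p recovers the original point, illustrating that the mixed coordinates are indeed a coordinate system.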
Since p and \((\psi (p),\lambda (p))\) have a one-to-one correspondence, we have the following lemma.
Lemma 2
(Chapter 8 of [13]) Consider a log-linear model
$$\log p=A^\top \alpha +B^\top (BB^\top )^{-1}\psi .\qquad (4)$$
This model is saturated, and the nuisance parameter \(\alpha \) is one-to-one with \(\lambda (p)\) for given \(\psi \).
Proof
Since the row spaces of A and B span \({\mathbb {R}}^n\), (4) is saturated. It can be immediately seen that \(B\log p=\psi \) since \(BA^\top =0\). Then the one-to-one correspondence between \(\alpha \) and \(\lambda (p)\) for given \(\psi =\psi (p)\) follows from the correspondences \((\alpha ,\psi )\leftrightarrow p\) and \(p\leftrightarrow (\psi (p),\lambda (p))\). \(\square \)
From the lemma, \(\lambda (p)\) is considered as the nuisance parameter and \(\psi (p)\) is the parameter of interest. The A-hypergeometric distribution (2) is then
$$f(x\mid t;\psi )\propto \frac{e^{\psi ^\top (BB^\top )^{-1}Bx}}{x!},$$
where the normalizing constant is omitted. The quantity \(e^{\psi (p)}\in {\mathbb {R}}_+^{n-d}\) is called the generalized odds ratio in [2].
where the normalizing constant is omitted. The quantity \(e^{\psi (p)}\in {\mathbb {R}}_+^{n-d}\) is called the generalized odds ratio in [2].
Example 4
(\(2\times 2\) contingency table) Let \(p=(p_{11},p_{12},p_{21},p_{22})\) represent a \(2\times 2\) contingency table. In many applications, the log-odds ratio \(\psi (p)=\log (p_{11}p_{22}/p_{12}p_{21})\) is the parameter of interest. The marginal distributions \(p_{i+}=\sum _j p_{ij}\) and \(p_{+j}=\sum _ip_{ij}\) are nuisance. In this case, matrices A and B are
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\\ 1&0&1&0\end{pmatrix},\qquad B=\begin{pmatrix}1&-1&-1&1\end{pmatrix}.$$
Indeed, \(\psi (p)=B\log p=\log (p_{11}p_{22}/p_{12}p_{21})\) and \(\lambda (p)=Ap=(p_{1+},p_{2+},p_{+1})\). Note that Ap determines all the marginals because \(p_{+2}=p_{1+}+p_{2+}-p_{+1}\). The A-hypergeometric distribution is
$$f(x\mid t;\psi )=\frac{e^{\psi x_{11}}/x!}{\sum _{Ay=t}e^{\psi y_{11}}/y!},$$
where \(x=(x_{11},x_{12},x_{21},x_{22})^\top \). If we write \(x_{11}=k,x_{12}=t_1-k,x_{21}=t_3-k,x_{22}=t_2-t_3+k\) with \(k\in {\mathbb {N}}\), then we have a more familiar form of the noncentral hypergeometric distribution
$$f(k\mid t;\psi )=\frac{e^{\psi k}/\{k!\,(t_1-k)!\,(t_3-k)!\,(t_2-t_3+k)!\}}{\sum _l e^{\psi l}/\{l!\,(t_1-l)!\,(t_3-l)!\,(t_2-t_3+l)!\}}.$$
Fisher’s exact test examines the hypothesis \(\psi =0\) against \(\psi \ne 0\). The p-value is calculated on the basis of the hypergeometric distribution for \(\psi =0\).
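For \(\psi =0\) the noncentral hypergeometric distribution reduces to the ordinary hypergeometric distribution, and the exact p-value can be computed by enumerating the fiber. The sketch below is our own illustration; the two-sided convention used (summing probabilities of tables no more likely than the observed one) is one common choice among several.

```python
from math import comb

def fisher_exact_p(table):
    """Two-sided Fisher exact p-value for a 2x2 table: sum the hypergeometric
    probabilities of all tables with the same margins whose probability does
    not exceed that of the observed table (a common convention)."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):  # hypergeometric probability that cell (1,1) equals k
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    ks = range(max(0, c1 - r2), min(r1, c1) + 1)
    p_obs = prob(a)
    return sum(prob(k) for k in ks if prob(k) <= p_obs + 1e-12)
```

For the margins \(t=(3,2,3)\) used in the next example, the fiber has probabilities 3/10, 6/10, 1/10 at \(k=1,2,3\) under \(\psi =0\), so observing the most extreme table \(k=3\) gives p-value 0.1.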
Table 2 compares symbols used in several papers for ease of reference.
2.3 Conditional maximum likelihood estimator
The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi (p)=B\log p\) is defined as a maximizer of \(f(x\mid t;p)\) with respect to \(\psi \) for given x. The estimator is, in general, different from the unconditional maximum likelihood estimator \({\hat{\psi }}_\mathrm{MLE}=B\log x\), as the following example shows. A sufficient condition for the coincidence of the two estimators is provided in Sect. 4.
Example 5
(Continuation) Consider again Example 4. If \(t=(3,2,3)^\top \), then possible outcomes of x are
$$x(1)=\begin{pmatrix}1&2\\ 2&0\end{pmatrix},\qquad x(2)=\begin{pmatrix}2&1\\ 1&1\end{pmatrix},\qquad x(3)=\begin{pmatrix}3&0\\ 0&2\end{pmatrix}$$
in the form of contingency tables.
in the form of contingency tables. If the observation is \(x=x(2)\), the conditional likelihood is
which is maximized at \({\hat{\psi }}=(1/2)\log 3\). This value is different from the unconditional maximum likelihood estimate \({\hat{\psi }}_\mathrm{MLE}=B\log x(2)=\log 2\). If the observation is \(x=x(1)\) or \(x=x(3)\), the conditional and unconditional maximum likelihood estimates do not exist.
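The conditional maximum likelihood estimate of this example can be reproduced numerically. The sketch below is our own code: the fiber for \(t=(3,2,3)\) is parametrized by \(k=x_{11}\in \{1,2,3\}\) with weights \(e^{\psi k}/x!\), and a generic derivative-free maximizer (golden-section search) is applied.

```python
import math

# Weights e^{psi k} / x! on the fiber of Example 5, indexed by k = x11.
# x(1)! = 1!2!2!0! = 4, x(2)! = 2!1!1!1! = 2, x(3)! = 3!0!0!2! = 12.
FACTORIALS = {1: 4, 2: 2, 3: 12}

def cond_loglik(psi, k_obs=2):
    w = {k: math.exp(psi * k) / f for k, f in FACTORIALS.items()}
    return math.log(w[k_obs] / sum(w.values()))

def argmax(f, lo=-5.0, hi=5.0, iters=200):
    """Golden-section search for the maximizer of a unimodal function."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2
```

The maximizer agrees with the closed form \({\hat{\psi }}=(1/2)\log 3\), obtained by solving \(e^{2\psi }=3\).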
The conditional maximum likelihood estimator for unsaturated models is defined as the maximizer of \(f(x\mid t;\theta )\) as well. Below is an example.
Example 6
(Continuation) For the iris data in Example 2, the conditional maximum likelihood estimate is approximately \({\hat{\theta }}=12.6\), where the Markov chain Monte Carlo method with a Markov basis is used to solve the likelihood equation
$$\sum _{ij}L_iW_jx_{ij}=E_\theta \Bigl [\sum _{ij}L_iW_jy_{ij}\Bigm |t\Bigr ]$$
numerically. See [4, 5] for the details of Markov bases. Indeed, the denominator of (3) has approximately \(50!\approx 3\times 10^{64}\) terms, and an exact computation seems impossible. For the same data, we can also fit a Gaussian model \(f(L,W)\propto \exp (\alpha _1 L + \alpha _2 L^2+\beta _1 W + \beta _2 W^2+\theta LW)\) to the sepal length L and width W of I. setosa, from which we obtain \({\hat{\theta }}_\mathrm{Gauss}=12.4\). The two estimates are surprisingly close. See [23] for other examples.
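The iris computation itself is beyond a short example, but the Markov-basis idea can be sketched on the \(2\times 2\) fiber of Example 5, where the single basic move \((+1,-1,-1,+1)\) connects all tables with the given margins. The code below is our own toy Metropolis sampler (all names and the step count are illustrative); it estimates the conditional expectation of \(x_{11}\) at \(\psi =0\), whose exact value is \(0.3+1.2+0.3=1.8\).

```python
import math
import random

# Toy Markov-basis MCMC on 2x2 tables with margins t = (3,2,3), psi = 0.
MOVE = (1, -1, -1, 1)  # basic move for the 2x2 independence model

def weight(x, psi=0.0):
    """Unnormalized conditional Poisson weight e^{psi*x11} / x!."""
    if min(x) < 0:
        return 0.0
    return math.exp(psi * x[0]) / math.prod(math.factorial(xi) for xi in x)

def mcmc_mean_x11(x0, steps=20000, psi=0.0, seed=1):
    rng = random.Random(seed)
    x, total = list(x0), 0
    for _ in range(steps):
        eps = rng.choice((1, -1))
        prop = [xi + eps * m for xi, m in zip(x, MOVE)]
        # Metropolis acceptance; invalid tables have weight 0 and are rejected
        if rng.random() < min(1.0, weight(prop, psi) / weight(x, psi)):
            x = prop
        total += x[0]
    return total / steps
```

Since the move preserves \(Ax\), the chain stays on the fiber, and the ergodic average approximates the exact conditional mean.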
One important class of unsaturated models is the hierarchical model for contingency tables. Aoki et al. [25] applied the hierarchical model to an analysis of stratified educational data, which is further analyzed by [8]. For other examples of unsaturated models, we refer to [9] for the Gibbs random partition and [26] for exponential permutation models.
The existence and uniqueness of the conditional maximum likelihood estimator for log-affine models are summarized as follows.
Lemma 3
(e.g. [27]) Let A be a configuration. Consider a log-affine model
$$\log p=A^\top \alpha +D^\top \theta +\nu _0,$$
where \(D\in {\mathbb {R}}^{q\times n}\) is a matrix such that the rows of A and D are linearly independent, and \(\nu _0\in {\mathbb {R}}^n\) is a fixed vector. Then for a given observation x, the conditional maximum likelihood estimate \({\hat{\theta }}\) exists if and only if Dx lies in the interior of the convex hull of \(\{D y\mid Ay=t,\ y\in {\mathbb {N}}^n\}\). If the estimate exists, it is unique.
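In the one-dimensional case \(q=1\), the convex hull of \(\{Dy\mid Ay=t\}\) is an interval and the interior condition is easy to check. The sketch below (our own code) applies it to the fiber of Example 5 with D picking out the \((1,1)\) cell, recovering the observation there that the estimate exists only for \(x(2)\).

```python
from itertools import product

# Lemma 3 with q = 1: existence iff Dx is strictly between the extremes
# of Dy over the fiber. D = (1, 0, 0, 0) picks out the (1,1) cell.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
t = (3, 2, 3)
fiber = [y for y in product(range(4), repeat=4)
         if all(sum(a * yi for a, yi in zip(row, y)) == tk
                for row, tk in zip(A, t))]
Dy = [y[0] for y in fiber]

def exists_cmle(x):
    return min(Dy) < x[0] < max(Dy)  # interior of the convex hull
```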
An extension of the maximum likelihood estimator that admits zero probabilities is discussed by [24, 27, 28]. See also [10] for issues of sufficient statistics caused by zero counts.
Although the conditional maximum likelihood estimator is mathematically characterized, its computation is not easy in most cases [29], as indicated by Example 6. Recently, exact computation of the estimator has been investigated in algebraic frameworks such as the holonomic gradient method. See [2, 8,9,10] for this direction. We do not discuss it here and focus on statistical properties in the sections that follow.
3 Ancillary statistics
In this section, we review the definitions and properties of ancillary statistics for general statistical models.
3.1 Exact sense
Consider a parametric model \(f(x; \psi ,\lambda )\) of probability density functions, where \(x\in \Omega _x\) denotes the data and \((\psi ,\lambda )\in \Omega _\psi \times \Omega _\lambda \) is a parameter. The parameter \(\psi \) is of interest and \(\lambda \) is a nuisance parameter. Throughout this section, we will assume that f is positive everywhere.
Definition 2
(Ancillary statistic; [30]) A statistic \(t=t(x)\) is said to be ancillary for \(\psi \) (in the exact sense) if the marginal density of t does not depend on \(\psi \) and the conditional density of x given t does not depend on \(\lambda \); that is,
$$f(x;\psi ,\lambda )=f(x\mid t;\psi )\,f(t;\lambda ).\qquad (5)$$
By abuse of notation, we use the same symbol f for the joint, conditional, and marginal densities.
If t is not given a priori, there exists an ambiguity in the choice of t [31]. We do not address this problem.
Example 7
Consider an independent and identically distributed sequence \(y_1,y_2,\ldots \) with the density function \(f(y_i;\psi )\). We observe the data up to a random time t so that the observed data are \(x=(y_1,\ldots ,y_t)\). Suppose that the probability mass function of t is \(f(t;\lambda )\). The likelihood function of x is
$$f(x;\psi ,\lambda )=f(t;\lambda )\prod _{i=1}^{t}f(y_i;\psi ),$$
which is of the form (5). Hence, t is ancillary for \(\psi \).
If t is ancillary for \(\psi \), then \(f(t;\lambda )\) has no information regarding \(\psi \). Thus, it is natural to use the conditional likelihood \(f(x\mid t;\psi )\) for inference on \(\psi \).
The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi \) based on \(f(x\mid t;\psi )\) coincides with the unconditional maximum likelihood estimator if t is ancillary. This immediately follows from Eq. (5). A controversial point may be which of the distributions, the conditional or unconditional sampling distribution of \({\hat{\psi }}\), should be used for interval estimation. The common practice is to use the conditional distribution since it does not depend on the nuisance parameter. The conditioning also avoids any unnecessary assumption concerning t for inferences of \(\psi \), which is Fisher’s original argument for conditional inference.
In the Bayesian method, any inference is derived from the posterior distribution. Suppose that the prior density is independent: \(\pi (\psi ,\lambda )=\pi (\psi )\pi (\lambda )\). Then, under the assumption that \(t=t(x)\) is ancillary, the posterior density is decomposed as
$$\pi (\psi ,\lambda \mid x)\propto f(x\mid t;\psi )\pi (\psi )\,f(t;\lambda )\pi (\lambda ),$$
which implies \(\pi (\psi ,\lambda \mid x)=\pi (\psi \mid x)\pi (\lambda \mid t)\) and \(\pi (\psi \mid x)\propto f(x\mid t;\psi )\pi (\psi )\). Hence, the inference on \(\psi \) is the same as that based on the conditional model.
3.2 Godambe’s sense
We provide a version of ancillarity introduced by [19]. Regularity conditions are not mentioned explicitly; see [19, 32, 33] for details.
Suppose that t is sufficient for \(\lambda \); that is,
$$f(x;\psi ,\lambda )=f(x\mid t;\psi )\,f(t;\psi ,\lambda ).$$
The Poisson model satisfies this condition (see Lemma 2).
If our interest is to estimate the parameter \(\psi \), it is reasonable to consider an estimating equation of the form
$$g(x;\psi )=0,$$
where \(g(x;\psi )\) is a vector-valued function referred to as an estimating function. For example, the estimating function providing the conditional maximum likelihood estimator is \(g(x;\psi )=\partial _\psi \log f(x\mid t;\psi )\). Note that different estimating functions may define the same estimator.
The estimating function is often assumed to be unbiased. That is,
$$\sum _x f(x;\psi ,\lambda )g(x;\psi )=0$$
for any \((\psi ,\lambda )\), because it implies consistency and asymptotic normality of the estimator in typical problems. Here, the sums need to be replaced with integrals for general sample spaces. The unbiasedness condition is rewritten as
$$\sum _t f(t;\psi ,\lambda )\sum _{x:\,t(x)=t}f(x\mid t;\psi )g(x;\psi )=0.\qquad (7)$$
Definition 3
(Godambe’s ancillarity) Suppose that \(t=t(x)\) is a sufficient statistic for \(\lambda \). Then t is said to be ancillary in Godambe’s sense (or in a complete sense) if t is a complete sufficient statistic for the marginal model \(\{f(t;\psi ,\lambda )\mid \lambda \in \Omega _\lambda \}\) for each fixed \(\psi \), where completeness means that a functional equation \(\sum _t f(t;\psi ,\lambda )h(t;\psi )=0\) for an integrable function h has only the trivial solution \(h=0\).
This definition is not an extension of the exact ancillarity defined in the preceding subsection. Indeed, even if \(f(t;\psi ,\lambda )\) does not depend on \(\psi \), the statistic t is not ancillary in Godambe’s sense unless t is complete.
Suppose that t is ancillary in Godambe’s sense and that \(g(x;\psi )\) is an unbiased estimating function. Then, from (7) and the completeness of t, we can deduce
$$\sum _{x:\,t(x)=t}f(x\mid t;\psi )g(x;\psi )=0\qquad (8)$$
for any \(\psi \), which means \(g(x;\psi )\) is also an unbiased estimating function with respect to the conditional density. Therefore, we can reduce the class of estimating functions to those that are unbiased with respect to the conditional model.
for any \(\psi \), which means \(g(x;\psi )\) is also an unbiased estimating function with respect to the conditional density. Therefore, we can reduce the class of estimating functions to those that are unbiased with respect to the conditional model.
The conditional maximum likelihood estimator is optimal in the sense of the following lemma, which is slightly modified from the original theorem of [19] for simplicity. We use \(M\succeq N\) for matrices M and N if \(M-N\) is positive semi-definite.
Lemma 4
[19] Let t be ancillary in Godambe’s sense. Then, for any unbiased estimating function g, the following inequality holds:
$$\bigl (E[\partial _\psi g^\top ]\bigr )^{-\top }E[gg^\top ]\bigl (E[\partial _\psi g^\top ]\bigr )^{-1}\succeq \bigl (E[ss^\top ]\bigr )^{-1},$$
where \(s=\partial _\psi \log f(x\mid t;\psi )\) and E denotes expectation with respect to \(f(x;\psi ,\lambda )\). The equality is attained if \(g=s\).
Proof
The proof is similar to that of the Cramér–Rao inequality. By differentiating the unbiasedness condition (8) and taking expectation with respect to t, we obtain \(E[sg^\top + \partial _\psi g^\top ] = 0\). The Cauchy–Schwarz inequality yields \((a^\top E[\partial _\psi g^\top ]b)^2\le (a^\top E[ss^\top ]a)(b^\top E[gg^\top ]b)\) for any vectors a and b. Put \(a=E[ss^\top ]^{-1}v\) and \(b=E[\partial _\psi g^\top ]^{-1}v\) to obtain the desired inequality. If \(g=s\), then \(E[ss^\top ]=-E[\partial _\psi s^\top ]\) and the equality follows. \(\square \)
Example 8
Consider a normal model \(x_i \sim \mathrm {N}(\lambda ,\psi ^2)\), \(1\le i\le n\), where \(\lambda \) is a nuisance and \(\psi \) is of interest. The mean statistic \({\bar{x}}=\sum _{i=1}^n x_i/n\) is sufficient for \(\lambda \), but not ancillary for \(\psi \) in the exact sense because the marginal distribution \(\mathrm {N}(\lambda ,\psi ^2/n)\) depends on \(\psi \). However, \({\bar{x}}\) is ancillary in Godambe’s sense. Indeed, the completeness of \({\bar{x}}\) on \(\lambda \) follows from a general fact regarding the exponential families (see Theorem 4.3.1 of [34]). The conditional maximum likelihood estimator is shown to be the sample variance \({\hat{\psi }}^2=(n-1)^{-1}\sum _i(x_i-{\bar{x}})^2\).
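The last claim of Example 8 can be checked numerically: conditionally on \({\bar{x}}\), the data live in an \((n-1)\)-dimensional Gaussian of the residuals, so the conditional log-likelihood in \(\psi \) is \(-(n-1)\log \psi -S/(2\psi ^2)\) with \(S=\sum _i(x_i-{\bar{x}})^2\). The sketch below is our own code with illustrative data; a grid search recovers the \((n-1)\)-divisor sample variance.

```python
import math

# Conditional MLE of psi given xbar for the normal model of Example 8.
x = [1.2, -0.7, 3.1, 0.4, 2.2]   # illustrative data
n = len(x)
xbar = sum(x) / n
S = sum((xi - xbar) ** 2 for xi in x)

def cond_loglik(psi):
    # log f(x | xbar; psi): (n-1)-dimensional Gaussian in the residuals
    return -(n - 1) * math.log(psi) - S / (2 * psi ** 2)

# Grid search over psi in (0, 5] with step 0.001.
best = max((i / 1000 for i in range(1, 5001)), key=cond_loglik)
```

The maximizer squared matches \(S/(n-1)\) up to the grid resolution.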
3.3 Asymptotic sense
We define asymptotic ancillarity according to [18]. Regularity conditions are not mentioned explicitly; see [18] for details.
For any statistical model \(f(x;\theta )\) and statistic \(t=t(x)\), the Fisher information metric tensor and its conditional counterpart are defined by
$$G^x(\theta )=\sum _x f(x;\theta )\bigl (\partial _\theta \log f(x;\theta )\bigr )\bigl (\partial _\theta \log f(x;\theta )\bigr )^\top ,\qquad G^{x\mid t}(\theta )=\sum _x f(x;\theta )\bigl (\partial _\theta \log f(x\mid t;\theta )\bigr )\bigl (\partial _\theta \log f(x\mid t;\theta )\bigr )^\top ,$$
respectively. The sums need to be replaced with integrals for general sample spaces. The Fisher information metric quantifies relative changes in the probability density against changes in the parameter. The decomposition
$$G^x=G^{x\mid t}+G^t$$
holds in general. The following lemma characterizes the ancillary property in terms of the Fisher information.
holds in general. The following lemma characterizes the ancillary property in terms of the Fisher information.
Lemma 5
(Chapter 7 of [35]) A statistic t is ancillary for \(\psi \) (in the exact sense) if and only if
where \(G_{\psi \psi }^{x\mid t}\) denotes the \(\psi \)-components of the matrix \(G^{x\mid t}\) and so on.
Proof
The only if part is straightforward. Conversely, assume (9). Then \(G_{\psi \psi }^t=G_{\psi \psi }^x-G_{\psi \psi }^{x\mid t}=0\) and \(G_{\lambda \lambda }^{x\mid t}=G_{\lambda \lambda }^x-G_{\lambda \lambda }^t=0\). From the definition of Fisher information, we deduce that \(\partial _\psi \log f(t)=0\) and \(\partial _\lambda \log f(x\mid t)=0\). Therefore, t is ancillary for \(\psi \). \(\square \)
Let us now consider a sequence of statistical models \(f_N(x;\psi ,\lambda )\) indexed by N. Let \(G_N^x\) and \(G_N^{x\mid t}\) be the Fisher information with respect to \(f_N\). Define the asymptotic Fisher information and its conditional counterpart by
$$\bar{G}^x(\theta )=\lim _{N\rightarrow \infty }\frac{G_N^x(\theta )}{N},\qquad \bar{G}^{x\mid t}(\theta )=\lim _{N\rightarrow \infty }\frac{G_N^{x\mid t}(\theta )}{N},$$
whenever they exist. We assume \({\bar{G}}^x(\theta )\) is positive definite to avoid uninteresting cases.
Definition 4
[18] A statistic \(t=t(x)\) is said to be asymptotically ancillary for \(\psi \) if the following relation holds:
$$\bar{G}_{\psi \psi }^{x\mid t}=\bar{G}_{\psi \psi }^{x}\quad \text {and}\quad \bar{G}_{\lambda \lambda }^{t}=\bar{G}_{\lambda \lambda }^{x}.$$
Example 9
(Continuation of Example 8) Let \(x_i\sim \mathrm {N}(\lambda ,\psi ^2)\) for \(1\le i\le N\). The Fisher information matrices of \((\psi ,\lambda )\) on \(x=(x_1,\ldots ,x_N)\) and \(t={\bar{x}}\) are
$$G^x=N\begin{pmatrix}2/\psi ^2&0\\ 0&1/\psi ^2\end{pmatrix},\qquad G^t=\begin{pmatrix}2/\psi ^2&0\\ 0&N/\psi ^2\end{pmatrix},$$
respectively. The asymptotic Fisher information matrices are
$$\bar{G}^x=\begin{pmatrix}2/\psi ^2&0\\ 0&1/\psi ^2\end{pmatrix},\qquad \bar{G}^{x\mid t}=\begin{pmatrix}2/\psi ^2&0\\ 0&0\end{pmatrix},\qquad \bar{G}^t=\begin{pmatrix}0&0\\ 0&1/\psi ^2\end{pmatrix}.$$
Thus, t is asymptotically ancillary because of the identity \({\bar{G}}^x={\bar{G}}^{x\mid t}+{\bar{G}}^t\).
4 Ancillary statistics of the Poisson model
Consider again the independent Poisson model with a configuration matrix A and its Gale transform B. The parameter of interest is \(\psi (p)=B\log p\) and the nuisance parameter is \(\lambda (p)=Ap\).
4.1 Conditions for exact ancillarity
We first give an example of ancillary statistics of the independent Poisson model.
Example 10
(Product binomial sampling) Consider a \(2\times 2\) contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \) and the configuration
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\end{pmatrix}.$$
Let \(t=Ax=(x_{1+},x_{2+})^\top \). Then the marginal and conditional distributions are
$$f(t;\lambda )=\prod _{i=1}^{2}\frac{\lambda _i^{t_i}e^{-\lambda _i}}{t_i!},\qquad f(x\mid t;\psi )=\prod _{i=1}^{2}\binom{t_i}{x_{i1}}\left( \frac{e^{\psi _i}}{1+e^{\psi _i}}\right) ^{x_{i1}}\left( \frac{1}{1+e^{\psi _i}}\right) ^{x_{i2}},$$
respectively, where \(\psi =(\psi _1,\psi _2)=(\log (p_{11}/p_{12}),\log (p_{21}/p_{22}))\) and \(\lambda =(\lambda _1,\lambda _2)=(p_{1+},p_{2+})\). Hence, t is ancillary for \(\psi \). This conditioning scheme is called the product binomial sampling because it is the product of binomial distributions (see, e.g., [14]).
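The exact ancillarity of the row sums can be checked numerically: two mean vectors with equal row sums but different within-row odds give identical marginals for t, which is the product of Poisson distributions with means \(\lambda _i=p_{i+}\) (a convolution identity). The code below is our own sketch with illustrative numbers.

```python
import math

# Marginal of t = (x1+, x2+) depends only on the row sums lambda = A p.
def marginal_t(p, t):
    p11, p12, p21, p22 = p

    def row_prob(a, b, s):
        # P(x_i1 + x_i2 = s) for independent Poisson(a), Poisson(b)
        return sum(a ** k / math.factorial(k)
                   * b ** (s - k) / math.factorial(s - k)
                   for k in range(s + 1)) * math.exp(-(a + b))

    return row_prob(p11, p12, t[0]) * row_prob(p21, p22, t[1])

p = (0.5, 1.5, 2.0, 1.0)  # row sums (2.0, 3.0)
q = (1.2, 0.8, 0.5, 2.5)  # same row sums, different within-row odds
```

The two marginals agree, and both equal the product-Poisson value with means 2 and 3, illustrating that t carries no information about \(\psi \).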
One may expect that the conditioning variable Ax for any configuration A is ancillary for \(\psi =B\log p\). However, this is not true in general, as the following example shows.
Example 11
(Example 2 of [15]) For the contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \), let \(t=Ax=(x_{1+},x_{2+},x_{+1})^\top \), where
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\\ 1&0&1&0\end{pmatrix}.$$
Let us check that the marginal distribution of t depends on \(\psi (p)\) so that t is not ancillary. The marginal distribution of t is
$$f(t;p)=\sum _{Ax=t}\frac{p^x}{x!}e^{-p_{++}}=e^{-p_{++}}Z(t;p),$$
where \(p_{++}=\sum _i\sum _j p_{ij}\). If \(t=(1,0,1)^\top \), then the possible outcome is \(x=(1,0,0,0)\) only, and so \(f(t;p) = p_{11}e^{-p_{++}}\). This means \(p_{11}e^{-p_{++}}\) is identifiable from the marginal distribution. Furthermore, \(p_{++}\) is also identifiable since the marginal distribution of \(t_1+t_2=x_{++}\) is the Poisson distribution with mean \(p_{++}\). Therefore, \(p_{11}\) is identifiable. By symmetry, we need all \(p_{ij}\) to parameterize f(t; p).
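The dependence on \(\psi \) can also be seen numerically: two mean vectors with identical margins Ap but different odds ratios assign different probabilities to \(t=(1,0,1)^\top \). The code below is our own illustration (the brute-force cap suffices because the counts are at most one here).

```python
import math
from itertools import product

# P(t) for the independent Poisson model, by enumerating {x : Ax = t}.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]

def marginal_t(p, t, cap=2):
    total = 0.0
    for x in product(range(cap + 1), repeat=4):
        if all(sum(a * xi for a, xi in zip(row, x)) == tk
               for row, tk in zip(A, t)):
            total += math.prod(pi ** xi * math.exp(-pi) / math.factorial(xi)
                               for pi, xi in zip(p, x))
    return total

p = (1.0, 1.0, 1.0, 1.0)  # odds ratio 1 (psi = 0)
q = (1.5, 0.5, 0.5, 1.5)  # same margins Ap, odds ratio 9
```

Only \(x=(1,0,0,0)\) is compatible with \(t=(1,0,1)^\top \), so \(f(t;p)=p_{11}e^{-p_{++}}\); the two probabilities \(e^{-4}\) and \(1.5e^{-4}\) differ, confirming non-ancillarity.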
These examples are generalized as follows. A proof is given in Appendix B.
Theorem 1
(Theorem 10.9 of [13]) Let x be distributed according to the independent Poisson model and A be a configuration. Suppose that the row space of A contains the all-one vector. Then Ax is ancillary for \(\psi =B\log p\) if and only if there exists an invertible matrix \(L\in {\mathbb {R}}^{d\times d}\) such that
$$A=L\begin{pmatrix}1_{n_1}^\top & & \\ & \ddots & \\ & & 1_{n_d}^\top \end{pmatrix},\qquad (11)$$
where \(1_{n_i}=(1,\ldots ,1)^\top \in {\mathbb {N}}^{n_i}\) and \(n_1,\ldots ,n_d\in {\mathbb {N}}\) with \(n_1+\cdots +n_d=n\).
Corollary 1
(Chapter 2 of [1]) If A is of the form (11), then the conditional and unconditional maximum likelihood estimators coincide.
The theorem shows that Ax is not ancillary in the exact sense unless A is the product multinomial sampling scheme. In the subsequent subsection, however, we demonstrate that other ancillary properties hold for any A.
4.2 Godambe’s and asymptotic ancillarity
The ancillarity of Ax in Godambe’s sense when A is the \(2\times 2\) independence model is shown by Godambe himself in [33]. The following theorem is derived in a similar manner. See Appendix B for an outline of the proof.
Theorem 2
[19] Consider the independent Poisson model together with any configuration matrix A. Then Ax is ancillary in Godambe’s sense for \(\psi =B\log p\).
We will proceed to show the asymptotic ancillarity. Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np, where p is a fixed positive vector. Denote the n-dimensional normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \) by \(\mathrm {N}_n(\mu ,\Sigma )\). Convergence in distribution is denoted as \(\displaystyle \mathop {\rightarrow }^\mathrm{d}\). Let \(D_p\) be the diagonal matrix with diagonal entries \(p=(p_i)_{i=1}^n\). We begin with a standard result.
Lemma 6
(e.g. [22]) Let \({\hat{p}}=x^{(N)}/N\). Denote the unconditional maximum likelihood estimator of \((\psi ,\lambda )\) by \(({\hat{\psi }},{\hat{\lambda }})=(B\log {\hat{p}},A{\hat{p}})\). Then we have
$$\sqrt{N}\begin{pmatrix}{\hat{\psi }}-\psi \\ {\hat{\lambda }}-\lambda \end{pmatrix}\mathop {\rightarrow }^\mathrm{d}\mathrm {N}_n\left( 0,\begin{pmatrix}BD_p^{-1}B^\top &0\\ 0&AD_pA^\top \end{pmatrix}\right) $$
as \(N\rightarrow \infty \).
Proof
This is a consequence of the classical central limit theorem. Indeed, from the reproducibility of the Poisson distribution, \(N^{1/2}({\hat{p}}-p)\) weakly converges to \(\mathrm {N}_n(0,D_p)\). Apply the delta method to the function \(\phi (x)=(B\log x,Ax)\) (see, e.g., Chapter 3 of [36]) and use \(AB^\top =0\). \(\square \)
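The block-diagonal limit rests on the vanishing cross term: the delta method gives asymptotic cross covariance \((BD_p^{-1})D_p A^\top =BA^\top =0\). The sketch below (our own code, with an illustrative p and the \(2\times 2\) configuration) checks this numerically.

```python
# Cross covariance of (psi-hat, lambda-hat): (B D_p^{-1}) D_p A^T = B A^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]
p = [0.7, 1.3, 2.1, 0.4]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

grad_psi = [[b / pi for b, pi in zip(B[0], p)]]          # B D_p^{-1}
Dp = [[p[i] if i == j else 0.0 for j in range(4)] for i in range(4)]
At = [[A[j][i] for j in range(3)] for i in range(4)]      # A^T
cross = matmul(matmul(grad_psi, Dp), At)                  # should be 0
```

The cross term vanishes for any positive p, which is exactly the asymptotic independence of \({\hat{\psi }}\) and \({\hat{\lambda }}\) used below.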
Since the asymptotic covariance of the maximum likelihood estimator is the inverse of the Fisher information matrix, we obtain from Lemma 6 that
$$\bar{G}^{x}=\begin{pmatrix}(BD_p^{-1}B^\top )^{-1}&0\\ 0&(AD_pA^\top )^{-1}\end{pmatrix}$$
in the mixed coordinates \((\psi ,\lambda )\).
To prove the asymptotic ancillarity of \(t^{(N)}=Ax^{(N)}\), we check the two conditions \({\bar{G}}_{\psi \psi }^x={\bar{G}}_{\psi \psi }^{x\mid t}\) and \({\bar{G}}_{\lambda \lambda }^x={\bar{G}}_{\lambda \lambda }^t\). The second condition immediately follows from the fact that \(t^{(N)}\) is a sufficient statistic for \(\lambda \). The first condition is also expected since \({\hat{\psi }}\) is asymptotically independent of \(t^{(N)}=N{\hat{\lambda }}\) by Lemma 6. In fact, the following result holds.
Theorem 3
[18] Let A be any configuration matrix and fix \(p\in {\mathbb {R}}_+^n\). Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np. Then the statistic \(Ax^{(N)}\) is asymptotically ancillary for \(\psi =B\log p\).
The proof of Theorem 3 is complicated due to the discrete nature of the conditioning variable. In Appendix B, we provide an outline of the proof with the help of Theorem 1.1 of [1] on the conditional central limit theorem; see also Theorem 4 of [2] for a refined result.
References
Haberman, S.J.: The Analysis of Frequency Data: Statistical Research Monographs. University of Chicago Press, Chicago (1974)
Takayama, N., Kuriki, S., Takemura, A.: \(A\)-Hypergeometric distributions and Newton polytopes. Adv. Appl. Math. 99, 109–133 (2018)
Diaconis, P., Sturmfels, B.: Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 26(1), 363–397 (1998)
Hibi, T.: Gröbner Bases: Statistics and Software Systems. Springer, Tokyo (2013)
Aoki, S., Hara, H., Takemura, A.: Markov Bases in Algebraic Statistics. Springer, New York (2012)
Harkness, W.: Properties of the extended hypergeometric distribution. Ann. Math. Stat. 36(3), 938–945 (1965)
Plackett, R.L.: The Analysis of Categorical Data, 2nd edn. Griffin, London (1981)
Ogawa, M.: Algebraic statistical methods for conditional inference of discrete statistical models. Ph.D. thesis, The University of Tokyo (2014)
Mano, S.: Partition structure and the \(A\)-hypergeometric distribution associated with the rational normal curve. Electron. J. Stat. 11, 4452–4487 (2017)
Tachibana, Y., Goto, Y., Koyama, T., Takayama, N.: Holonomic gradient method for two-way contingency tables. Algebraic Stat. 11(2), 125–153 (2020)
Neyman, J., Scott, E.L.: Consistent estimates based on partially consistent observations. Econometrica 16, 1–32 (1948)
Amari, S.: Information Geometry and its Applications. Springer, Tokyo (2016)
Barndorff-Nielsen, O.: Information and Exponential Families: in Statistical Theory. Wiley, New York (1978)
Little, R.J.A.: Testing the equality of two independent binomial proportions. Am. Stat. 43(4), 283–288 (1989)
Zhu, Y., Reid, N.: Information, ancillarity, and sufficiency in the presence of nuisance parameters. Can. J. Stat. 22(1), 111–123 (1994)
Reid, N.: The roles of conditioning in inference. Stat. Sci. 10(2), 138–157 (1995)
Choi, L., Blume, J.D., Dupont, W.D.: Elucidating the foundations of statistical inference with 2 × 2 tables. PLoS One 10(4), e0121263 (2015)
Liang, K.-Y.: The asymptotic efficiency of conditional likelihood methods. Biometrika 71(2), 305–313 (1984)
Godambe, V.P.: On ancillarity and Fisher information in the presence of a nuisance parameter. Biometrika 71(3), 626–629 (1984)
Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000)
Amari, S.: Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory 47(5), 1701–1711 (2001)
Cox, D.R., Reid, N.: Parameter orthogonality and approximate conditional inference. J. R. Stat. Soc. Ser. B (Methodol.) 49(1), 1–18 (1987)
Sei, T., Yano, K.: Minimum information dependence modeling. (2022). arXiv:2206.06792
Fienberg, S.E., Rinaldo, A.: Maximum likelihood estimation in log-linear models. Ann. Stat. 40(2), 996–1023 (2012)
Aoki, S., Otsu, T., Takemura, A., Numata, Y.: Statistical analysis of subject selection data in NCUEE examination. Ouyou Toukeigaku 39, 71–100 (2010)
Mukherjee, S.: Estimation in exponential families on permutations. Ann. Stat. 44(2), 853–875 (2016)
Rinaldo, A., Fienberg, S.E., Zhou, Y.: On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 3, 446–484 (2009)
Csiszár, I., Matúš, F.: Generalized maximum likelihood estimates for exponential families. Probab. Theory Relat. Fields 141(1), 213–246 (2008)
Agresti, A.: A survey of exact inference for contingency tables. Stat. Sci. 7(1), 131–153 (1992)
Cox, D.R., Hinkley, D.V.: Theoretical Statistics. Chapman and Hall/CRC, Boca Raton (1974)
Basu, D.: Recovery of ancillary information. Sankhyā Indian J. Stat. Ser. A 26(1), 3–16 (1964)
Godambe, V.P.: Conditional likelihood and unconditional optimum estimating equations. Biometrika 63(2), 277–284 (1976)
Godambe, V.P.: On sufficiency and ancillarity in the presence of a nuisance parameter. Biometrika 67(1), 155–162 (1980)
Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, New York (2005)
Amari, S.: Differential-Geometrical Methods in Statistics. Springer, Berlin (1985)
van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Csiszár, I.: I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 3(1), 146–158 (1975)
Acknowledgements
The author is grateful to the co-editor and two anonymous referees for their careful reading and insightful suggestions. He also thanks Mitsunori Ogawa for providing his PhD thesis and helpful comments, and Keisuke Yano for fruitful discussions. This work was supported by JSPS KAKENHI Grant numbers JP21K11781 and JP19K11865, and by JST CREST Grant number JPMJCR1763, Japan.
Ethics declarations
Data availability
The dataset analyzed during the current study is available in the Comprehensive R Archive Network (CRAN), https://cran.r-project.org.
Conflict of interest
The corresponding author states that there is no conflict of interest.
Additional information
Communicated by Shinto Eguchi.
Appendices
Appendix A: Mixed coordinate system
Denote the independent Poisson distribution by \(f(x;p)=(p^x/x!)e^{-1_n^\top p}\) for \(x\in {\mathbb {N}}^n\) and \(p\in {\mathbb {R}}_+^n\). The Kullback–Leibler divergence from f(x; p) to f(x; q) is
\( D(p,q)=\sum _{i=1}^n\left( p_i\log \frac{p_i}{q_i}-p_i+q_i\right) . \)
Let A and B be the matrices introduced in Sect. 2. Define two sets
\( {\mathcal {M}}(p)=\{q\in {\mathbb {R}}_+^n\mid Aq=Ap\} \)
and
\( {\mathcal {E}}(p)=\{q\in {\mathbb {R}}_+^n\mid B\log q=B\log p\}. \)
Lemma 7
(Pythagorean theorem; [37]) Let \(p,q,r\in {\mathbb {R}}_+^n\). If \(q\in {\mathcal {M}}(p)\cap {\mathcal {E}}(r)\), then
\( D(p,r)=D(p,q)+D(q,r). \)
In particular, \({\mathcal {M}}(p)\cap {\mathcal {E}}(p)=\{p\}\) for each p.
Proof
Let \(p=q+B^\top \beta \) and \(r=qe^{A^\top \alpha }\). The first assertion follows from
\( D(p,r)-D(p,q)-D(q,r)=(p-q)^\top (\log q-\log r)=-\beta ^\top BA^\top \alpha =0, \)
where the last equality uses \(AB^\top =0\).
Next, choose any \({\tilde{p}}\in {\mathcal {M}}(p)\cap {\mathcal {E}}(p)\). Then we have \(D(p,p)=D(p,{\tilde{p}})+D({\tilde{p}},p)\) by the Pythagorean relation, which implies \(p={\tilde{p}}\) by positive definiteness of D. \(\square \)
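The Pythagorean relation can be checked numerically. The following sketch uses a hypothetical minimal choice of A (margins of a \(2\times 2\) table) and B (the log-odds-ratio direction) satisfying \(AB^\top =0\); these are illustrative and are not the matrices of Sect. 2.

```python
import math

# Numerical check of D(p,r) = D(p,q) + D(q,r) for q in M(p) ∩ E(r).
# A, B are an illustrative 2x2-table choice with A B^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]

def kl(p, q):
    """Kullback-Leibler divergence between independent Poisson mean vectors."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

q = [2.0, 3.0, 1.5, 4.0]
beta, alpha = [0.3], [0.2, -0.1, 0.4]
# p = q + B^T beta, hence Ap = Aq and q lies in M(p)
p = [qi + B[0][i] * beta[0] for i, qi in enumerate(q)]
# r = q * exp(A^T alpha), hence B log q = B log r and q lies in E(r)
r = [qi * math.exp(sum(A[k][i] * alpha[k] for k in range(3)))
     for i, qi in enumerate(q)]

gap = kl(p, r) - kl(p, q) - kl(q, r)   # vanishes because A B^T = 0
```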
Since \(B\log p\) defines \({\mathcal {E}}(p)\) and Ap defines \({\mathcal {M}}(p)\), the lemma implies that the map \(p\mapsto (B\log p,Ap)\) is injective. Therefore, the mixed coordinate system is well defined.
Next, we show the orthogonality of the two manifolds \({\mathcal {E}}(p)\) and \({\mathcal {M}}(p)\) at \(p\in {\mathbb {R}}_+^n\) with respect to the Fisher information metric
\( g_p(u,v)=u^\top D_p^{-1}v,\quad u,v\in T_p{\mathbb {R}}_+^n, \)
where \(D_p\) is the diagonal matrix with diagonal vector p and \(T_p\) denotes the tangent space. A tangent vector in \(T_p{\mathcal {E}}(p)\) is written as \(D_pA^\top \alpha \) for some \(\alpha \in {\mathbb {R}}^d\) and that in \(T_p{\mathcal {M}}(p)\) is written as \(B^\top \beta \) for some \(\beta \in {\mathbb {R}}^{n-d}\). So we have \(g_p(D_pA^\top \alpha ,B^\top \beta ) = \alpha ^\top AB^\top \beta = 0\).
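This orthogonality can be verified with a short computation, again with a hypothetical \(2\times 2\)-table choice of A and B satisfying \(AB^\top =0\):

```python
# Tangent vectors D_p A^T alpha (of E(p)) and B^T beta (of M(p)) are
# orthogonal under g_p(u, v) = u^T D_p^{-1} v; A, B are an illustrative
# 2x2-table choice with A B^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]
p = [2.0, 3.0, 1.5, 4.0]
alpha, beta = [0.7, -0.2, 1.1], [0.5]

u = [p[i] * sum(A[k][i] * alpha[k] for k in range(3)) for i in range(4)]  # D_p A^T alpha
v = [B[0][i] * beta[0] for i in range(4)]                                 # B^T beta
g = sum(u[i] * v[i] / p[i] for i in range(4))                             # g_p(u, v)
```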
Finally, we prove that the range of \((\psi ,\lambda )\) is of the form \(\Omega _\psi \times \Omega _\lambda \) with \(\Omega _\psi ={\mathbb {R}}^{n-d}\) and \(\Omega _\lambda =A{\mathbb {R}}_+^n=\{Ap\mid p\in {\mathbb {R}}_+^n\}\). It is easy to see that the range of \(\psi \) is \({\mathbb {R}}^{n-d}\) and the range of \(\lambda \) is \(A{\mathbb {R}}_+^n\). Therefore, it is enough to show that \({\mathcal {M}}(p)\cap {\mathcal {E}}(r)\ne \emptyset \) for any pair \(p,r\in {\mathbb {R}}_+^n\). Consider the following convex optimization problem:
\( \min _{q\in {\mathcal {M}}(p)}D(q,r) \)
for given \(p,r\in {\mathbb {R}}_+^n\). Since D(q, r) diverges as q tends to the boundary of \({\mathcal {M}}(p)\), the optimal solution exists and satisfies the stationary condition
\( \log q-\log r=A^\top \nu , \)
where \(\nu \) is the Lagrange multiplier. This implies \(q\in {\mathcal {M}}(p)\cap {\mathcal {E}}(r)\).
See [2] for solving the minimization problem by iterative proportional scaling.
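For the \(2\times 2\) case, iterative proportional scaling admits a very small sketch. The toy loop below is an assumption-laden illustration, not the algorithm of [2]: starting from r and alternately rescaling rows and columns keeps the iterate inside \({\mathcal {E}}(r)\) (each rescaling multiplies q by \(e^{A^\top \delta }\) for some \(\delta \)) while driving its margins to Ap.

```python
# Iterative proportional scaling: find q in M(p) ∩ E(r) for a 2x2 table
# (a toy illustration; see [2] for the general algorithm).
p = [[2.0, 3.0], [1.5, 4.0]]
r = [[1.0, 2.0], [0.5, 3.0]]
q = [row[:] for row in r]
row_t = [sum(row) for row in p]                 # target row margins of Ap
col_t = [p[0][j] + p[1][j] for j in range(2)]   # target column margins
for _ in range(500):
    for i in range(2):                          # row rescaling: an E(r)-move
        s = row_t[i] / sum(q[i])
        q[i] = [s * x for x in q[i]]
    for j in range(2):                          # column rescaling: also an E(r)-move
        s = col_t[j] / (q[0][j] + q[1][j])
        q[0][j] *= s
        q[1][j] *= s
# At convergence q has the margins of p (q in M(p)) and the same
# cross-product ratio as r (q in E(r)), hence q in M(p) ∩ E(r).
```

Each rescaling preserves the cross-product ratio \(q_{11}q_{22}/(q_{12}q_{21})\) exactly, which is why membership in \({\mathcal {E}}(r)\) is maintained throughout.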
Appendix B: Proofs of Theorems
Proof of Theorem 1
Suppose that A is of the form (11).
Then the marginal distribution of Ax is the product of the independent Poisson distributions with mean vector Ap. The conditional distribution of x given Ax depends only on \(\psi (p)=B\log p\) by Lemma 2. Hence, Ax is ancillary. It is straightforward to see that Ax is ancillary if and only if LAx is ancillary for an invertible matrix L.
Conversely, suppose that the row space of A contains the all-one vector and Ax is ancillary. The marginal distribution of \(t=Ax\) is
\( f(t)=e^{-1_n^\top p}\sum _{x:Ax=t}\frac{p^x}{x!}. \)
Since the row space of A contains \(1_n\), \(1_n^\top p\) is identifiable from the marginal distribution of \(1_n^\top x\). Next, we prove that \(\{x\in {\mathbb {N}}^n\mid Ax=Ae_i\}=\{e_j\mid Ae_j=Ae_i\}\) for each i, where \(e_i\) denotes the i-th unit vector in \({\mathbb {R}}^n\). Indeed, if \(x\in {\mathbb {N}}^n\) and \(Ax=Ae_i\), then \(1_n^\top x=1_n^\top e_i=1\) and therefore \(x=e_j\) for some j. Define a partition \(\{I_k\}_{k=1}^K\) of \(\{1,\ldots ,n\}\) as the set of equivalence classes of the relation \(i\sim j \Leftrightarrow Ae_i=Ae_j\).
We have \(f(Ae_i)=\sum _{j\in I_k}p_j e^{-1_n^\top p}\) for \(i\in I_k\). Thus, \(\sum _{j\in I_k}p_j\) is identifiable from the marginal distribution f(t). Since the rank of A is d, we have \(K\ge d\). Suppose that \(K>d\). Define a configuration \({\tilde{A}}=({\tilde{a}}_{ki})\in {\mathbb {N}}^{K\times n}\) by \({\tilde{a}}_{ki}=1\) if \(i\in I_k\) and 0 otherwise. Note that \({\tilde{A}}p\) is identifiable from the discussion so far. Since the rank of \({\tilde{A}}\) is K, there exist two points p and q such that \(Ap=Aq\) and \({\tilde{A}}p\ne {\tilde{A}}q\), which implies \(B\log p\ne B\log q\) because the map \(p\mapsto (Ap,B\log p)\) is one-to-one. Thus, the distribution of f(t) depends on \(\psi (p)=B\log p\) and Ax is not ancillary for \(\psi \). This contradicts the assumption. Hence, \(K=d\), which means A has the form (11). \(\square \)
Proof of Theorem 2
We use the parameterization (4) and prove that \(t=Ax\) is complete for \(\alpha \). The marginal distribution of t is an exponential family
\( f(t)=C(t,\psi )e^{\alpha ^\top t-\phi (\alpha ,\psi )} \)
for fixed \(\psi \), where \(\phi (\alpha ,\psi )=1_n^\top p\) and \(C(t,\psi )=\sum _{Ax=t}e^{\psi ^\top (BB^\top )^{-1}Bx}/x!\). The range of the parameter \(\alpha \) is the whole space \({\mathbb {R}}^d\). Then the completeness follows from a general fact on exponential families. See, for example, Theorem 4.3.1 of [34]. \(\square \)
Proof of Theorem 3
We prove \({\bar{G}}_{\psi \psi }^{x\mid t}=(BD_p^{-1}B^\top )^{-1}\). First, the conditional score function with respect to \(\psi \) is
We show
Indeed, by using a decomposition of the identity matrix into two projection matrices
we obtain
from which (B1) follows. Now, we recall Theorem 1.1 of [1], which says
as \(N\rightarrow \infty \) on the event \(\{A{\hat{p}}\rightarrow Ap\}\), where \({\hat{p}}=x^{(N)}/N\). From (B1), it follows that
on the event \(\{A{\hat{p}}\rightarrow Ap\}\) and, in particular,
Finally, we use the argument given by Liang [18]. Let v be any vector in \({\mathbb {R}}^{n-d}\). The Portmanteau lemma ([36], Lemma 2.2) implies
where Z is a Gaussian random vector having the covariance matrix \((BD_p^{-1}B^\top )^{-1}\). Conversely, we have
because of the relation \(G_{N,\psi \psi }^{x\mid t}\preceq G_{N,\psi \psi }^x=N{\bar{G}}_{\psi \psi }^x\) for finite N. We deduce that \({\bar{G}}_{\psi \psi }^{x\mid t}=\lim N^{-1}E[s_\psi s_\psi ^\top ]\) exists and is equal to \({\bar{G}}_{\psi \psi }^x\). The proof is completed. \(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Sei, T. Conditional inference of Poisson models and information geometry: an ancillary review. Info. Geo. 7 (Suppl 1), 131–150 (2024). https://doi.org/10.1007/s41884-022-00082-w