1 Introduction

Conditional distributions of the independent Poisson model play a central role in categorical data analysis. For example, the multinomial distribution is obtained from the Poisson model by conditioning on the total count, and the hypergeometric distribution is obtained by conditioning on the marginal counts. These conditioning variables are written as Ax, where x is the Poisson random vector and A is an integer matrix. The conditional distribution of x given Ax is called the conditional Poisson model [1] or the A-hypergeometric distribution [2]. See Sect. 2 for a more precise definition. Inference on the Poisson mean parameters is usually performed on the basis of the conditional distribution. This type of inference is referred to as conditional inference.

The best-known example is Fisher’s exact test of independence for \(2\times 2\) contingency tables, where the p-value of a test statistic is computed under the conditional distribution given the marginal counts. Computation of the p-value for more complicated hypotheses is one of the central topics in algebraic statistics [3,4,5]. For parameter estimation, the conditional maximum likelihood estimator is defined as the maximizer of the conditional likelihood [6, 7]. A confidence interval for the parameter is constructed from the conditional distribution of the estimator. Exact computation of the estimator is difficult in most cases and has recently been investigated in an algebraic manner [2, 8,9,10].

One might ask, why is conditioning so important?

The main reason is that the conditional likelihood does not depend on the nuisance parameter, as we will see in Sect. 2. Here, the nuisance parameter in the current problem refers to the marginal distribution of Ax. The presence of nuisance parameters is problematic when their dimension is high; this is known as the Neyman–Scott problem [11, 12]. Conditional inference is effective in such cases.

Another reason is that the conditional distribution removes the effect of the data sampling scheme. In \(2\times 2\) contingency tables, the data are collected under various sampling schemes: no constraints, given the total, given the row (or column) marginals or given all the marginals. The underlying distribution of x changes from the independent Poisson distribution according to the scheme. However, the conditional distribution given all the marginals is common to all cases.

As the first reason indicates, the conditional likelihood carries no information about the nuisance parameter, which leads to a natural question: how much information about the parameter of interest remains in the marginal distribution of Ax? The goal of this paper is to review the answers to this question from the viewpoint of ancillarity.

In general, a statistic is said to be ancillary if its marginal distribution has no information about the parameter of interest. It would be desirable if our conditioning variable Ax were ancillary. Unfortunately, this is not true in general [13], as investigated for \(2\times 2\) tables by [14,15,16,17]. We recall this fact in Sect. 4. On the other hand, Ax is shown to be asymptotically ancillary in the sense of Liang [18], where a conditional limit theorem established by [1, 2] is essential. We also note the ancillarity criterion proposed by Godambe [19], which focuses on the space of estimating functions. The mixed coordinate system developed in information geometry is quite effective for describing these results [13, 20, 21]. This is a particular example of parameter orthogonality [22], under which conditional inference works well in general.

The remainder of the paper is organized as follows. The Poisson model, together with the mixed coordinate system, is introduced in Sect. 2. Section 3 reviews the ancillary properties of general statistical models. The ancillary properties of the Poisson model are summarized in Sect. 4. A self-contained description of the mixed coordinate system and theorem proofs are provided in Appendices A and B, respectively.

2 Conditional inference of Poisson models

2.1 Definition

Let \({\mathbb {N}}\) and \({\mathbb {R}}_+\) be the sets of non-negative integers and positive real numbers, respectively. Consider an independent Poisson model

$$\begin{aligned} f(x;p) = \frac{p^x}{x!}e^{-1_n^\top p}, \quad x=(x_i)_{i=1}^n\in {\mathbb {N}}^n, \end{aligned}$$

with the mean parameter \(p=(p_i)_{i=1}^n\in {\mathbb {R}}_+^n\), where the multi-index notation is adopted as \(p^x=\prod _i p_i^{x_i}\), \(x!=\prod _i x_i!\), and \(1_n=(1,\ldots ,1)^\top \in {\mathbb {N}}^n\). This model is an exponential family with the natural parameter \(\log p=(\log p_i)\in {\mathbb {R}}^n\) and the expectation parameter \(p\in {\mathbb {R}}_+^n\). We can read p as a “point” in the geometric sense. The maximum likelihood estimator of p, which maximizes \(f(x;p)\) with respect to p, is \({\hat{p}}=x\) whenever x is positive.
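As a quick numerical illustration (ours, not from the paper), the following Python snippet evaluates the independent Poisson log-likelihood for a small count vector and confirms that it is maximized at \({\hat{p}}=x\); the function name poisson_loglik and the data are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def poisson_loglik(p, x):
    """Log-likelihood of the independent Poisson model f(x;p)."""
    p = np.asarray(p, dtype=float)
    return np.sum(x * np.log(p) - p - gammaln(x + 1))

x = np.array([3.0, 1.0, 4.0, 2.0])  # an arbitrary positive count vector

# maximize over p > 0 through the unconstrained parametrization p = exp(eta)
res = minimize(lambda eta: -poisson_loglik(np.exp(eta), x), x0=np.zeros(4))
print(np.exp(res.x))  # numerically close to x, i.e. p_hat = x
```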

Many statistical models for categorical data have the form (e.g., [8])

$$\begin{aligned} \log p = A^\top \alpha + g(\theta ), \end{aligned}$$
(1)

where \(A\in {\mathbb {N}}^{d\times n}\) is a given matrix, \(g:{\mathbb {R}}^q\rightarrow {\mathbb {R}}^n\) is a given smooth function, \(\alpha \in {\mathbb {R}}^d\) is a nuisance parameter, and \(\theta \in {\mathbb {R}}^q\) is a parameter of interest. The model is called a log-affine model if g is affine, and a log-linear model if g is linear. The model is said to be saturated if the map \((\alpha ,\theta )\mapsto p\) is surjective.

Example 1

(Poisson regression) Let \(A=(1,\ldots ,1)\in {\mathbb {N}}^{1\times n}\) and \(g(\theta )=D^\top \theta \), where \(D^\top \in {\mathbb {R}}^{n\times q}\) is a design matrix. Then \(\alpha \) is a baseline and \(\theta \) represents regression coefficients. This model is not saturated if \(n>q+1\).
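The following Python sketch (ours) illustrates the structure of Example 1: it builds \(\log p=A^\top \alpha +D^\top \theta \) for a simulated data set and recovers \((\alpha ,\theta )\) by maximizing the unconditional Poisson likelihood. The design matrix, sample size, and parameter values are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)
n, q = 20, 2
A = np.ones((1, n))          # configuration: the total count
D = rng.normal(size=(q, n))  # D^T is an n x q design matrix (invented)

alpha_true, theta_true = 0.5, np.array([0.8, -0.3])
p_true = np.exp(A.T @ np.array([alpha_true]) + D.T @ theta_true)
x = rng.poisson(p_true)

def negloglik(par):
    """Negative log-likelihood under log p = A^T alpha + D^T theta."""
    alpha, theta = par[0], par[1:]
    logp = alpha * A.ravel() + D.T @ theta
    return -(x @ logp - np.exp(logp).sum() - gammaln(x + 1).sum())

fit = minimize(negloglik, x0=np.zeros(1 + q))
print(fit.x)  # estimates of (alpha, theta_1, theta_2)
```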

We next show an “unconventional” use of the log-linear model.

Example 2

(Fisher’s iris data) Table 1 gives the number \(x_{ij}\) of cases of Iris setosa that have sepal length \(L_i\) and sepal width \(W_j\), where the length scales are \(\{L_i\}_{i=1}^{100}=\{W_j\}_{j=1}^{100}=\{0.1,0.2,\ldots ,10.0\}\) in centimeters. The contingency table has 100 rows and 100 columns; the total number of cases is \(N=50\). The table is very sparse, as only 39 of 10,000 cells have a non-zero entry. Consider the statistical model

$$\begin{aligned} \log p_{ij} = \alpha _i + \beta _j + \theta L_iW_j,\quad 1\le i,j\le 100, \end{aligned}$$

where \(\alpha _i,\beta _j\) are nuisance parameters and \(\theta \in {\mathbb {R}}\) is the parameter of interest. We set \(\beta _1=0\) without loss of generality. Then the dimension of the nuisance parameter is 199. This model is written in the form of (1) with an integer matrix \(A\in {\mathbb {N}}^{199\times 10000}\) and \(g_{ij}(\theta )=\theta L_iW_j\). We will see that the conditional maximum likelihood estimate of \(\theta \) exists; see Example 6. This is an example of the minimum information dependence modeling recently proposed by [23].
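To make the size of the problem concrete, the following Python sketch (ours) constructs the \(199\times 10{,}000\) configuration matrix A and the vector \((L_iW_j)_{ij}\) for the model above, with cells ordered row by row; the ordering and variable names are our own choices.

```python
import numpy as np

m = 100      # 100 sepal-length levels and 100 sepal-width levels
n = m * m    # 10,000 cells, ordered row by row: (i, j) -> i*m + j (0-based)

# rows of A: 100 row indicators (alpha_i) and 99 column indicators (beta_j, j >= 2)
row_part = np.kron(np.eye(m), np.ones((1, m)))         # shape (100, 10000)
col_part = np.kron(np.ones((1, m)), np.eye(m))[1:, :]  # shape (99, 10000); beta_1 = 0 dropped
A = np.vstack([row_part, col_part]).astype(int)        # shape (199, 10000)

L = W = 0.1 * np.arange(1, m + 1)   # 0.1, 0.2, ..., 10.0 (cm)
g = np.outer(L, W).ravel()          # the vector (L_i W_j)_{ij}, so g(theta) = theta * g
print(A.shape, g.shape)             # (199, 10000) (10000,)
```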

Table 1 Sepal length and width of I. setosa in Fisher’s iris data

The conditional distribution of x given \(t=Ax\) is

$$\begin{aligned} f(x\mid t;p) = \frac{f(x;p)}{\sum _{Ay=t}f(y;p)} = \frac{1}{Z(t;p)}\frac{p^x}{x!}, \end{aligned}$$
(2)

where \(Z(t;p) = \sum _{Ay=t} p^y/y!\) is the normalizing constant. Here, by abuse of notation, we use the same symbol f for the joint and conditional distributions.
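For tiny examples, the fiber \(\{y\in {\mathbb {N}}^n: Ay=t\}\) and the normalizing constant Z(t;p) can be computed by brute-force enumeration. The following Python sketch (ours) does this for the \(2\times 2\) configuration appearing in Example 4 below; the bound max_count and the choice of p are arbitrary, and the approach is infeasible beyond toy cases.

```python
import itertools
import math
import numpy as np

def fiber(A, t, max_count):
    """Brute-force enumeration of {y in N^n : Ay = t}; feasible only for tiny examples."""
    n = A.shape[1]
    return [np.array(y) for y in itertools.product(range(max_count + 1), repeat=n)
            if np.array_equal(A @ np.array(y), t)]

def normalizing_constant(A, t, p, max_count):
    """Z(t;p) = sum over the fiber of p^y / y!."""
    return sum(np.prod(p ** y) / np.prod([math.factorial(int(v)) for v in y])
               for y in fiber(A, t, max_count))

# the 2x2 configuration of Example 4 below (row margins and first column fixed)
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
t = np.array([3, 2, 3])
p = np.array([1.0, 2.0, 0.5, 1.5])
print(len(fiber(A, t, 3)), normalizing_constant(A, t, p, 3))  # 3 tables in the fiber
```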

Definition 1

The conditional distribution (2) is called the conditional Poisson model or the A-hypergeometric distribution. The matrix A is called a configuration. We assume that A is of full row rank (i.e., of rank d) and that the row space of A contains \(1_n\) unless otherwise stated.

Lemma 1

(Chapter 1 of [1]) Assume the model (1). Then the conditional distribution (2) does not depend on \(\alpha \). In other words, \(t=Ax\) is a sufficient statistic for \(\alpha \).

Proof

Under the model (1), we have \(p^x= e^{\alpha ^\top t + g(\theta )^\top x}\), where \(t=Ax\). The factor depending only on t is canceled out in (2). \(\square \)

The lemma states that the conditional distribution does not depend on the nuisance parameter. This is one reason that the conditional distribution is important, as stated in Sect. 1. We sometimes use \(f(x\mid t;\theta )\) rather than \(f(x\mid t;p)\).

Example 3

(Continuation) The conditional distribution for the iris data in Example 2 is

$$\begin{aligned} f(x\mid t; \theta ) = \frac{e^{\theta \sum _{i,j} L_iW_jx_{ij}}/\prod _{i,j}x_{ij}!}{\sum _{Ay=t}e^{\theta \sum _{i,j}L_iW_jy_{ij}}/\prod _{i,j}y_{ij}!}, \end{aligned}$$
(3)

where y ranges over the tables that have the same marginals as x. All the nuisance parameters are canceled out.

2.2 Mixed coordinate system

For a given configuration \(A\in {\mathbb {N}}^{d\times n}\) with \(n>d\), choose a matrix \(B\in {\mathbb {Z}}^{(n-d)\times n}\) of rank \(n-d\) such that \(AB^\top =0\). Matrix B is called the Gale transform of A in the theory of convex polytopes (see [2]). We will use the following coordinate system of p:

$$\begin{aligned} \psi (p) = B\log p \quad \text{ and } \quad \lambda (p) = Ap. \end{aligned}$$

The map \((\psi ,\lambda ):{\mathbb {R}}_+^n\rightarrow {\mathbb {R}}^{n-d}\times {\mathbb {R}}^d\) actually defines a coordinate system, which is called the mixed coordinate system in information geometry (see [13, 20]). It is known that \(\psi (p)\) and \(\lambda (p)\) are orthogonal with respect to the Fisher information metric. Moreover, the range of \((\psi ,\lambda )\) is written as \(\Omega _\psi \times \Omega _\lambda ={\mathbb {R}}^{n-d}\times A{\mathbb {R}}_+^n\). See Appendix A for details regarding these facts. The symbols \(\psi \) and \(\lambda \) are used in accordance with [16].
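In practice, a Gale transform B can be obtained from an integer basis of the kernel of A. The following Python sketch (ours, using sympy) computes such a B for the \(2\times 2\) configuration of Example 4 and checks \(AB^\top =0\); for other configurations one may need to rescale the kernel vectors to clear denominators.

```python
import sympy as sp

# the 2x2 independence configuration of Example 4
A = sp.Matrix([[1, 1, 0, 0],
               [0, 0, 1, 1],
               [1, 0, 1, 0]])

# kernel vectors of A give the rows of B; here the kernel is one-dimensional
# and already integral (in general, rescale to clear denominators)
kernel = A.nullspace()
B = sp.Matrix.hstack(*kernel).T
print(B)        # Matrix([[1, -1, -1, 1]]) up to sign
print(A * B.T)  # the zero matrix, i.e. A B^T = 0
```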

Since p and \((\psi (p),\lambda (p))\) have a one-to-one correspondence, we have the following lemma.

Lemma 2

(Chapter 8 of [13]) Consider a log-linear model

$$\begin{aligned} \log p=A^\top \alpha + B^\top (BB^\top )^{-1}\psi ,\quad (\alpha ,\psi )\in {\mathbb {R}}^d\times {\mathbb {R}}^{n-d}. \end{aligned}$$
(4)

This model is saturated, and the nuisance parameter \(\alpha \) is in one-to-one correspondence with \(\lambda (p)\) for given \(\psi \).

Proof

Since the row spaces of A and B span \({\mathbb {R}}^n\), (4) is saturated. It can be immediately seen that \(B\log p=\psi \) since \(BA^\top =0\). Then the one-to-one correspondence between \(\alpha \) and \(\lambda (p)\) for given \(\psi =\psi (p)\) follows from the correspondences \((\alpha ,\psi )\leftrightarrow p\) and \(p\leftrightarrow (\psi (p),\lambda (p))\). \(\square \)

From the lemma, \(\lambda (p)\) is considered as the nuisance parameter and \(\psi (p)\) is the parameter of interest. The A-hypergeometric distribution (2) is then

$$\begin{aligned} f(x\mid t;p) \propto \frac{1}{x!}e^{\psi (p)^\top (BB^\top )^{-1}Bx}, \end{aligned}$$

where the normalizing constant is omitted. The quantity \(e^{\psi (p)}\in {\mathbb {R}}_+^{n-d}\) is called the generalized odds ratio in [2].

Example 4

(2 by 2 contingency table) Let \(p=(p_{11},p_{12},p_{21},p_{22})\) represent a \(2\times 2\) contingency table. In many applications, the log-odds ratio \(\psi (p)=\log (p_{11}p_{22}/p_{12}p_{21})\) is the parameter of interest. The marginal distributions \(p_{i+}=\sum _j p_{ij}\) and \(p_{+j}=\sum _ip_{ij}\) are nuisance. In this case, matrices A and B are

$$\begin{aligned} A = \begin{pmatrix} 1&{} 1&{} 0&{} 0\\ 0&{} 0&{} 1&{} 1\\ 1&{} 0&{} 1&{} 0 \end{pmatrix} \quad \text{ and } \quad B = \begin{pmatrix} 1&-1&-1&1 \end{pmatrix}. \end{aligned}$$

Indeed, \(\psi (p)=B\log p=\log (p_{11}p_{22}/p_{12}p_{21})\) and \(\lambda (p)=Ap=(p_{1+},p_{2+},p_{+1})\). Note that Ap determines all the marginals because \(p_{+2}=p_{1+}+p_{2+}-p_{+1}\). The A-hypergeometric distribution is

$$\begin{aligned} f(x\mid t;p) \propto \frac{1}{x_{11}!x_{12}!x_{21}!x_{22}!} e^{\psi (p)(x_{11}+x_{22}-x_{12}-x_{21})/4}, \end{aligned}$$

where \(x=(x_{11},x_{12},x_{21},x_{22})^\top \). If we write \(x_{11}=k,x_{12}=t_1-k,x_{21}=t_3-k,x_{22}=t_2-t_3+k\) with \(k\in {\mathbb {N}}\), then we have a more familiar form of the noncentral hypergeometric distribution

$$\begin{aligned} f(k\mid t;\psi ) = \frac{\left( {\begin{array}{c}t_1\\ k\end{array}}\right) \left( {\begin{array}{c}t_2\\ t_3-k\end{array}}\right) e^{k\psi }}{\sum _j\left( {\begin{array}{c}t_1\\ j\end{array}}\right) \left( {\begin{array}{c}t_2\\ t_3-j\end{array}}\right) e^{j\psi }}. \end{aligned}$$

Fisher’s exact test examines the hypothesis \(\psi =0\) against \(\psi \ne 0\). The p-value is calculated on the basis of the hypergeometric distribution for \(\psi =0\).
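As a concrete illustration (our code, not part of the paper), the following Python snippet evaluates the noncentral hypergeometric probability mass function above and computes a two-sided Fisher exact p-value at \(\psi =0\) by summing the outcomes whose null probability does not exceed that of the observed table; the margins \((t_1,t_2,t_3)=(3,2,3)\) match Example 5 below.

```python
import math

def nc_hypergeom_pmf(k, t1, t2, t3, psi):
    """Noncentral hypergeometric pmf f(k | t; psi) from the formula above."""
    lo, hi = max(0, t3 - t2), min(t1, t3)  # support of k
    w = {j: math.comb(t1, j) * math.comb(t2, t3 - j) * math.exp(j * psi)
         for j in range(lo, hi + 1)}
    return w[k] / sum(w.values())

def fisher_exact_pvalue(x11, t1, t2, t3):
    """Two-sided p-value: sum of outcomes no more probable than the observed one at psi = 0."""
    lo, hi = max(0, t3 - t2), min(t1, t3)
    probs = {j: nc_hypergeom_pmf(j, t1, t2, t3, 0.0) for j in range(lo, hi + 1)}
    return sum(q for q in probs.values() if q <= probs[x11] + 1e-12)

# observed table [[3, 0], [0, 2]]: t1 = 3, t2 = 2, t3 = 3, x11 = 3
print(fisher_exact_pvalue(3, 3, 2, 3))  # 0.1
```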

Table 2 compares symbols used in several papers for ease of reference.

Table 2 Correspondence of symbols

2.3 Conditional maximum likelihood estimator

The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi (p)=B\log p\) is defined as a maximizer of \(f(x\mid t;p)\) with respect to \(\psi \) for given x. The estimator is, in general, different from the unconditional maximum likelihood estimator \({\hat{\psi }}_\mathrm{MLE}=B\log x\), as the following example shows. A sufficient condition for the coincidence of the two estimators is provided in Sect. 4.

Example 5

(Continuation) Consider again Example 4. If \(t=(3,2,3)^\top \), then possible outcomes of x are

$$\begin{aligned} x(1) = \begin{array}{|c|c|} \hline 3&{} 0\\ \hline 0&{} 2\\ \hline \end{array}, \quad x(2) = \begin{array}{|c|c|} \hline 2&{} 1\\ \hline 1&{} 1\\ \hline \end{array} \quad \text{ and } \quad x(3) = \begin{array}{|c|c|} \hline 1&{} 2\\ \hline 2&{} 0\\ \hline \end{array} \end{aligned}$$

in the form of contingency tables. If the observation is \(x=x(2)\), the conditional likelihood is

$$\begin{aligned} f(x(2)\mid t;p)&= \frac{\frac{1}{2!1!1!1!}p_{11}^2p_{12}p_{21}p_{22}}{\frac{1}{3!0!0!2!}p_{11}^3p_{22}^2+\frac{1}{2!1!1!1!}p_{11}^2p_{12}p_{21}p_{22}+\frac{1}{1!2!2!0!}p_{11}p_{12}^2p_{21}^2} \\&= \frac{\frac{1}{2}e^\psi }{\frac{1}{12}e^{2\psi }+\frac{1}{2}e^\psi +\frac{1}{4}}, \end{aligned}$$

which is maximized at \({\hat{\psi }}=(1/2)\log 3\). This value is different from the unconditional maximum likelihood estimate \({\hat{\psi }}_\mathrm{MLE}=B\log x(2)=\log 2\). If the observation is \(x=x(1)\) or \(x=x(3)\), the conditional and unconditional maximum likelihood estimates do not exist.
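The closed-form maximization above can be checked numerically. The following short Python sketch (ours) minimizes the negative conditional log-likelihood of x(2) and recovers \({\hat{\psi }}=(1/2)\log 3\approx 0.549\); the bounds passed to the optimizer are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_cond_loglik(psi):
    # conditional likelihood of x(2) given t = (3, 2, 3), as derived above
    num = 0.5 * np.exp(psi)
    den = np.exp(2 * psi) / 12 + 0.5 * np.exp(psi) + 0.25
    return -np.log(num / den)

res = minimize_scalar(neg_cond_loglik, bounds=(-5.0, 5.0), method="bounded")
print(res.x, 0.5 * np.log(3))  # both approximately 0.549
```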

The conditional maximum likelihood estimator for unsaturated models is defined as the maximizer of \(f(x\mid t;\theta )\) as well. Below is an example.

Example 6

(Continuation) For the iris data in Example 2, the conditional maximum likelihood estimate is approximately \({\hat{\theta }}=12.6\), where the Markov chain Monte Carlo method with a Markov basis is used to solve the likelihood equation

$$\begin{aligned} \partial _\theta \log f(x\mid t;\theta ) = 0 \end{aligned}$$

numerically. See [4, 5] for the details of Markov bases. Indeed, the denominator of (3) has approximately \(50!\approx 3\times 10^{64}\) terms, and an exact computation seems impossible. For the same data, we can also fit a Gaussian model \(f(L,W)\propto \exp (\alpha _1 L + \alpha _2 L^2+\beta _1 W + \beta _2 W^2+\theta LW)\) to the sepal length L and width W of I. setosa, from which we obtain \({\hat{\theta }}_\mathrm{Gauss}=12.4\). The two estimates are surprisingly close. See [23] for other examples.

One important class of unsaturated models is the hierarchical model for contingency tables. Aoki et al. [25] applied the hierarchical model to an analysis of stratified educational data, which was further analyzed in [8]. For other examples of unsaturated models, we refer to [9] for the Gibbs random partition and [26] for exponential permutation models.

The existence and uniqueness of the conditional maximum likelihood estimator for log-affine models are summarized as follows.

Lemma 3

(e.g. [27]) Let A be a configuration. Consider a log-affine model

$$\begin{aligned} \log p = A^\top \alpha + D^\top \theta + \nu _0, \end{aligned}$$

where \(D\in {\mathbb {R}}^{q\times n}\) is a matrix such that the rows of A and D are linearly independent, and \(\nu _0\in {\mathbb {R}}^n\) is a fixed vector. Then for a given observation x, the conditional maximum likelihood estimate \({\hat{\theta }}\) exists if and only if Dx lies in the interior of the convex hull of \(\{D y\mid Ay=t,\ y\in {\mathbb {N}}^n\}\). If the estimate exists, it is unique.
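For a one-dimensional \(\theta \), the interior condition of Lemma 3 reduces to strict inequalities \(\min<Dx<\max \) over the fiber, which can be checked by brute force in small cases. The following Python sketch (ours) does this for the margins of Example 5 with D taken proportional to B; it reproduces the observation that the estimate exists for x(2) but not for x(1) or x(3).

```python
import itertools
import numpy as np

A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
D = np.array([[1, -1, -1, 1]])  # statistic D x coupled with theta (here proportional to B)
t = np.array([3, 2, 3])

fiber = [np.array(y) for y in itertools.product(range(4), repeat=4)
         if np.array_equal(A @ np.array(y), t)]
values = [(D @ y).item() for y in fiber]

for y, v in zip(fiber, values):
    # for q = 1 the convex hull is an interval, so "interior" means strict inequalities
    exists = min(values) < v < max(values)
    print(y, v, "estimate exists" if exists else "estimate does not exist")
```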

An extension of the maximum likelihood estimator that admits zero probabilities is discussed by [24, 27, 28]. See also [10] for issues of sufficient statistics caused by zero counts.

Although the conditional maximum likelihood estimator is mathematically characterized, its computation is not easy in most cases [29], as indicated by Example 6. Recently, exact computation of the estimator has been investigated in algebraic frameworks such as the holonomic gradient method. See [2, 8,9,10] for this direction. We do not discuss it here and instead focus on statistical properties in the sections that follow.

3 Ancillary statistics

In this section, we review the definitions and properties of ancillary statistics for general statistical models.

3.1 Exact sense

Consider a parametric model \(f(x; \psi ,\lambda )\) of probability density functions, where \(x\in \Omega _x\) denotes the data and \((\psi ,\lambda )\in \Omega _\psi \times \Omega _\lambda \) is a parameter. The parameter \(\psi \) is of interest and \(\lambda \) is a nuisance parameter. Throughout this section, we will assume that f is positive everywhere.

Definition 2

(Ancillary statistic; [30]) A statistic \(t=t(x)\) is said to be ancillary for \(\psi \) (in the exact sense) if the marginal density of t does not depend on \(\psi \) and the conditional density of x given t does not depend on \(\lambda \); that is,

$$\begin{aligned} f(x;\psi ,\lambda ) = f(x\mid t;\psi )f(t;\lambda ). \end{aligned}$$
(5)

By abuse of notation, we use the same symbol f for the joint, conditional, and marginal densities.

If t is not given a priori, there exists an ambiguity in the choice of t [31]. We do not address this problem.

Example 7

Consider an independent and identically distributed sequence \(y_1,y_2,\ldots \) with the density function \(f(y_i;\psi )\). We observe the data up to a random time t, drawn independently of the sequence, so that the observed data are \(x=(y_1,\ldots ,y_t)\). Suppose that the probability mass function of t is \(f(t;\lambda )\). The likelihood function of x is

$$\begin{aligned} f(x;\psi ,\lambda ) = f(t;\lambda )\prod _{i=1}^t f(y_i;\psi ), \end{aligned}$$

which is of the form (5). Hence, t is ancillary for \(\psi \).

If t is ancillary for \(\psi \), then \(f(t;\lambda )\) has no information regarding \(\psi \). Thus, it is natural to use the conditional likelihood \(f(x\mid t;\psi )\) for inference on \(\psi \).

The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi \) based on \(f(x\mid t;\psi )\) coincides with the unconditional maximum likelihood estimator if t is ancillary; this follows immediately from Eq. (5). A debatable point is which sampling distribution of \({\hat{\psi }}\), conditional or unconditional, should be used for interval estimation. The common practice is to use the conditional distribution since it does not depend on the nuisance parameter. Conditioning also avoids unnecessary assumptions concerning t when making inferences about \(\psi \), which is Fisher’s original argument for conditional inference.

In the Bayesian method, any inference is derived from the posterior distribution. Suppose that the prior density is independent: \(\pi (\psi ,\lambda )=\pi (\psi )\pi (\lambda )\). Then, under the assumption that \(t=t(x)\) is ancillary, the posterior density is decomposed as

$$\begin{aligned} \pi (\psi ,\lambda \mid x)&\propto f(x\mid \psi ,\lambda )\pi (\psi ,\lambda ) \\&\propto \{f(x\mid t;\psi )\pi (\psi )\}\{f(t;\lambda )\pi (\lambda )\}, \end{aligned}$$

which implies \(\pi (\psi ,\lambda \mid x)=\pi (\psi \mid x)\pi (\lambda \mid t)\) and \(\pi (\psi \mid x)\propto f(x\mid t;\psi )\pi (\psi )\). Hence, the inference on \(\psi \) is the same as that based on the conditional model.

3.2 Godambe’s sense

We provide a version of ancillarity introduced by [19]. Regularity conditions are not mentioned explicitly; see [19, 32, 33] for details.

Suppose that t is sufficient for \(\lambda \); that is,

$$\begin{aligned} f(x;\psi ,\lambda )=f(x\mid t;\psi )f(t;\psi ,\lambda ). \end{aligned}$$

The Poisson model satisfies this condition (see Lemma 2).

If our interest is to estimate the parameter \(\psi \), it is reasonable to consider an estimating equation of the form

$$\begin{aligned} g(x;{\hat{\psi }}) = 0, \end{aligned}$$
(6)

where \(g(x;\psi )\) is a vector-valued function referred to as an estimating function. For example, the estimating function providing the conditional maximum likelihood estimator is \(g(x;\psi )=\partial _\psi \log f(x\mid t;\psi )\). Note that different estimating functions may define the same estimator.

The estimating function is often assumed to be unbiased. That is,

$$\begin{aligned} \sum _x f(x;\psi ,\lambda )g(x;\psi ) = 0 \end{aligned}$$

for any \((\psi ,\lambda )\), because it implies consistency and asymptotic normality of the estimator in typical problems. Here, the sums need to be replaced with integrals for general sample spaces. The unbiasedness condition is rewritten as

$$\begin{aligned} \sum _t f(t;\psi ,\lambda )\left\{ \sum _{x\mid t}f(x\mid t;\psi )g(x;\psi ) \right\} = 0. \end{aligned}$$
(7)

Definition 3

(Godambe’s ancillarity) Suppose that \(t=t(x)\) is a sufficient statistic for \(\lambda \). Then t is said to be ancillary in Godambe’s sense (or in a complete sense) if t is a complete sufficient statistic for the marginal model \(\{f(t;\psi ,\lambda )\mid \lambda \in \Omega _\lambda \}\) for each fixed \(\psi \), where completeness means that the only integrable function h satisfying the functional equation \(\sum _t f(t;\psi ,\lambda )h(t;\psi )=0\) for all \(\lambda \in \Omega _\lambda \) is the trivial solution \(h=0\).

This definition is not an extension of the exact ancillarity defined in the preceding subsection. Indeed, even if \(f(t;\psi ,\lambda )\) does not depend on \(\psi \), the statistic t is not ancillary in Godambe’s sense unless t is complete.

Suppose that t is ancillary in Godambe’s sense and that \(g(x;\psi )\) is an unbiased estimating function. Then, from (7) and the completeness of t, we can deduce

$$\begin{aligned} \sum _{x\mid t}f(x\mid t;\psi )g(x;\psi ) = 0 \end{aligned}$$
(8)

for any \(\psi \), which means \(g(x;\psi )\) is also an unbiased estimating function with respect to the conditional density. Therefore, we can reduce the class of estimating functions to those that are unbiased with respect to the conditional model.

The conditional maximum likelihood estimator is optimal in the sense of the following lemma, which is slightly modified from the original theorem of [19] for simplicity. We use \(M\succeq N\) for matrices M and N if \(M-N\) is positive semi-definite.

Lemma 4

[19] Let t be ancillary in Godambe’s sense. Then, for any unbiased estimating function g, the following inequality holds:

$$\begin{aligned} \left( E[\partial _\psi g^\top ]^{-1}\right) ^\top E[gg^\top ]E[\partial _\psi g^\top ]^{-1} \succeq E[ss^\top ]^{-1}, \end{aligned}$$

where \(s=\partial _\psi \log f(x\mid t;\psi )\) and E denotes expectation with respect to \(f(x;\psi ,\lambda )\). The equality is attained if \(g=s\).

Proof

The proof is similar to that of the Cramér–Rao inequality. By differentiating the unbiasedness condition (8) and taking expectation with respect to t, we obtain \(E[sg^\top + \partial _\psi g^\top ] = 0\). The Cauchy–Schwarz inequality yields \((a^\top E[\partial _\psi g^\top ]b)^2\le (a^\top E[ss^\top ]a)(b^\top E[gg^\top ]b)\) for any vectors a and b. Put \(a=E[ss^\top ]^{-1}v\) and \(b=E[\partial _\psi g^\top ]^{-1}v\) to obtain the desired inequality. If \(g=s\), then \(E[ss^\top ]=-E[\partial _\psi s^\top ]\) and the equality follows. \(\square \)

Example 8

Consider a normal model \(x_i \sim \mathrm {N}(\lambda ,\psi ^2)\), \(1\le i\le n\), where \(\lambda \) is a nuisance parameter and \(\psi \) is of interest. The sample mean \({\bar{x}}=\sum _{i=1}^n x_i/n\) is sufficient for \(\lambda \), but not ancillary for \(\psi \) in the exact sense because the marginal distribution \(\mathrm {N}(\lambda ,\psi ^2/n)\) depends on \(\psi \). However, \({\bar{x}}\) is ancillary in Godambe’s sense. Indeed, the completeness of \({\bar{x}}\) with respect to \(\lambda \) follows from a general fact regarding exponential families (see Theorem 4.3.1 of [34]). The conditional maximum likelihood estimator is shown to be the sample variance \({\hat{\psi }}^2=(n-1)^{-1}\sum _i(x_i-{\bar{x}})^2\).
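A short numerical check (ours) of the last claim: the conditional density of x given \({\bar{x}}\) is proportional, as a function of \(\psi \), to \(\psi ^{-(n-1)}\exp \{-\sum _i(x_i-{\bar{x}})^2/(2\psi ^2)\}\), and maximizing it numerically over \(\psi \) returns the \((n-1)\)-denominator sample variance. The simulated data and optimizer bounds are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=25)
n, xbar = x.size, x.mean()
ss = np.sum((x - xbar) ** 2)

def neg_cond_loglik(psi):
    # conditional log-likelihood of x given xbar, up to an additive constant
    return (n - 1) * np.log(psi) + ss / (2.0 * psi ** 2)

res = minimize_scalar(neg_cond_loglik, bounds=(1e-3, 10.0), method="bounded")
print(res.x ** 2, ss / (n - 1))  # conditional MLE of psi^2 vs sample variance
```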

3.3 Asymptotic sense

We define asymptotic ancillarity according to [18]. Regularity conditions are not mentioned explicitly; see [18] for details.

For any statistical model \(f(x;\theta )\) and statistic \(t=t(x)\), the Fisher information metric tensor and its conditional counterpart are defined by

$$\begin{aligned} G^x(\theta )&= \sum _x f(x;\theta )\{\partial _\theta \log f(x;\theta )\}\{\partial _\theta \log f(x;\theta )\}^\top \quad \text{ and } \\ G^{x\mid t}(\theta )&= \sum _t f(t;\theta )\sum _{x\mid t}f(x\mid t;\theta )\{\partial _\theta \log f(x\mid t;\theta )\}\{\partial _\theta \log f(x\mid t;\theta )\}^\top , \end{aligned}$$

respectively. The sums need to be replaced with integrals for general sample spaces. The Fisher information metric quantifies relative changes in the probability density against changes in the parameter. The decomposition

$$\begin{aligned} G^x(\theta ) = G^{x\mid t}(\theta ) + G^t(\theta ) \end{aligned}$$

holds in general. The following lemma characterizes the ancillary property in terms of the Fisher information.

Lemma 5

(Chapter 7 of [35]) A statistic t is ancillary for \(\psi \) (in the exact sense) if and only if

$$\begin{aligned} G^x(\psi ,\lambda ) = \begin{pmatrix} G_{\psi \psi }^{x\mid t}(\psi ,\lambda )&{} 0\\ 0&{} G_{\lambda \lambda }^t(\psi ,\lambda ) \end{pmatrix}, \end{aligned}$$
(9)

where \(G_{\psi \psi }^{x\mid t}\) denotes the \(\psi \)-components of the matrix \(G^{x\mid t}\) and so on.

Proof

The only if part is straightforward. Conversely, assume (9). Then \(G_{\psi \psi }^t=G_{\psi \psi }^x-G_{\psi \psi }^{x\mid t}=0\) and \(G_{\lambda \lambda }^{x\mid t}=G_{\lambda \lambda }^x-G_{\lambda \lambda }^t=0\). From the definition of Fisher information, we deduce that \(\partial _\psi \log f(t)=0\) and \(\partial _\lambda \log f(x\mid t)=0\). Therefore, t is ancillary for \(\psi \). \(\square \)

Let us now consider a sequence of statistical models \(f_N(x;\psi ,\lambda )\) indexed by N. Let \(G_N^x\) and \(G_N^{x\mid t}\) be the Fisher information with respect to \(f_N\). Define the asymptotic Fisher information and its conditional counterpart by

$$\begin{aligned} {\bar{G}}^x(\theta ) = \lim _{N\rightarrow \infty }\frac{1}{N}G_N^x(\theta ) \quad \text{ and }\quad {\bar{G}}^{x\mid t}(\theta ) = \lim _{N\rightarrow \infty }\frac{1}{N}G_N^{x\mid t}(\theta ), \end{aligned}$$

whenever they exist. We assume \({\bar{G}}^x(\theta )\) is positive definite to avoid uninteresting cases.

Definition 4

[18] A statistic \(t=t(x)\) is said to be asymptotically ancillary for \(\psi \) if the following relation holds:

$$\begin{aligned} {\bar{G}}^x(\psi ,\lambda ) = \begin{pmatrix} {\bar{G}}_{\psi \psi }^{x\mid t}(\psi ,\lambda )&{} 0\\ 0&{} {\bar{G}}_{\lambda \lambda }^t(\psi ,\lambda ) \end{pmatrix}. \end{aligned}$$
(10)

Example 9

(Continuation of Example 8) Let \(x_i\sim \mathrm {N}(\lambda ,\psi ^2)\) for \(1\le i\le N\). The Fisher information matrices of \((\psi ,\lambda )\) on \(x=(x_1,\ldots ,x_N)\) and \(t={\bar{x}}\) are

$$\begin{aligned} G^x = \begin{pmatrix} 2N/\psi ^2&{} 0\\ 0&{} N/\psi ^2 \end{pmatrix} \quad \text{ and }\quad G^t = \begin{pmatrix} 2/\psi ^2&{} 0\\ 0&{} N/\psi ^2 \end{pmatrix}, \end{aligned}$$

respectively. The asymptotic Fisher information matrices are

$$\begin{aligned} {\bar{G}}^x = \begin{pmatrix} 2/\psi ^2&{} 0\\ 0&{} 1/\psi ^2 \end{pmatrix} \quad \text{ and }\quad {\bar{G}}^t = \begin{pmatrix} 0&{} 0\\ 0&{} 1/\psi ^2 \end{pmatrix}. \end{aligned}$$

Thus, t is asymptotically ancillary because of the identity \({\bar{G}}^x={\bar{G}}^{x\mid t}+{\bar{G}}^t\).

4 Ancillary statistics of the Poisson model

Consider again the independent Poisson model with a configuration matrix A and its Gale transform B. The parameter of interest is \(\psi (p)=B\log p\) and the nuisance parameter is \(\lambda (p)=Ap\).

4.1 Conditions for exact ancillarity

We first give an example of ancillary statistics of the independent Poisson model.

Example 10

(Product binomial sampling) Consider a \(2\times 2\) contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \) and the configuration

$$\begin{aligned} A= \begin{pmatrix} 1&{} 1&{} 0&{} 0\\ 0&{} 0&{} 1&{} 1 \end{pmatrix} \quad \text{ and } \quad B=\begin{pmatrix} 1&{} -1&{} 0&{} 0\\ 0&{} 0&{} 1&{} -1 \end{pmatrix}. \end{aligned}$$

Let \(t=Ax=(x_{1+},x_{2+})^\top \). Then the marginal and conditional distributions are

$$\begin{aligned} f(t)&= \frac{\lambda _1^{t_1}e^{-\lambda _1}}{t_1!}\frac{\lambda _2^{t_2}e^{-\lambda _2}}{t_2!}\quad \text{ and } \\ f(x\mid t)&= \left( {\begin{array}{c}t_1\\ x_{11}\end{array}}\right) \left( {\begin{array}{c}t_2\\ x_{21}\end{array}}\right) \frac{e^{\psi _1 x_{11}}}{(e^{\psi _1}+1)^{t_1}} \frac{e^{\psi _2 x_{21}}}{(e^{\psi _2}+1)^{t_2}}, \end{aligned}$$

respectively, where \(\psi =(\psi _1,\psi _2)=(\log (p_{11}/p_{12}),\log (p_{21}/p_{22}))\) and \(\lambda =(\lambda _1,\lambda _2)=(p_{1+},p_{2+})\). Hence, t is ancillary for \(\psi \). This conditioning scheme is called the product binomial sampling because it is the product of binomial distributions (see, e.g., [14]).
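The exact ancillarity in this example can be verified numerically. The following Python sketch (ours) computes the conditional distribution of x given \(t=(x_{1+},x_{2+})\) under two mean vectors p that share the same odds \((e^{\psi _1},e^{\psi _2})=(2,3)\) but have different row totals, and confirms that the two conditional laws coincide; the specific numbers are arbitrary.

```python
import itertools
import math
import numpy as np

A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
t = np.array([3, 2])

def conditional_pmf(p):
    """Conditional distribution of x given Ax = t under independent Poisson(p)."""
    fiber = [np.array(y) for y in itertools.product(range(4), repeat=4)
             if np.array_equal(A @ np.array(y), t)]
    w = np.array([np.prod(p ** y) / np.prod([math.factorial(int(v)) for v in y])
                  for y in fiber])
    return w / w.sum()

# same odds (e^{psi_1}, e^{psi_2}) = (2, 3), different row totals lambda
q1 = conditional_pmf(np.array([2.0, 1.0, 3.0, 1.0]))
q2 = conditional_pmf(np.array([4.0, 2.0, 1.5, 0.5]))
print(np.allclose(q1, q2))  # True: the conditional law does not depend on lambda
```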

One may expect that the conditioning variable Ax for any configuration A is ancillary for \(\psi =B\log p\). However, this is not true in general, as the following example shows.

Example 11

(Example 2 of [15]) For the contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \), let \(t=Ax=(x_{1+},x_{2+},x_{+1})^\top \), where

$$\begin{aligned} A = \begin{pmatrix} 1&{} 1&{} 0&{} 0\\ 0&{} 0&{} 1&{} 1\\ 1&{} 0&{} 1&{} 0 \end{pmatrix}. \end{aligned}$$

Let us check that the marginal distribution of t depends on \(\psi (p)\) so that t is not ancillary. The marginal distribution of t is

$$\begin{aligned} f(t;p) = \sum _{x: Ax=t}\frac{p_{11}^{x_{11}}p_{12}^{x_{12}}p_{21}^{x_{21}}p_{22}^{x_{22}}}{x_{11}!x_{12}!x_{21}!x_{22}!}e^{-p_{++}}, \end{aligned}$$

where \(p_{++}=\sum _i\sum _j p_{ij}\). If \(t=(1,0,1)^\top \), then the only possible outcome is \(x=(1,0,0,0)\), and so \(f(t;p) = p_{11}e^{-p_{++}}\). This means \(p_{11}e^{-p_{++}}\) is identifiable from the marginal distribution. Furthermore, \(p_{++}\) is also identifiable since the marginal distribution of \(t_1+t_2=x_{++}\) is the Poisson distribution with mean \(p_{++}\). Therefore, \(p_{11}\) is identifiable. By symmetry, we need all \(p_{ij}\) to parameterize \(f(t;p)\).

These examples are generalized as follows. A proof is given in Appendix B.

Theorem 1

(Theorem 10.9 of [13]) Let x be distributed according to the independent Poisson model and A be a configuration. Suppose that the row space of A contains the all-one vector. Then Ax is ancillary for \(\psi =B\log p\) if and only if there exists an invertible matrix \(L\in {\mathbb {R}}^{d\times d}\) such that

$$\begin{aligned} LA = \begin{pmatrix} 1_{n_1}^\top &{}&{}\mathbf {0}\\ &{}\ddots &{}\\ \mathbf {0}&{}&{}1_{n_d}^\top \end{pmatrix}, \end{aligned}$$
(11)

where \(1_{n_i}=(1,\ldots ,1)^\top \in {\mathbb {N}}^{n_i}\) and \(n_1,\ldots ,n_d\in {\mathbb {N}}\) with \(n_1+\cdots +n_d=n\).

Corollary 1

(Chapter 2 of [1]) If A is of the form (11), then the conditional and unconditional maximum likelihood estimators coincide.

The theorem shows that Ax is not ancillary in the exact sense unless A corresponds to the product multinomial sampling scheme (11). In the next subsection, however, we demonstrate that other ancillary properties hold for any A.

4.2 Godambe’s and asymptotic ancillarity

The ancillarity of Ax in Godambe’s sense when A is the configuration of the \(2\times 2\) independence model was shown by Godambe himself in [33]. The following theorem is derived in a similar manner. See Appendix B for an outline of the proof.

Theorem 2

[19] Consider the independent Poisson model together with any configuration matrix A. Then Ax is ancillary in Godambe’s sense for \(\psi =B\log p\).

We now proceed to show the asymptotic ancillarity. Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np, where p is a fixed positive vector. Denote the n-dimensional normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \) by \(\mathrm {N}_n(\mu ,\Sigma )\). Convergence in distribution is denoted by \(\displaystyle \mathop {\rightarrow }^\mathrm{d}\). Let \(D_p\) be the diagonal matrix with diagonal entries \(p=(p_i)_{i=1}^n\). We begin with a standard result.

Lemma 6

(e.g. [22]) Let \({\hat{p}}=x^{(N)}/N\). Denote the unconditional maximum likelihood estimator of \((\psi ,\lambda )\) by \(({\hat{\psi }},{\hat{\lambda }})=(B\log {\hat{p}},A{\hat{p}})\). Then we have

$$\begin{aligned} N^{1/2}\left( {\begin{array}{c}{\hat{\psi }}-\psi \\ {\hat{\lambda }}-\lambda \end{array}}\right) \ \mathop {\rightarrow }^\mathrm{d}\ \mathrm {N}_n\left( \left( {\begin{array}{c}0\\ 0\end{array}}\right) ,\begin{pmatrix}BD_p^{-1}B^\top &{}0\\ 0&{}AD_pA^\top \end{pmatrix}\right) \end{aligned}$$

as \(N\rightarrow \infty \).

Proof

This is a consequence of the classical central limit theorem. Indeed, from the reproducibility of the Poisson distribution, \(N^{1/2}({\hat{p}}-p)\) weakly converges to \(\mathrm {N}_n(0,D_p)\). Apply the delta method to the function \(\phi (x)=(B\log x,Ax)\) (see, e.g., Chapter 3 of [36]) and use \(AB^\top =0\). \(\square \)

Since the asymptotic covariance of the maximum likelihood estimator is the inverse of the Fisher information matrix, we obtain from Lemma 6 that

$$\begin{aligned} {\bar{G}}^x = G_{N=1}^x = \begin{pmatrix} (BD_p^{-1}B^\top )^{-1}&{} 0\\ 0&{} (AD_pA^\top )^{-1} \end{pmatrix}. \end{aligned}$$
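For a fixed configuration and a chosen p, the two diagonal blocks of \({\bar{G}}^x\) are easy to evaluate. The following Python snippet (ours) does so for the \(2\times 2\) configuration of Example 4 with an arbitrary p.

```python
import numpy as np

A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]], dtype=float)
B = np.array([[1, -1, -1, 1]], dtype=float)
p = np.array([2.0, 1.0, 3.0, 1.5])  # an arbitrary positive mean vector
Dp = np.diag(p)

G_psi = np.linalg.inv(B @ np.linalg.inv(Dp) @ B.T)  # (B D_p^{-1} B^T)^{-1}, 1x1 block
G_lam = np.linalg.inv(A @ Dp @ A.T)                 # (A D_p A^T)^{-1}, 3x3 block
print(G_psi)
print(G_lam)
```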

To prove the asymptotic ancillarity of \(t^{(N)}=Ax^{(N)}\), we check the two conditions \({\bar{G}}_{\psi \psi }^x={\bar{G}}_{\psi \psi }^{x\mid t}\) and \({\bar{G}}_{\lambda \lambda }^x={\bar{G}}_{\lambda \lambda }^t\). The second condition immediately follows from the fact that \(t^{(N)}\) is a sufficient statistic for \(\lambda \). The first condition is also expected since \({\hat{\psi }}\) is asymptotically independent of \(t^{(N)}=N{\hat{\lambda }}\) by Lemma 6. In fact, the following result holds.

Theorem 3

[18] Let A be any configuration matrix and fix \(p\in {\mathbb {R}}_+^n\). Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np. Then the statistic \(Ax^{(N)}\) is asymptotically ancillary for \(\psi =B\log p\).

The proof of Theorem 3 is complicated due to the discrete nature of the conditioning variable. In Appendix B, we provide an outline of the proof with the help of Theorem 1.1 of [1] on the conditional central limit theorem; see also Theorem 4 of [2] for a refined result.