Abstract
The Poisson distribution is a fundamental tool in categorical data analysis. This paper reviews conditional inference for the independent Poisson model. It is noted that the conditioning variable is not an ancillary statistic in the exact sense except under the product multinomial sampling scheme, whereas two versions of the ancillary property hold in general. These ancillary properties justify the use of conditional inference, as first proposed by R. A. Fisher and subsequently discussed by many researchers. The mixed coordinate system developed in information geometry is emphasized as an effective language for describing these facts.
1 Introduction
Conditional distributions of the independent Poisson model play a central role in categorical data analysis. For example, the multinomial distribution is obtained from the Poisson model by conditioning on the total count. The hypergeometric distribution is likewise obtained from the Poisson model by conditioning on the marginal counts. These conditioning variables are written as Ax, where x is the Poisson random vector and A is an integer matrix. The conditional distribution of x given Ax is called the conditional Poisson model [1] or the A-hypergeometric distribution [2]. See Sect. 2 for a more precise definition. Inference on the Poisson mean parameters is usually performed on the basis of the conditional distribution. This type of inference is referred to as conditional inference.
The best-known example is Fisher’s exact test of independence for \(2\times 2\) contingency tables, where the p-value of a test statistic is computed under the conditional probability given marginal counts. Computation of the p-value for more complicated hypotheses is one of the central topics in algebraic statistics [3,4,5]. For parameter estimation, the conditional maximum likelihood estimator is defined as the maximizer of the conditional likelihood [6, 7]. A confidence interval of the parameter is constructed from the conditional distribution of the estimator. Exact computation of the estimator is difficult in most cases and has been recently investigated in an algebraic manner [2, 8,9,10].
One might ask, why is conditioning so important?
The main reason is that the conditional likelihood does not depend on the nuisance parameter, as we will see in Sect. 2. Here, the nuisance parameter in the current problem refers to the marginal distribution of Ax. The existence of a nuisance parameter is problematic if its dimension is high. This is known as the Neyman–Scott problem [11, 12]. Conditional inference is effective in such cases.
Another reason is that the conditional distribution removes the effect of the data sampling scheme. In \(2\times 2\) contingency tables, the data are collected under various sampling schemes: no constraints, given the total, given the row (or column) marginals or given all the marginals. The underlying distribution of x changes from the independent Poisson distribution according to the scheme. However, the conditional distribution given all the marginals is common to all cases.
As noted in the first reason above, the conditional likelihood carries no information about the nuisance parameter. This leads to a natural question: how much information on the parameter of interest remains in the marginal distribution of Ax? The goal of this paper is to review the answers to this question from the viewpoint of ancillarity.
In general, a statistic is said to be ancillary if its marginal distribution has no information about the parameter of interest. It would be desirable if our conditioning variable Ax were ancillary. Unfortunately, this is not true in general [13], as investigated for \(2\times 2\) tables by [14,15,16,17]. We recall this fact in Sect. 4. On the other hand, Ax is shown to be asymptotically ancillary in the sense of Liang [18], where a conditional limit theorem established by [1, 2] is essential. We also note the ancillary criterion proposed by Godambe [19], which focuses on the space of estimating functions. Usefully, the mixed coordinate system developed in information geometry is quite effective for describing these results [13, 20, 21]. This is a particular example of parameter orthogonality [22], under which conditional inference works well in general.
The remainder of the paper is organized as follows. The Poisson model, together with the mixed coordinate system, is introduced in Sect. 2. Section 3 reviews the ancillary properties of general statistical models. The ancillary properties of the Poisson model are summarized in Sect. 4. A self-contained description of the mixed coordinate system and theorem proofs are provided in Appendices A and B, respectively.
2 Conditional inference of Poisson models
2.1 Definition
Let \({\mathbb {N}}\) and \({\mathbb {R}}_+\) be the sets of non-negative integers and positive real numbers, respectively. Consider an independent Poisson model
$$f(x;p)=\frac{p^x}{x!}e^{-1_n^\top p},\qquad x\in {\mathbb {N}}^n,$$
with the mean parameter \(p=(p_i)_{i=1}^n\in {\mathbb {R}}_+^n\), where the multi-index notation is adopted as \(p^x=\prod _i p_i^{x_i}\), \(x!=\prod _i x_i!\), and \(1_n=(1,\ldots ,1)^\top \in {\mathbb {N}}^n\). This model is an exponential family with the natural parameter \(\log p=(\log p_i)\in {\mathbb {R}}^n\) and the expectation parameter \(p\in {\mathbb {R}}_+^n\). We can read p as a “point” in the geometric sense. The maximum likelihood estimator of p, which maximizes f(x; p) with respect to p, is \({\hat{p}}=x\) whenever x is positive.
Many statistical models for categorical data have the form (e.g., [8])
$$\log p=A^\top \alpha +g(\theta ),\qquad (1)$$
where \(A\in {\mathbb {N}}^{d\times n}\) is a given matrix, \(g:{\mathbb {R}}^q\rightarrow {\mathbb {R}}^n\) is a given smooth function, \(\alpha \in {\mathbb {R}}^d\) is a nuisance parameter, and \(\theta \in {\mathbb {R}}^q\) is a parameter of interest. The model is called a log-affine model if g is affine, and a log-linear model if g is linear. The model is said to be saturated if the map \((\alpha ,\theta )\mapsto p\) is surjective.
Example 1
(Poisson regression) Let \(A=(1,\ldots ,1)\in {\mathbb {N}}^{1\times n}\) and \(g(\theta )=D^\top \theta \), where \(D^\top \in {\mathbb {R}}^{n\times q}\) is a design matrix. Then \(\alpha \) is a baseline and \(\theta \) represents regression coefficients. This model is not saturated if \(n>q+1\).
We next show an “unconventional” use of the log-linear model.
Example 2
(Fisher’s iris data) Table 1 gives the number \(x_{ij}\) of cases of Iris setosa that have sepal length \(L_i\) and sepal width \(W_j\), where the length scales are \(\{L_i\}_{i=1}^{100}=\{W_j\}_{j=1}^{100}=\{0.1,0.2,\ldots ,10.0\}\) in centimeters. The contingency table has 100 rows and 100 columns; the total number of cases is \(N=50\). The table is very sparse, as only 39 of 10,000 cells have a non-zero entry. Consider the statistical model
where \(\alpha _i,\beta _j\) are nuisance parameters and \(\theta \in {\mathbb {R}}\) is the parameter of interest. We set \(\beta _1=0\) without loss of generality. Then the dimension of the nuisance parameter is 199. This model is written in the form of (1) with an integer matrix \(A\in {\mathbb {N}}^{199\times 10000}\) and \(g_{ij}(\theta )=\theta L_iW_j\). We will see that the conditional maximum likelihood estimate of \(\theta \) exists; see Example 6. This is an example of the minimum information dependence modeling recently proposed by [23].
The conditional distribution of x given \(t=Ax\) is
$$f(x\mid t;p)=\frac{1}{Z(t;p)}\frac{p^x}{x!},\qquad Ax=t,\qquad (2)$$
where \(Z(t;p) = \sum _{Ay=t} p^y/y!\) is the normalizing constant. Here, by abuse of notation, we use the same symbol f for the joint and conditional distributions.
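When the counts are small, the fiber \(\{y\mid Ay=t\}\) and the normalizing constant Z(t; p) can be computed by brute-force enumeration. The following sketch is our own illustrative code, not from the paper; the configuration A and the vector t are hypothetical choices (they fix both row sums and the first column sum of a \(2\times 2\) table, matching the setting of later examples).

```python
import math
from itertools import product

# Illustrative configuration: rows of A fix (x1+, x2+, x+1) of a 2x2 table.
A = [(1, 1, 0, 0),
     (0, 0, 1, 1),
     (1, 0, 1, 0)]

def fiber(t, cap=10):
    """Non-negative integer solutions y of A y = t (brute force)."""
    return [y for y in product(range(cap + 1), repeat=4)
            if all(sum(a * yi for a, yi in zip(row, y)) == tk
                   for row, tk in zip(A, t))]

def Z(t, p):
    """Normalizing constant Z(t; p) = sum_{Ay=t} p^y / y!."""
    return sum(math.prod(pi ** yi / math.factorial(yi)
                         for pi, yi in zip(p, y)) for y in fiber(t))

def cond_prob(x, p):
    """Conditional Poisson probability f(x | t = Ax; p) from (2)."""
    t = tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)
    num = math.prod(pi ** xi / math.factorial(xi) for pi, xi in zip(p, x))
    return num / Z(t, p)
```

For \(t=(3,2,3)\) the fiber contains exactly three tables, and the conditional probabilities sum to one by construction.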
Definition 1
The conditional distribution (2) is called the conditional Poisson model or the A-hypergeometric distribution. Matrix A is called a configuration. We assume that A is of full row rank (i.e., of rank d) and that the row space of A contains \(1_n\) unless otherwise stated.
Lemma 1
(Chapter 1 of [1]) Assume the model (1). Then the conditional distribution (2) does not depend on \(\alpha \). In other words, \(t=Ax\) is a sufficient statistic for \(\alpha \).
Proof
Under the model (1), we have \(p^x= e^{\alpha ^\top t + g(\theta )^\top x}\), where \(t=Ax\). The factor depending only on t is canceled out in (2). \(\square \)
The lemma states that the conditional distribution does not depend on the nuisance parameter. This is one reason that the conditional distribution is important, as stated in Sect. 1. We sometimes use \(f(x\mid t;\theta )\) rather than \(f(x\mid t;p)\).
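Lemma 1 can be checked numerically: shifting the nuisance parameter, i.e., replacing p by \(p\,e^{A^\top \delta }\) componentwise, leaves the conditional distribution unchanged. The sketch below is our own illustration (the configuration, t, and all numbers are hypothetical).

```python
import math
from itertools import product

# Check of Lemma 1: p -> p * exp(A^T delta) leaves f(x | t; p) unchanged.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
t = (3, 2, 3)

def conditional(p):
    """Conditional distribution over the fiber {y : Ay = t}."""
    fib = [y for y in product(range(4), repeat=4)
           if all(sum(a * yi for a, yi in zip(row, y)) == tk
                  for row, tk in zip(A, t))]
    w = [math.prod(pi ** yi / math.factorial(yi) for pi, yi in zip(p, y))
         for y in fib]
    total = sum(w)
    return {y: wy / total for y, wy in zip(fib, w)}

p = (0.7, 1.3, 2.1, 0.4)
delta = (0.5, -1.0, 2.0)
# i-th component of A^T delta is sum_k delta_k * A[k][i]
p_shift = tuple(pi * math.exp(sum(dk * row[i] for dk, row in zip(delta, A)))
                for i, pi in enumerate(p))
```

On the fiber, the factor \(e^{\delta ^\top Ax}=e^{\delta ^\top t}\) is constant and cancels in the normalization, so the two conditional distributions agree exactly.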
Example 3
(Continuation) The conditional distribution for the iris data in Example 2 is
$$f(x\mid t;\theta )=\frac{e^{\theta \sum _{ij}L_iW_jx_{ij}}/x!}{\sum _y e^{\theta \sum _{ij}L_iW_jy_{ij}}/y!},\qquad (3)$$
where y ranges over the tables that have the same marginals as x. All the nuisance parameters are canceled out.
2.2 Mixed coordinate system
For a given configuration \(A\in {\mathbb {N}}^{d\times n}\) with \(n>d\), choose a matrix \(B\in {\mathbb {Z}}^{(n-d)\times n}\) of rank \(n-d\) such that \(AB^\top =0\). Matrix B is called the Gale transform of A in the theory of convex polytopes (see [2]). We will use the following coordinate system of p:
$$\psi (p)=B\log p\in {\mathbb {R}}^{n-d},\qquad \lambda (p)=Ap\in {\mathbb {R}}^{d}.$$
The map \((\psi ,\lambda ):{\mathbb {R}}_+^n\rightarrow {\mathbb {R}}^{n-d}\times {\mathbb {R}}^d\) actually defines a coordinate system, which is called the mixed coordinate system in information geometry (see [13, 20]). It is known that \(\psi (p)\) and \(\lambda (p)\) are orthogonal with respect to the Fisher information metric. Moreover, the range of \((\psi ,\lambda )\) is written as \(\Omega _\psi \times \Omega _\lambda ={\mathbb {R}}^{n-d}\times A{\mathbb {R}}_+^n\). See Appendix A for details regarding these facts. The symbols \(\psi \) and \(\lambda \) are used in accordance with [16].
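The one-to-one correspondence between p and \((\psi (p),\lambda (p))\) can be demonstrated numerically in the \(2\times 2\) case, where \(\psi \) is the log-odds ratio and \(\lambda \) consists of margins. The code below is our own sketch (the recovery routine by bisection is an assumption of convenience, exploiting monotonicity of the odds ratio in \(p_{11}\) for fixed margins).

```python
import math

# Mixed coordinates for a 2x2 table: psi = B log p, lambda = (p1+, p2+, p+1).
B = (1.0, -1.0, -1.0, 1.0)

def to_mixed(p):
    psi = sum(b * math.log(pi) for b, pi in zip(B, p))  # log-odds ratio
    lam = (p[0] + p[1], p[2] + p[3], p[0] + p[2])       # margins A p
    return psi, lam

def from_mixed(psi, lam):
    """Recover p from (psi, lambda) by bisection on p11.

    For fixed margins, log(p11*p22/(p12*p21)) is increasing in p11."""
    r1, r2, c1 = lam
    lo, hi = max(0.0, c1 - r2), min(r1, c1)
    for _ in range(200):
        u = (lo + hi) / 2
        val = math.log(u * (r2 - c1 + u)) - math.log((r1 - u) * (c1 - u))
        lo, hi = (lo, u) if val > psi else (u, hi)
    u = (lo + hi) / 2
    return (u, r1 - u, c1 - u, r2 - c1 + u)
```

The round trip p → (ψ, λ) → p recovers the original point, illustrating that the mixed coordinates are indeed a coordinate system.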
Since p and \((\psi (p),\lambda (p))\) have a one-to-one correspondence, we have the following lemma.
Lemma 2
(Chapter 8 of [13]) Consider a log-linear model
$$\log p=A^\top \alpha +B^\top (BB^\top )^{-1}\psi .\qquad (4)$$
This model is saturated, and the nuisance parameter \(\alpha \) is one-to-one with \(\lambda (p)\) for given \(\psi \).
Proof
Since the row spaces of A and B span \({\mathbb {R}}^n\), (4) is saturated. It can be immediately seen that \(B\log p=\psi \) since \(BA^\top =0\). Then the one-to-one correspondence between \(\alpha \) and \(\lambda (p)\) for given \(\psi =\psi (p)\) follows from the correspondences \((\alpha ,\psi )\leftrightarrow p\) and \(p\leftrightarrow (\psi (p),\lambda (p))\). \(\square \)
From the lemma, \(\lambda (p)\) is considered as the nuisance parameter and \(\psi (p)\) is the parameter of interest. The A-hypergeometric distribution (2) is then
$$f(x\mid t;\psi )\propto \frac{e^{\psi ^\top (BB^\top )^{-1}Bx}}{x!},$$
where the normalizing constant is omitted. The quantity \(e^{\psi (p)}\in {\mathbb {R}}_+^{n-d}\) is called the generalized odds ratio in [2].
where the normalizing constant is omitted. The quantity \(e^{\psi (p)}\in {\mathbb {R}}_+^{n-d}\) is called the generalized odds ratio in [2].
Example 4
(\(2\times 2\) contingency table) Let \(p=(p_{11},p_{12},p_{21},p_{22})\) represent a \(2\times 2\) contingency table. In many applications, the log-odds ratio \(\psi (p)=\log (p_{11}p_{22}/p_{12}p_{21})\) is the parameter of interest. The marginal distributions \(p_{i+}=\sum _j p_{ij}\) and \(p_{+j}=\sum _ip_{ij}\) are nuisance. In this case, matrices A and B are
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\\ 1&0&1&0\end{pmatrix},\qquad B=\begin{pmatrix}1&-1&-1&1\end{pmatrix}.$$
Indeed, \(\psi (p)=B\log p=\log (p_{11}p_{22}/p_{12}p_{21})\) and \(\lambda (p)=Ap=(p_{1+},p_{2+},p_{+1})\). Note that Ap determines all the marginals because \(p_{+2}=p_{1+}+p_{2+}-p_{+1}\). The A-hypergeometric distribution is
$$f(x\mid t;\psi )=\frac{e^{\psi x_{11}}/x!}{\sum _{Ay=t}e^{\psi y_{11}}/y!},$$
where \(x=(x_{11},x_{12},x_{21},x_{22})^\top \). If we write \(x_{11}=k,x_{12}=t_1-k,x_{21}=t_3-k,x_{22}=t_2-t_3+k\) with \(k\in {\mathbb {N}}\), then we have a more familiar form of the noncentral hypergeometric distribution
$$f(k\mid t;\psi )=\frac{e^{\psi k}/\{k!\,(t_1-k)!\,(t_3-k)!\,(t_2-t_3+k)!\}}{\sum _l e^{\psi l}/\{l!\,(t_1-l)!\,(t_3-l)!\,(t_2-t_3+l)!\}}.$$
Fisher’s exact test examines the hypothesis \(\psi =0\) against \(\psi \ne 0\). The p-value is calculated on the basis of the hypergeometric distribution for \(\psi =0\).
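For \(\psi =0\) the noncentral hypergeometric distribution reduces to the ordinary hypergeometric distribution, and the exact p-value can be computed by enumerating the fiber. The sketch below is our own illustration; the two-sided convention used (summing probabilities of tables no more likely than the observed one) is one common choice among several.

```python
from math import comb

def fisher_exact_p(table):
    """Two-sided Fisher exact p-value for a 2x2 table: sum the hypergeometric
    probabilities of all tables with the same margins whose probability does
    not exceed that of the observed table (a common convention)."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):  # hypergeometric probability that cell (1,1) equals k
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    ks = range(max(0, c1 - r2), min(r1, c1) + 1)
    p_obs = prob(a)
    return sum(prob(k) for k in ks if prob(k) <= p_obs + 1e-12)
```

For the margins \(t=(3,2,3)\) used in the next example, the fiber has probabilities 3/10, 6/10, 1/10 at \(k=1,2,3\) under \(\psi =0\), so observing the most extreme table \(k=3\) gives p-value 0.1.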
Table 2 compares symbols used in several papers for ease of reference.
2.3 Conditional maximum likelihood estimator
The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi (p)=B\log p\) is defined as a maximizer of \(f(x\mid t;p)\) with respect to \(\psi \) for given x. The estimator is, in general, different from the unconditional maximum likelihood estimator \({\hat{\psi }}_\mathrm{MLE}=B\log x\), as the following example shows. A sufficient condition for the coincidence of the two estimators is provided in Sect. 4.
Example 5
(Continuation) Consider again Example 4. If \(t=(3,2,3)^\top \), then possible outcomes of x are
$$x(1)=\begin{pmatrix}1&2\\ 2&0\end{pmatrix},\qquad x(2)=\begin{pmatrix}2&1\\ 1&1\end{pmatrix},\qquad x(3)=\begin{pmatrix}3&0\\ 0&2\end{pmatrix}$$
in the form of contingency tables.
in the form of contingency tables. If the observation is \(x=x(2)\), the conditional likelihood is
which is maximized at \({\hat{\psi }}=(1/2)\log 3\). This value is different from the unconditional maximum likelihood estimate \({\hat{\psi }}_\mathrm{MLE}=B\log x(2)=\log 2\). If the observation is \(x=x(1)\) or \(x=x(3)\), the conditional and unconditional maximum likelihood estimates do not exist.
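The conditional maximum likelihood estimate of this example can be reproduced numerically. The sketch below is our own code: the fiber for \(t=(3,2,3)\) is parametrized by \(k=x_{11}\in \{1,2,3\}\) with weights \(e^{\psi k}/x!\), and a generic derivative-free maximizer (golden-section search) is applied.

```python
import math

# Weights e^{psi k} / x! on the fiber of Example 5, indexed by k = x11.
# x(1)! = 1!2!2!0! = 4, x(2)! = 2!1!1!1! = 2, x(3)! = 3!0!0!2! = 12.
FACTORIALS = {1: 4, 2: 2, 3: 12}

def cond_loglik(psi, k_obs=2):
    w = {k: math.exp(psi * k) / f for k, f in FACTORIALS.items()}
    return math.log(w[k_obs] / sum(w.values()))

def argmax(f, lo=-5.0, hi=5.0, iters=200):
    """Golden-section search for the maximizer of a unimodal function."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2
```

The maximizer agrees with the closed form \({\hat{\psi }}=(1/2)\log 3\), obtained by solving \(e^{2\psi }=3\).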
The conditional maximum likelihood estimator for unsaturated models is defined as the maximizer of \(f(x\mid t;\theta )\) as well. Below is an example.
Example 6
(Continuation) For the iris data in Example 2, the conditional maximum likelihood estimate is approximately \({\hat{\theta }}=12.6\), where the Markov chain Monte Carlo method with a Markov basis is used to solve the likelihood equation
$$\sum _{ij}L_iW_jx_{ij}=E_\theta \Bigl [\sum _{ij}L_iW_jy_{ij}\Bigm |t\Bigr ]$$
numerically. See [4, 5] for the details of Markov bases. Indeed, the denominator of (3) has approximately \(50!\approx 3\times 10^{64}\) terms, and an exact computation seems impossible. For the same data, we can also fit a Gaussian model \(f(L,W)\propto \exp (\alpha _1 L + \alpha _2 L^2+\beta _1 W + \beta _2 W^2+\theta LW)\) to the sepal length L and width W of I. setosa, from which we obtain \({\hat{\theta }}_\mathrm{Gauss}=12.4\). The two estimates are surprisingly close. See [23] for other examples.
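The iris computation itself is beyond a short example, but the Markov-basis idea can be sketched on the \(2\times 2\) fiber of Example 5, where the single basic move \((+1,-1,-1,+1)\) connects all tables with the given margins. The code below is our own toy Metropolis sampler (all names and the step count are illustrative); it estimates the conditional expectation of \(x_{11}\) at \(\psi =0\), whose exact value is \(0.3+1.2+0.3=1.8\).

```python
import math
import random

# Toy Markov-basis MCMC on 2x2 tables with margins t = (3,2,3), psi = 0.
MOVE = (1, -1, -1, 1)  # basic move for the 2x2 independence model

def weight(x, psi=0.0):
    """Unnormalized conditional Poisson weight e^{psi*x11} / x!."""
    if min(x) < 0:
        return 0.0
    return math.exp(psi * x[0]) / math.prod(math.factorial(xi) for xi in x)

def mcmc_mean_x11(x0, steps=20000, psi=0.0, seed=1):
    rng = random.Random(seed)
    x, total = list(x0), 0
    for _ in range(steps):
        eps = rng.choice((1, -1))
        prop = [xi + eps * m for xi, m in zip(x, MOVE)]
        # Metropolis acceptance; invalid tables have weight 0 and are rejected
        if rng.random() < min(1.0, weight(prop, psi) / weight(x, psi)):
            x = prop
        total += x[0]
    return total / steps
```

Since the move preserves \(Ax\), the chain stays on the fiber, and the ergodic average approximates the exact conditional mean.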
One important class of unsaturated models is the hierarchical model for contingency tables. Aoki et al. [25] applied the hierarchical model to an analysis of stratified educational data, which is further analyzed by [8]. For other examples of unsaturated models, we refer to [9] for the Gibbs random partition and [26] for exponential permutation models.
The existence and uniqueness of the conditional maximum likelihood estimator for log-affine models are summarized as follows.
Lemma 3
(e.g. [27]) Let A be a configuration. Consider a log-affine model
$$\log p=A^\top \alpha +D^\top \theta +\nu _0,$$
where \(D\in {\mathbb {R}}^{q\times n}\) is a matrix such that the rows of A and D are linearly independent, and \(\nu _0\in {\mathbb {R}}^n\) is a fixed vector. Then for a given observation x, the conditional maximum likelihood estimate \({\hat{\theta }}\) exists if and only if Dx lies in the interior of the convex hull of \(\{D y\mid Ay=t,\ y\in {\mathbb {N}}^n\}\). If the estimate exists, it is unique.
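In the one-dimensional case \(q=1\), the convex hull of \(\{Dy\mid Ay=t\}\) is an interval and the interior condition is easy to check. The sketch below (our own code) applies it to the fiber of Example 5 with D picking out the \((1,1)\) cell, recovering the observation there that the estimate exists only for \(x(2)\).

```python
from itertools import product

# Lemma 3 with q = 1: existence iff Dx is strictly between the extremes
# of Dy over the fiber. D = (1, 0, 0, 0) picks out the (1,1) cell.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]
t = (3, 2, 3)
fiber = [y for y in product(range(4), repeat=4)
         if all(sum(a * yi for a, yi in zip(row, y)) == tk
                for row, tk in zip(A, t))]
Dy = [y[0] for y in fiber]

def exists_cmle(x):
    return min(Dy) < x[0] < max(Dy)  # interior of the convex hull
```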
An extension of the maximum likelihood estimator that admits zero probabilities is discussed by [24, 27, 28]. See also [10] for issues of sufficient statistics caused by zero counts.
Although the conditional maximum likelihood estimator is mathematically characterized, its computation is not easy in most cases [29], as indicated by Example 6. Recently, exact computation of the estimator has been investigated in algebraic frameworks such as the holonomic gradient method. See [2, 8,9,10] for this direction. We do not discuss it here and focus on statistical properties in the sections that follow.
3 Ancillary statistics
In this section, we review the definitions and properties of ancillary statistics for general statistical models.
3.1 Exact sense
Consider a parametric model \(f(x; \psi ,\lambda )\) of probability density functions, where \(x\in \Omega _x\) denotes the data and \((\psi ,\lambda )\in \Omega _\psi \times \Omega _\lambda \) is a parameter. The parameter \(\psi \) is of interest and \(\lambda \) is a nuisance parameter. Throughout this section, we will assume that f is positive everywhere.
Definition 2
(Ancillary statistic; [30]) A statistic \(t=t(x)\) is said to be ancillary for \(\psi \) (in the exact sense) if the marginal density of t does not depend on \(\psi \) and the conditional density of x given t does not depend on \(\lambda \); that is,
$$f(x;\psi ,\lambda )=f(x\mid t;\psi )\,f(t;\lambda ).\qquad (5)$$
By abuse of notation, we use the same symbol f for the joint, conditional, and marginal densities.
If t is not given a priori, there exists an ambiguity in the choice of t [31]. We do not address this problem.
Example 7
Consider an independent and identically distributed sequence \(y_1,y_2,\ldots \) with the density function \(f(y_i;\psi )\). We observe the data up to a random time t so that the observed data are \(x=(y_1,\ldots ,y_t)\). Suppose that the probability mass function of t is \(f(t;\lambda )\). The likelihood function of x is
$$f(x;\psi ,\lambda )=f(t;\lambda )\prod _{i=1}^{t}f(y_i;\psi ),$$
which is of the form (5). Hence, t is ancillary for \(\psi \).
If t is ancillary for \(\psi \), then \(f(t;\lambda )\) has no information regarding \(\psi \). Thus, it is natural to use the conditional likelihood \(f(x\mid t;\psi )\) for inference on \(\psi \).
The conditional maximum likelihood estimator \({\hat{\psi }}\) of \(\psi \) based on \(f(x\mid t;\psi )\) coincides with the unconditional maximum likelihood estimator if t is ancillary. This immediately follows from Eq. (5). A controversial point may be which of the distributions, the conditional or unconditional sampling distribution of \({\hat{\psi }}\), should be used for interval estimation. The common practice is to use the conditional distribution since it does not depend on the nuisance parameter. The conditioning also avoids any unnecessary assumption concerning t for inferences of \(\psi \), which is Fisher’s original argument for conditional inference.
In the Bayesian method, any inference is derived from the posterior distribution. Suppose that the prior density is independent: \(\pi (\psi ,\lambda )=\pi (\psi )\pi (\lambda )\). Then, under the assumption that \(t=t(x)\) is ancillary, the posterior density is decomposed as
$$\pi (\psi ,\lambda \mid x)\propto f(x\mid t;\psi )\pi (\psi )\,f(t;\lambda )\pi (\lambda ),$$
which implies \(\pi (\psi ,\lambda \mid x)=\pi (\psi \mid x)\pi (\lambda \mid t)\) and \(\pi (\psi \mid x)\propto f(x\mid t;\psi )\pi (\psi )\). Hence, the inference on \(\psi \) is the same as that based on the conditional model.
3.2 Godambe’s sense
We provide a version of ancillarity introduced by [19]. Regularity conditions are not mentioned explicitly; see [19, 32, 33] for details.
Suppose that t is sufficient for \(\lambda \); that is,
$$f(x;\psi ,\lambda )=f(x\mid t;\psi )\,f(t;\psi ,\lambda ).$$
The Poisson model satisfies this condition (see Lemma 2).
If our interest is to estimate the parameter \(\psi \), it is reasonable to consider an estimating equation of the form
$$g(x;\psi )=0,$$
where \(g(x;\psi )\) is a vector-valued function referred to as an estimating function. For example, the estimating function providing the conditional maximum likelihood estimator is \(g(x;\psi )=\partial _\psi \log f(x\mid t;\psi )\). Note that different estimating functions may define the same estimator.
The estimating function is often assumed to be unbiased. That is,
$$\sum _x f(x;\psi ,\lambda )g(x;\psi )=0$$
for any \((\psi ,\lambda )\), because it implies consistency and asymptotic normality of the estimator in typical problems. Here, the sums need to be replaced with integrals for general sample spaces. The unbiasedness condition is rewritten as
$$\sum _t f(t;\psi ,\lambda )\sum _{x:\,t(x)=t}f(x\mid t;\psi )g(x;\psi )=0.\qquad (7)$$
Definition 3
(Godambe’s ancillarity) Suppose that \(t=t(x)\) is a sufficient statistic for \(\lambda \). Then t is said to be ancillary in Godambe’s sense (or in a complete sense) if t is a complete sufficient statistic for the marginal model \(\{f(t;\psi ,\lambda )\mid \lambda \in \Omega _\lambda \}\) for each fixed \(\psi \), where completeness means that a functional equation \(\sum _t f(t;\psi ,\lambda )h(t;\psi )=0\) for an integrable function h has only the trivial solution \(h=0\).
This definition is not an extension of the exact ancillarity defined in the preceding subsection. Indeed, even if \(f(t;\psi ,\lambda )\) does not depend on \(\psi \), the statistic t is not ancillary in Godambe’s sense unless t is complete.
Suppose that t is ancillary in Godambe’s sense and that \(g(x;\psi )\) is an unbiased estimating function. Then, from (7) and the completeness of t, we can deduce
$$\sum _{x:\,t(x)=t}f(x\mid t;\psi )g(x;\psi )=0\qquad (8)$$
for any \(\psi \), which means \(g(x;\psi )\) is also an unbiased estimating function with respect to the conditional density. Therefore, we can reduce the class of estimating functions to those that are unbiased with respect to the conditional model.
for any \(\psi \), which means \(g(x;\psi )\) is also an unbiased estimating function with respect to the conditional density. Therefore, we can reduce the class of estimating functions to those that are unbiased with respect to the conditional model.
The conditional maximum likelihood estimator is optimal in the sense of the following lemma, which is slightly modified from the original theorem of [19] for simplicity. We use \(M\succeq N\) for matrices M and N if \(M-N\) is positive semi-definite.
Lemma 4
[19] Let t be ancillary in Godambe’s sense. Then, for any unbiased estimating function g, the following inequality holds:
$$\bigl (E[\partial _\psi g^\top ]\bigr )^{-\top }E[gg^\top ]\bigl (E[\partial _\psi g^\top ]\bigr )^{-1}\succeq \bigl (E[ss^\top ]\bigr )^{-1},$$
where \(s=\partial _\psi \log f(x\mid t;\psi )\) and E denotes expectation with respect to \(f(x;\psi ,\lambda )\). The equality is attained if \(g=s\).
Proof
The proof is similar to that of the Cramér–Rao inequality. By differentiating the unbiasedness condition (8) and taking expectation with respect to t, we obtain \(E[sg^\top + \partial _\psi g^\top ] = 0\). The Cauchy–Schwarz inequality yields \((a^\top E[\partial _\psi g^\top ]b)^2\le (a^\top E[ss^\top ]a)(b^\top E[gg^\top ]b)\) for any vectors a and b. Put \(a=E[ss^\top ]^{-1}v\) and \(b=E[\partial _\psi g^\top ]^{-1}v\) to obtain the desired inequality. If \(g=s\), then \(E[ss^\top ]=-E[\partial _\psi s^\top ]\) and the equality follows. \(\square \)
Example 8
Consider a normal model \(x_i \sim \mathrm {N}(\lambda ,\psi ^2)\), \(1\le i\le n\), where \(\lambda \) is a nuisance and \(\psi \) is of interest. The mean statistic \({\bar{x}}=\sum _{i=1}^n x_i/n\) is sufficient for \(\lambda \), but not ancillary for \(\psi \) in the exact sense because the marginal distribution \(\mathrm {N}(\lambda ,\psi ^2/n)\) depends on \(\psi \). However, \({\bar{x}}\) is ancillary in Godambe’s sense. Indeed, the completeness of \({\bar{x}}\) on \(\lambda \) follows from a general fact regarding the exponential families (see Theorem 4.3.1 of [34]). The conditional maximum likelihood estimator is shown to be the sample variance \({\hat{\psi }}^2=(n-1)^{-1}\sum _i(x_i-{\bar{x}})^2\).
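The last claim of Example 8 can be checked numerically: conditionally on \({\bar{x}}\), the data live in an \((n-1)\)-dimensional Gaussian of the residuals, so the conditional log-likelihood in \(\psi \) is \(-(n-1)\log \psi -S/(2\psi ^2)\) with \(S=\sum _i(x_i-{\bar{x}})^2\). The sketch below is our own code with illustrative data; a grid search recovers the \((n-1)\)-divisor sample variance.

```python
import math

# Conditional MLE of psi given xbar for the normal model of Example 8.
x = [1.2, -0.7, 3.1, 0.4, 2.2]   # illustrative data
n = len(x)
xbar = sum(x) / n
S = sum((xi - xbar) ** 2 for xi in x)

def cond_loglik(psi):
    # log f(x | xbar; psi): (n-1)-dimensional Gaussian in the residuals
    return -(n - 1) * math.log(psi) - S / (2 * psi ** 2)

# Grid search over psi in (0, 5] with step 0.001.
best = max((i / 1000 for i in range(1, 5001)), key=cond_loglik)
```

The maximizer squared matches \(S/(n-1)\) up to the grid resolution.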
3.3 Asymptotic sense
We define asymptotic ancillarity according to [18]. Regularity conditions are not mentioned explicitly; see [18] for details.
For any statistical model \(f(x;\theta )\) and statistic \(t=t(x)\), the Fisher information metric tensor and its conditional counterpart are defined by
$$G^x(\theta )=\sum _x f(x;\theta )\bigl (\partial _\theta \log f(x;\theta )\bigr )\bigl (\partial _\theta \log f(x;\theta )\bigr )^\top ,\qquad G^{x\mid t}(\theta )=\sum _x f(x;\theta )\bigl (\partial _\theta \log f(x\mid t;\theta )\bigr )\bigl (\partial _\theta \log f(x\mid t;\theta )\bigr )^\top ,$$
respectively. The sums need to be replaced with integrals for general sample spaces. The Fisher information metric quantifies relative changes in the probability density against changes in the parameter. The decomposition
$$G^x=G^{x\mid t}+G^t$$
holds in general. The following lemma characterizes the ancillary property in terms of the Fisher information.
holds in general. The following lemma characterizes the ancillary property in terms of the Fisher information.
Lemma 5
(Chapter 7 of [35]) A statistic t is ancillary for \(\psi \) (in the exact sense) if and only if
where \(G_{\psi \psi }^{x\mid t}\) denotes the \(\psi \)-components of the matrix \(G^{x\mid t}\) and so on.
Proof
The only if part is straightforward. Conversely, assume (9). Then \(G_{\psi \psi }^t=G_{\psi \psi }^x-G_{\psi \psi }^{x\mid t}=0\) and \(G_{\lambda \lambda }^{x\mid t}=G_{\lambda \lambda }^x-G_{\lambda \lambda }^t=0\). From the definition of Fisher information, we deduce that \(\partial _\psi \log f(t)=0\) and \(\partial _\lambda \log f(x\mid t)=0\). Therefore, t is ancillary for \(\psi \). \(\square \)
Let us now consider a sequence of statistical models \(f_N(x;\psi ,\lambda )\) indexed by N. Let \(G_N^x\) and \(G_N^{x\mid t}\) be the Fisher information with respect to \(f_N\). Define the asymptotic Fisher information and its conditional counterpart by
$$\bar{G}^x(\theta )=\lim _{N\rightarrow \infty }\frac{G_N^x(\theta )}{N},\qquad \bar{G}^{x\mid t}(\theta )=\lim _{N\rightarrow \infty }\frac{G_N^{x\mid t}(\theta )}{N},$$
whenever they exist. We assume \({\bar{G}}^x(\theta )\) is positive definite to avoid uninteresting cases.
Definition 4
[18] A statistic \(t=t(x)\) is said to be asymptotically ancillary for \(\psi \) if the following relation holds:
$$\bar{G}_{\psi \psi }^{x\mid t}=\bar{G}_{\psi \psi }^{x}\quad \text {and}\quad \bar{G}_{\lambda \lambda }^{t}=\bar{G}_{\lambda \lambda }^{x}.$$
Example 9
(Continuation of Example 8) Let \(x_i\sim \mathrm {N}(\lambda ,\psi ^2)\) for \(1\le i\le N\). The Fisher information matrices of \((\psi ,\lambda )\) on \(x=(x_1,\ldots ,x_N)\) and \(t={\bar{x}}\) are
$$G^x=N\begin{pmatrix}2/\psi ^2&0\\ 0&1/\psi ^2\end{pmatrix},\qquad G^t=\begin{pmatrix}2/\psi ^2&0\\ 0&N/\psi ^2\end{pmatrix},$$
respectively. The asymptotic Fisher information matrices are
$$\bar{G}^x=\begin{pmatrix}2/\psi ^2&0\\ 0&1/\psi ^2\end{pmatrix},\qquad \bar{G}^{x\mid t}=\begin{pmatrix}2/\psi ^2&0\\ 0&0\end{pmatrix},\qquad \bar{G}^t=\begin{pmatrix}0&0\\ 0&1/\psi ^2\end{pmatrix}.$$
Thus, t is asymptotically ancillary because of the identity \({\bar{G}}^x={\bar{G}}^{x\mid t}+{\bar{G}}^t\).
4 Ancillary statistics of the Poisson model
Consider again the independent Poisson model with a configuration matrix A and its Gale transform B. The parameter of interest is \(\psi (p)=B\log p\) and the nuisance parameter is \(\lambda (p)=Ap\).
4.1 Conditions for exact ancillarity
We first give an example of ancillary statistics of the independent Poisson model.
Example 10
(Product binomial sampling) Consider a \(2\times 2\) contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \) and the configuration
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\end{pmatrix}.$$
Let \(t=Ax=(x_{1+},x_{2+})^\top \). Then the marginal and conditional distributions are
$$f(t;\lambda )=\prod _{i=1}^{2}\frac{\lambda _i^{t_i}e^{-\lambda _i}}{t_i!},\qquad f(x\mid t;\psi )=\prod _{i=1}^{2}\binom{t_i}{x_{i1}}\left( \frac{e^{\psi _i}}{1+e^{\psi _i}}\right) ^{x_{i1}}\left( \frac{1}{1+e^{\psi _i}}\right) ^{x_{i2}},$$
respectively, where \(\psi =(\psi _1,\psi _2)=(\log (p_{11}/p_{12}),\log (p_{21}/p_{22}))\) and \(\lambda =(\lambda _1,\lambda _2)=(p_{1+},p_{2+})\). Hence, t is ancillary for \(\psi \). This conditioning scheme is called the product binomial sampling because it is the product of binomial distributions (see, e.g., [14]).
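The exact ancillarity of the row sums can be checked numerically: two mean vectors with equal row sums but different within-row odds give identical marginals for t, which is the product of Poisson distributions with means \(\lambda _i=p_{i+}\) (a convolution identity). The code below is our own sketch with illustrative numbers.

```python
import math

# Marginal of t = (x1+, x2+) depends only on the row sums lambda = A p.
def marginal_t(p, t):
    p11, p12, p21, p22 = p

    def row_prob(a, b, s):
        # P(x_i1 + x_i2 = s) for independent Poisson(a), Poisson(b)
        return sum(a ** k / math.factorial(k)
                   * b ** (s - k) / math.factorial(s - k)
                   for k in range(s + 1)) * math.exp(-(a + b))

    return row_prob(p11, p12, t[0]) * row_prob(p21, p22, t[1])

p = (0.5, 1.5, 2.0, 1.0)  # row sums (2.0, 3.0)
q = (1.2, 0.8, 0.5, 2.5)  # same row sums, different within-row odds
```

The two marginals agree, and both equal the product-Poisson value with means 2 and 3, illustrating that t carries no information about \(\psi \).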
One may expect that the conditioning variable Ax for any configuration A is ancillary for \(\psi =B\log p\). However, this is not true in general, as the following example shows.
Example 11
(Example 2 of [15]) For the contingency table \(x=(x_{11},x_{12},x_{21},x_{22})^\top \), let \(t=Ax=(x_{1+},x_{2+},x_{+1})^\top \), where
$$A=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\\ 1&0&1&0\end{pmatrix}.$$
Let us check that the marginal distribution of t depends on \(\psi (p)\) so that t is not ancillary. The marginal distribution of t is
$$f(t;p)=\sum _{Ax=t}\frac{p^x}{x!}e^{-p_{++}}=e^{-p_{++}}Z(t;p),$$
where \(p_{++}=\sum _i\sum _j p_{ij}\). If \(t=(1,0,1)^\top \), then the possible outcome is \(x=(1,0,0,0)\) only, and so \(f(t;p) = p_{11}e^{-p_{++}}\). This means \(p_{11}e^{-p_{++}}\) is identifiable from the marginal distribution. Furthermore, \(p_{++}\) is also identifiable since the marginal distribution of \(t_1+t_2=x_{++}\) is the Poisson distribution with mean \(p_{++}\). Therefore, \(p_{11}\) is identifiable. By symmetry, we need all \(p_{ij}\) to parameterize f(t; p).
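The dependence on \(\psi \) can also be seen numerically: two mean vectors with identical margins Ap but different odds ratios assign different probabilities to \(t=(1,0,1)^\top \). The code below is our own illustration (the brute-force cap suffices because the counts are at most one here).

```python
import math
from itertools import product

# P(t) for the independent Poisson model, by enumerating {x : Ax = t}.
A = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]

def marginal_t(p, t, cap=2):
    total = 0.0
    for x in product(range(cap + 1), repeat=4):
        if all(sum(a * xi for a, xi in zip(row, x)) == tk
               for row, tk in zip(A, t)):
            total += math.prod(pi ** xi * math.exp(-pi) / math.factorial(xi)
                               for pi, xi in zip(p, x))
    return total

p = (1.0, 1.0, 1.0, 1.0)  # odds ratio 1 (psi = 0)
q = (1.5, 0.5, 0.5, 1.5)  # same margins Ap, odds ratio 9
```

Only \(x=(1,0,0,0)\) is compatible with \(t=(1,0,1)^\top \), so \(f(t;p)=p_{11}e^{-p_{++}}\); the two probabilities \(e^{-4}\) and \(1.5e^{-4}\) differ, confirming non-ancillarity.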
These examples are generalized as follows. A proof is given in Appendix B.
Theorem 1
(Theorem 10.9 of [13]) Let x be distributed according to the independent Poisson model and A be a configuration. Suppose that the row space of A contains the all-one vector. Then Ax is ancillary for \(\psi =B\log p\) if and only if there exists an invertible matrix \(L\in {\mathbb {R}}^{d\times d}\) such that
$$A=L\begin{pmatrix}1_{n_1}^\top & & \\ & \ddots & \\ & & 1_{n_d}^\top \end{pmatrix},\qquad (11)$$
where \(1_{n_i}=(1,\ldots ,1)^\top \in {\mathbb {N}}^{n_i}\) and \(n_1,\ldots ,n_d\in {\mathbb {N}}\) with \(n_1+\cdots +n_d=n\).
Corollary 1
(Chapter 2 of [1]) If A is of the form (11), then the conditional and unconditional maximum likelihood estimators coincide.
The theorem shows that Ax is not ancillary in the exact sense unless A is the product multinomial sampling scheme. In the subsequent subsection, however, we demonstrate that other ancillary properties hold for any A.
4.2 Godambe’s and asymptotic ancillarity
The ancillarity of Ax in Godambe’s sense when A is the \(2\times 2\) independence model is shown by Godambe himself in [33]. The following theorem is derived in a similar manner. See Appendix B for an outline of the proof.
Theorem 2
[19] Consider the independent Poisson model together with any configuration matrix A. Then Ax is ancillary in Godambe’s sense for \(\psi =B\log p\).
We will proceed to show the asymptotic ancillarity. Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np, where p is a fixed positive vector. Denote the n-dimensional normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \) by \(\mathrm {N}_n(\mu ,\Sigma )\). Convergence in distribution is denoted as \(\displaystyle \mathop {\rightarrow }^\mathrm{d}\). Let \(D_p\) be the diagonal matrix with diagonal entries \(p=(p_i)_{i=1}^n\). We begin with a standard result.
Lemma 6
(e.g. [22]) Let \({\hat{p}}=x^{(N)}/N\). Denote the unconditional maximum likelihood estimator of \((\psi ,\lambda )\) by \(({\hat{\psi }},{\hat{\lambda }})=(B\log {\hat{p}},A{\hat{p}})\). Then we have
$$\sqrt{N}\begin{pmatrix}{\hat{\psi }}-\psi \\ {\hat{\lambda }}-\lambda \end{pmatrix}\mathop {\rightarrow }^\mathrm{d}\mathrm {N}_n\left( 0,\begin{pmatrix}BD_p^{-1}B^\top &0\\ 0&AD_pA^\top \end{pmatrix}\right) $$
as \(N\rightarrow \infty \).
Proof
This is a consequence of the classical central limit theorem. Indeed, from the reproducibility of the Poisson distribution, \(N^{1/2}({\hat{p}}-p)\) weakly converges to \(\mathrm {N}_n(0,D_p)\). Apply the delta method to the function \(\phi (x)=(B\log x,Ax)\) (see, e.g., Chapter 3 of [36]) and use \(AB^\top =0\). \(\square \)
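The block-diagonal limit rests on the vanishing cross term: the delta method gives asymptotic cross covariance \((BD_p^{-1})D_p A^\top =BA^\top =0\). The sketch below (our own code, with an illustrative p and the \(2\times 2\) configuration) checks this numerically.

```python
# Cross covariance of (psi-hat, lambda-hat): (B D_p^{-1}) D_p A^T = B A^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]
p = [0.7, 1.3, 2.1, 0.4]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

grad_psi = [[b / pi for b, pi in zip(B[0], p)]]          # B D_p^{-1}
Dp = [[p[i] if i == j else 0.0 for j in range(4)] for i in range(4)]
At = [[A[j][i] for j in range(3)] for i in range(4)]      # A^T
cross = matmul(matmul(grad_psi, Dp), At)                  # should be 0
```

The cross term vanishes for any positive p, which is exactly the asymptotic independence of \({\hat{\psi }}\) and \({\hat{\lambda }}\) used below.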
Since the asymptotic covariance of the maximum likelihood estimator is the inverse of the Fisher information matrix, we obtain from Lemma 6 that
$$\bar{G}^{x}=\begin{pmatrix}(BD_p^{-1}B^\top )^{-1}&0\\ 0&(AD_pA^\top )^{-1}\end{pmatrix}$$
in the mixed coordinates \((\psi ,\lambda )\).
To prove the asymptotic ancillarity of \(t^{(N)}=Ax^{(N)}\), we check the two conditions \({\bar{G}}_{\psi \psi }^x={\bar{G}}_{\psi \psi }^{x\mid t}\) and \({\bar{G}}_{\lambda \lambda }^x={\bar{G}}_{\lambda \lambda }^t\). The second condition immediately follows from the fact that \(t^{(N)}\) is a sufficient statistic for \(\lambda \). The first condition is also expected since \({\hat{\psi }}\) is asymptotically independent of \(t^{(N)}=N{\hat{\lambda }}\) by Lemma 6. In fact, the following result holds.
Theorem 3
[18] Let A be any configuration matrix and fix \(p\in {\mathbb {R}}_+^n\). Suppose that \(x^{(N)}\) has the independent Poisson distribution with the mean vector Np. Then the statistic \(Ax^{(N)}\) is asymptotically ancillary for \(\psi =B\log p\).
The proof of Theorem 3 is complicated due to the discrete nature of the conditioning variable. In Appendix B, we provide an outline of the proof with the help of Theorem 1.1 of [1] on the conditional central limit theorem; see also Theorem 4 of [2] for a refined result.
References
Haberman, S.J.: The Analysis of Frequency Data: Statistical Research Monographs. University of Chicago Press, Chicago (1974)
Takayama, N., Kuriki, S., Takemura, A.: \(A\)-Hypergeometric distributions and Newton polytopes. Adv. Appl. Math. 99, 109–133 (2018)
Diaconis, P., Sturmfels, B.: Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 26(1), 363–397 (1998)
Hibi, T.: Gröbner Bases: Statistics and Software Systems. Springer, Tokyo (2013)
Aoki, S., Hara, H., Takemura, A.: Markov Bases in Algebraic Statistics. Springer, New York (2012)
Harkness, W.: Properties of the extended hypergeometric distribution. Ann. Math. Stat. 36(3), 938–945 (1965)
Plackett, R.L.: The Analysis of Categorical Data, 2nd edn. Griffin, London (1981)
Ogawa, M.: Algebraic statistical methods for conditional inference of discrete statistical models. Ph.D. thesis, The University of Tokyo (2014)
Mano, S.: Partition structure and the \(A\)-hypergeometric distribution associated with the rational normal curve. Electron. J. Stat. 11, 4452–4487 (2017)
Tachibana, Y., Goto, Y., Koyama, T., Takayama, N.: Holonomic gradient method for two-way contingency tables. Algebraic Stat. 11(2), 125–153 (2020)
Neyman, J., Scott, E.L.: Consistent estimates based on partially consistent observations. Econometrica 16, 1–32 (1948)
Amari, S.: Information Geometry and its Applications. Springer, Tokyo (2016)
Barndorff-Nielsen, O.: Information and Exponential Families: in Statistical Theory. Wiley, New York (1978)
Little, R.J.A.: Testing the equality of two independent binomial proportions. Am. Stat. 43(4), 283–288 (1989)
Zhu, Y., Reid, N.: Information, ancillarity, and sufficiency in the presence of nuisance parameters. Can. J. Stat. 22(1), 111–123 (1994)
Reid, N.: The roles of conditioning in inference. Stat. Sci. 10(2), 138–157 (1995)
Choi, L., Blume, J.D., Dupont, W.D.: Elucidating the foundations of statistical inference with 2 × 2 tables. PLoS One 10(4), e0121263 (2015)
Liang, K.-Y.: The asymptotic efficiency of conditional likelihood methods. Biometrika 71(2), 305–313 (1984)
Godambe, V.P.: On ancillarity and Fisher information in the presence of a nuisance parameter. Biometrika 71(3), 626–629 (1984)
Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000)
Amari, S.: Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory 47(5), 1701–1711 (2001)
Cox, D.R., Reid, N.: Parameter orthogonality and approximate conditional inference. J. R. Stat. Soc. Ser. B (Methodol.) 49(1), 1–18 (1987)
Sei, T., Yano, K.: Minimum information dependence modeling. (2022). arXiv:2206.06792
Fienberg, S.E., Rinaldo, A.: Maximum likelihood estimation in log-linear models. Ann. Stat. 40(2), 996–1023 (2012)
Aoki, S., Otsu, T., Takemura, A., Numata, Y.: Statistical analysis of subject selection data in NCUEE examination. Ouyou Toukeigaku 39, 71–100 (2010)
Mukherjee, S.: Estimation in exponential families on permutations. Ann. Stat. 44(2), 853–875 (2016)
Rinaldo, A., Fienberg, S.E., Zhou, Y.: On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 3, 446–484 (2009)
Csiszár, I., Matúš, F.: Generalized maximum likelihood estimates for exponential families. Probab. Theory Relat. Fields 141(1), 213–246 (2008)
Agresti, A.: A survey of exact inference for contingency tables. Stat. Sci. 7(1), 131–153 (1992)
Cox, D.R., Hinkley, D.V.: Theoretical Statistics. Chapman and Hall/CRC, Boca Raton (1974)
Basu, D.: Recovery of ancillary information. Sankhyā Indian J. Stat. Ser. A 26(1), 3–16 (1964)
Godambe, V.P.: Conditional likelihood and unconditional optimum estimating equations. Biometrika 63(2), 277–284 (1976)
Godambe, V.P.: On sufficiency and ancillarity in the presence of a nuisance parameter. Biometrika 67(1), 155–162 (1980)
Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, New York (2005)
Amari, S.: Differential-Geometrical Methods in Statistics. Springer, Berlin (1985)
van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Csiszár, I.: I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 3(1), 146–158 (1975)
Acknowledgements
The author is grateful to the co-editor and two anonymous referees for their careful reading and insightful suggestions. He also thanks Mitsunori Ogawa for providing his PhD thesis and helpful comments, and Keisuke Yano for fruitful discussions. This work was supported by JSPS KAKENHI Grant numbers JP21K11781 and JP19K11865, and by JST CREST Grant number JPMJCR1763, Japan.
Ethics declarations
Data availability
The dataset analyzed during the current study is available in the Comprehensive R Archive Network (CRAN), https://cran.r-project.org.
Conflict of interest
The corresponding author states that there is no conflict of interest.
Additional information
Communicated by Shinto Eguchi.
Appendices
Appendix A: Mixed coordinate system
Denote the independent Poisson distribution by \(f(x;p)=(p^x/x!)e^{-1_n^\top p}\) for \(x\in {\mathbb {N}}^n\) and \(p\in {\mathbb {R}}_+^n\). The Kullback–Leibler divergence from f(x; p) to f(x; q) is
\( D(p,q)=\sum _{i=1}^n\left( p_i\log \frac{p_i}{q_i}-p_i+q_i\right) . \)
Let A and B be the matrices introduced in Sect. 2. Define two sets
\( {\mathcal {M}}(p)=\{q\in {\mathbb {R}}_+^n\mid Aq=Ap\} \)
and
\( {\mathcal {E}}(p)=\{q\in {\mathbb {R}}_+^n\mid B\log q=B\log p\}. \)
Lemma 7
(Pythagorean theorem; [37]) Let \(p,q,r\in {\mathbb {R}}_+^n\). If \(q\in {\mathcal {M}}(p)\cap {\mathcal {E}}(r)\), then
\( D(p,r)=D(p,q)+D(q,r). \)
In particular, \({\mathcal {M}}(p)\cap {\mathcal {E}}(p)=\{p\}\) for each p.
Proof
Let \(p=q+B^\top \beta \) and \(r=qe^{A^\top \alpha }\). The first assertion follows from
\( D(p,r)-D(p,q)-D(q,r)=(p-q)^\top (\log q-\log r)=-\beta ^\top BA^\top \alpha =0, \)
where the last equality uses \(AB^\top =0\).
Next, choose any \({\tilde{p}}\in {\mathcal {M}}(p)\cap {\mathcal {E}}(p)\). Then we have \(D(p,p)=D(p,{\tilde{p}})+D({\tilde{p}},p)\) by the Pythagorean relation, which implies \(p={\tilde{p}}\) by positive definiteness of D. \(\square \)
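The Pythagorean relation can be checked numerically. The following sketch uses a hypothetical minimal choice of A (margins of a \(2\times 2\) table) and B (the log-odds-ratio direction) satisfying \(AB^\top =0\); these are illustrative and are not the matrices of Sect. 2.

```python
import math

# Numerical check of D(p,r) = D(p,q) + D(q,r) for q in M(p) ∩ E(r).
# A, B are an illustrative 2x2-table choice with A B^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]

def kl(p, q):
    """Kullback-Leibler divergence between independent Poisson mean vectors."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

q = [2.0, 3.0, 1.5, 4.0]
beta, alpha = [0.3], [0.2, -0.1, 0.4]
# p = q + B^T beta, hence Ap = Aq and q lies in M(p)
p = [qi + B[0][i] * beta[0] for i, qi in enumerate(q)]
# r = q * exp(A^T alpha), hence B log q = B log r and q lies in E(r)
r = [qi * math.exp(sum(A[k][i] * alpha[k] for k in range(3)))
     for i, qi in enumerate(q)]

gap = kl(p, r) - kl(p, q) - kl(q, r)   # vanishes because A B^T = 0
```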
Since \(B\log p\) defines \({\mathcal {E}}(p)\) and Ap defines \({\mathcal {M}}(p)\), the lemma implies that the map \(p\mapsto (B\log p,Ap)\) is injective. Therefore, the mixed coordinate system is well defined.
Next, we show the orthogonality of the two manifolds \({\mathcal {E}}(p)\) and \({\mathcal {M}}(p)\) at \(p\in {\mathbb {R}}_+^n\) with respect to the Fisher information metric
\( g_p(u,v)=u^\top D_p^{-1}v,\quad u,v\in T_p{\mathbb {R}}_+^n, \)
where \(D_p\) is the diagonal matrix with diagonal vector p and \(T_p\) denotes the tangent space. A tangent vector in \(T_p{\mathcal {E}}(p)\) is written as \(D_pA^\top \alpha \) for some \(\alpha \in {\mathbb {R}}^d\) and that in \(T_p{\mathcal {M}}(p)\) is written as \(B^\top \beta \) for some \(\beta \in {\mathbb {R}}^{n-d}\). So we have \(g_p(D_pA^\top \alpha ,B^\top \beta ) = \alpha ^\top AB^\top \beta = 0\).
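This orthogonality can be verified with a short computation, again with a hypothetical \(2\times 2\)-table choice of A and B satisfying \(AB^\top =0\):

```python
# Tangent vectors D_p A^T alpha (of E(p)) and B^T beta (of M(p)) are
# orthogonal under g_p(u, v) = u^T D_p^{-1} v; A, B are an illustrative
# 2x2-table choice with A B^T = 0.
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
B = [[1, -1, -1, 1]]
p = [2.0, 3.0, 1.5, 4.0]
alpha, beta = [0.7, -0.2, 1.1], [0.5]

u = [p[i] * sum(A[k][i] * alpha[k] for k in range(3)) for i in range(4)]  # D_p A^T alpha
v = [B[0][i] * beta[0] for i in range(4)]                                 # B^T beta
g = sum(u[i] * v[i] / p[i] for i in range(4))                             # g_p(u, v)
```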
Finally, we prove that the range of \((\psi ,\lambda )\) is of the form \(\Omega _\psi \times \Omega _\lambda \) with \(\Omega _\psi ={\mathbb {R}}^{n-d}\) and \(\Omega _\lambda =A{\mathbb {R}}_+^n=\{Ap\mid p\in {\mathbb {R}}_+^n\}\). It is easy to see that the range of \(\psi \) is \({\mathbb {R}}^{n-d}\) and the range of \(\lambda \) is \(A{\mathbb {R}}_+^n\). Therefore, it is enough to show that \({\mathcal {M}}(p)\cap {\mathcal {E}}(r)\ne \emptyset \) for any pair \(p,r\in {\mathbb {R}}_+^n\). Consider the following convex optimization problem:
\( \min _{q\in {\mathcal {M}}(p)}D(q,r) \)
for given \(p,r\in {\mathbb {R}}_+^n\). Since D(q, r) diverges as q tends to the boundary of \({\mathcal {M}}(p)\), the optimal solution exists and satisfies the stationary condition
\( \log q-\log r=A^\top \nu , \)
where \(\nu \) is the Lagrange multiplier. This implies \(q\in {\mathcal {M}}(p)\cap {\mathcal {E}}(r)\).
See [2] for solving the minimization problem by iterative proportional scaling.
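For the \(2\times 2\) case, iterative proportional scaling admits a very small sketch. The toy loop below is an assumption-laden illustration, not the algorithm of [2]: starting from r and alternately rescaling rows and columns keeps the iterate inside \({\mathcal {E}}(r)\) (each rescaling multiplies q by \(e^{A^\top \delta }\) for some \(\delta \)) while driving its margins to Ap.

```python
# Iterative proportional scaling: find q in M(p) ∩ E(r) for a 2x2 table
# (a toy illustration; see [2] for the general algorithm).
p = [[2.0, 3.0], [1.5, 4.0]]
r = [[1.0, 2.0], [0.5, 3.0]]
q = [row[:] for row in r]
row_t = [sum(row) for row in p]                 # target row margins of Ap
col_t = [p[0][j] + p[1][j] for j in range(2)]   # target column margins
for _ in range(500):
    for i in range(2):                          # row rescaling: an E(r)-move
        s = row_t[i] / sum(q[i])
        q[i] = [s * x for x in q[i]]
    for j in range(2):                          # column rescaling: also an E(r)-move
        s = col_t[j] / (q[0][j] + q[1][j])
        q[0][j] *= s
        q[1][j] *= s
# At convergence q has the margins of p (q in M(p)) and the same
# cross-product ratio as r (q in E(r)), hence q in M(p) ∩ E(r).
```

Each rescaling preserves the cross-product ratio \(q_{11}q_{22}/(q_{12}q_{21})\) exactly, which is why membership in \({\mathcal {E}}(r)\) is maintained throughout.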
Appendix B: Proofs of Theorems
Proof of Theorem 1
Suppose that A is of the form (11).
Then the marginal distribution of Ax is the product of the independent Poisson distributions with mean vector Ap. The conditional distribution of x given Ax depends only on \(\psi (p)=B\log p\) by Lemma 2. Hence, Ax is ancillary. It is straightforward to see that Ax is ancillary if and only if LAx is ancillary for an invertible matrix L.
Conversely, suppose that the row space of A contains the all-one vector and Ax is ancillary. The marginal distribution of \(t=Ax\) is
\( f(t)=e^{-1_n^\top p}\sum _{x:Ax=t}\frac{p^x}{x!}. \)
Since the row space of A contains \(1_n\), \(1_n^\top p\) is identifiable from the marginal distribution of \(1_n^\top x\). Next, we prove that \(\{x\in {\mathbb {N}}^n\mid Ax=Ae_i\}=\{e_j\mid Ae_j=Ae_i\}\) for each i, where \(e_i\) denotes the i-th unit vector in \({\mathbb {R}}^n\). Indeed, if \(x\in {\mathbb {N}}^n\) and \(Ax=Ae_i\), then \(1_n^\top x=1_n^\top e_i=1\) and therefore \(x=e_j\) for some j. Define a partition \(\{I_k\}_{k=1}^K\) of \(\{1,\ldots ,n\}\) as the set of equivalence classes of the relation \(i\sim j \Leftrightarrow Ae_i=Ae_j\).
We have \(f(Ae_i)=\sum _{j\in I_k}p_j e^{-1_n^\top p}\) for \(i\in I_k\). Thus, \(\sum _{j\in I_k}p_j\) is identifiable from the marginal distribution f(t). Since the rank of A is d, we have \(K\ge d\). Suppose that \(K>d\). Define a configuration \({\tilde{A}}=({\tilde{a}}_{ki})\in {\mathbb {N}}^{K\times n}\) by \({\tilde{a}}_{ki}=1\) if \(i\in I_k\) and 0 otherwise. Note that \({\tilde{A}}p\) is identifiable from the discussion so far. Since the rank of \({\tilde{A}}\) is K, there exist two points p and q such that \(Ap=Aq\) and \({\tilde{A}}p\ne {\tilde{A}}q\), which implies \(B\log p\ne B\log q\) because the map \(p\mapsto (Ap,B\log p)\) is one-to-one. Thus, the distribution of f(t) depends on \(\psi (p)=B\log p\) and Ax is not ancillary for \(\psi \). This contradicts the assumption. Hence, \(K=d\), which means A has the form (11). \(\square \)
Proof of Theorem 2
We use the parameterization (4) and prove that \(t=Ax\) is complete for \(\alpha \). The marginal distribution of t is an exponential family
\( f(t)=C(t,\psi )e^{\alpha ^\top t-\phi (\alpha ,\psi )} \)
for fixed \(\psi \), where \(\phi (\alpha ,\psi )=1_n^\top p\) and \(C(t,\psi )=\sum _{Ax=t}e^{\psi ^\top (BB^\top )^{-1}Bx}/x!\). The range of the parameter \(\alpha \) is the whole space \({\mathbb {R}}^d\). Then the completeness follows from a general fact on exponential families. See, for example, Theorem 4.3.1 of [34]. \(\square \)
Proof of Theorem 3
We prove \({\bar{G}}_{\psi \psi }^{x\mid t}=(BD_p^{-1}B^\top )^{-1}\). First, the conditional score function with respect to \(\psi \) is
We show
Indeed, by using a decomposition of the identity matrix into two projection matrices
we obtain
from which (B1) follows. Now, we recall Theorem 1.1 of [1], which says
as \(N\rightarrow \infty \) on the event \(\{A{\hat{p}}\rightarrow Ap\}\), where \({\hat{p}}=x^{(N)}/N\). From (B1), it follows that
on the event \(\{A{\hat{p}}\rightarrow Ap\}\) and, in particular,
Finally, we use the argument given by Liang [18]. Let v be any vector in \({\mathbb {R}}^{n-d}\). The Portmanteau lemma ([36], Lemma 2.2) implies
where Z is a Gaussian random vector having the covariance matrix \((BD_p^{-1}B^\top )^{-1}\). Conversely, we have
because of the relation \(G_{N,\psi \psi }^{x\mid t}\preceq G_{N,\psi \psi }^x=N{\bar{G}}_{\psi \psi }^x\) for finite N. We deduce that \({\bar{G}}_{\psi \psi }^{x\mid t}=\lim N^{-1}E[s_\psi s_\psi ^\top ]\) exists and is equal to \({\bar{G}}_{\psi \psi }^x\). The proof is completed. \(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Sei, T. Conditional inference of Poisson models and information geometry: an ancillary review. Info. Geo. 7 (Suppl 1), 131–150 (2024). https://doi.org/10.1007/s41884-022-00082-w