1 Introduction

Statistical analysis when the number of unknown parameters is comparable with the number of independent observations may demand modification of maximum-likelihood-based methods [7]. There are comparable difficulties with Bayesian analyses based on high dimensional “flat” priors. For an extreme example from a different perspective, see Stein [19].

Yates [22, 23] has discussed these issues in depth both for factorial experiments and also for variety trials in connection with balanced and partially balanced incomplete block designs. His development, powerful and almost explanation free, hinges, especially for incomplete block designs, on the geometry of least squares and the distinction between error-estimating and effect-estimating subspaces. Qualitatively similar forms of argument implicitly underlie the present paper.

Later discussion of these issues has mostly either been in general terms [6, chapter 2] or has approached them from a more decision-oriented perspective (e.g. [20]). In the present paper we examine the considerations involved in the context of the parametric analysis of matched pair survival data. Matched pair designs leading to a large number of nuisance parameters have been considered in various contexts, in particular by Cox [8], Anderson [4], Lindsay [15], Kumon and Amari [14] and Kartsonaki and Cox [12]. A key aspect is the way the potentially large number of nuisance parameters is represented. The first possibility is as realizations from a parametrically specified probability distribution, the second is as a set of unknown constants and the third is as independent and identically distributed random variables with totally unknown distribution. The consequences of the last two are essentially identical; note that the second would be converted into the third by reordering the data at random. By contrast, if appropriate, the stronger assumptions involved in the parametric random effects formulation lead to formally more precise conclusions. We illustrate these points with a theoretical and empirical analysis of the effect of misspecification. Assessment of model adequacy is also discussed. The results aim both to be directly applicable and to illustrate general principles.

2 Issues of formulation

Consider the comparison of two treatments in a matched pair design. For each of n pairs of individuals, one of the pair is a control and the other receives a treatment, leading to observations of survival times for the ith pair represented by random variables \(C_{i}, T_{i}\). We study analyses based on underlying exponential distributions, that is, the observations are in effect the first point events in individual Poisson processes. Study of the systematic variation between treatment and control is in general complicated by variation between pairs.

There are a number of ways to represent this simple situation. We specify them in terms of the rate parameter of the underlying Poisson processes, that is the reciprocal of the exponential means. The two key components specify the relation between \(C_i\) and \(T_i\) and the form of the inter-pair variation.

For a given pair, the Poisson rate under the treatment may be a constant multiple of that under the control. Alternatively the two rates may have a constant difference. There are other possibilities such as that the two mean survival times differ by a constant. The first two representations at least have a clear underlying interpretation in terms of a potential generating process and we largely concentrate on those.

In the formulation in terms of ratios, the rate parameters of \(C_{i}\) and \(T_{i}\) are written \(\gamma _i/\psi \) and \(\gamma _i \psi \), and in the additive formulation are written \(\rho _i - \Delta \) and \(\rho _i + \Delta \). Thus \(\gamma _i\) and \(\rho _i\) are responsible for the inter-pair variation whereas \(\psi \) and \(\Delta \) are key parameters of interest for understanding the effect of the treatment. There is a clear constraint on the parameter space in the second model, namely \(\rho _i > |\Delta |\) so that both rates are positive, and the two representations are in a formal sense rather similar to logistic and additive models for binary data.

To represent in general terms arbitrary systematic variation between pairs of individuals we treat \(\gamma _i\) or \(\rho _i\) either as constants, unknown parameters specific to each pair, or as realizations of random variables. The conceptual differences are considerable although the numerical implications are often minor when the sample is large.

An approach sometimes used in observational studies for which there is no natural pairing involves matching individuals based on the combination of a large number of background variables into a one-dimensional propensity score [18]. If background variables are available and not too numerous we favour using them directly for detailed interpretation. By contrast, the present paper focusses on situations in which component variables are not separately observed.

3 Exponential matched pairs with proportional rates

3.1 Nuisance parameters as arbitrary constants

For the representation involving ratios of rates let \(Z_{i}= T_{i}/C_i\), removing dependence on \(\gamma _i\). The density function at z is

$$\begin{aligned} \psi ^2/(1+\psi ^2 z)^2. \end{aligned}$$
(1)

Standard maximum likelihood theory based on the marginal distribution of the \(Z_i\) applies. In particular, the maximum likelihood estimator of \(\psi \) based on (1) is consistent and asymptotically normally distributed with variance given by the inverse of the Fisher information. The Fisher information per observation is

$$\begin{aligned} (2+2-8/3)\psi ^{-2} = (4/3)\psi ^{-2}. \end{aligned}$$
(2)

By eliminating the nuisance parameters in this way by marginalization, some information on the interest parameter is in general lost, because \( (Z_{1},\ldots , Z_{n}) = S\), say, is not sufficient for \(\psi \). Further discussion of these issues is given in Sect. 7.2. A smaller variance is achievable at the expense of stronger modelling assumptions, as demonstrated in Sect. 3.2.
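As a concrete numerical sketch of this marginal analysis, and assuming illustrative values of \(\psi \), of the sample size and of the fixed pair effects (none of which are taken from the later tables), the estimator based on Eq. (1) can be computed as follows, with the asymptotic variance implied by Eq. (2) alongside.

```python
# Sketch: maximum likelihood for psi based on the marginal density (1) of
# the ratios Z_i = T_i / C_i, with the asymptotic variance from Eq. (2).
# All numerical settings are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, psi = 500, 2.0
gamma = rng.gamma(1.0, 1.0, size=n)              # arbitrary fixed pair effects

t = rng.exponential(1.0 / (gamma * psi))         # T_i has rate gamma_i * psi
c = rng.exponential(psi / gamma)                 # C_i has rate gamma_i / psi
z = t / c

def neg_loglik(p):
    # minus the log likelihood implied by Eq. (1), summed over pairs
    return -np.sum(2.0 * np.log(p) - 2.0 * np.log1p(p ** 2 * z))

psi_hat = minimize_scalar(neg_loglik, bounds=(1e-3, 50.0), method="bounded").x
print("psi_hat:", psi_hat)
print("asymptotic variance from Eq. (2):", 3.0 * psi ** 2 / (4.0 * n))
```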

3.2 Nuisance parameters as random variables

Instead of regarding the pair effects as constants we now suppose that they are random variables independently gamma distributed of shape parameter \(\alpha \) and rate \(\beta \). Then the joint density function of \(T_{i}\) and \(C_{i}\) at \((t,c)\) is

$$\begin{aligned} \frac{\beta ^{\alpha }}{\Gamma (\alpha )}\int _{0}^{\infty }\gamma ^{\alpha +1}\exp \{-\gamma (\psi t + c/\psi + \beta )\}d\gamma = \frac{\Gamma (\alpha +2)}{\Gamma (\alpha )}\frac{\beta ^{\alpha }}{(\psi t + c/\psi + \beta )^{\alpha +2}}.\nonumber \\ \end{aligned}$$
(3)

The Fisher information matrix per observation can be shown (see Appendix A.2) to be block diagonal with the relevant entry for inference about \(\psi \) equal to

$$\begin{aligned} \frac{2(\alpha +2)}{(\alpha +3)\psi ^2}. \end{aligned}$$
(4)

The two limits of this as \(\alpha \rightarrow \infty \) and \(\alpha \rightarrow 0\) are \(2\psi ^{-2}\) and \((4/3)\psi ^{-2}\), the latter being (2), the Fisher information per observation obtained by treating the nuisance parameters as arbitrary constants. The variance depends on the relative dispersion of the nuisance parameters through \(\alpha \).
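For completeness, a minimal sketch of full maximum likelihood under the gamma random effects model follows, maximizing the logarithm of Eq. (3) jointly over \((\alpha ,\beta ,\psi )\); the data-generating settings, the optimizer and the starting values are illustrative assumptions rather than the computations reported later.

```python
# Sketch: joint maximum likelihood for (alpha, beta, psi) under the gamma
# random effects density (3); illustrative settings only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(1)
n, psi, alpha, beta = 500, 2.0, 1.0, 1.0
gamma = rng.gamma(alpha, 1.0 / beta, size=n)     # gamma random effects
t = rng.exponential(1.0 / (gamma * psi))
c = rng.exponential(psi / gamma)

def neg_loglik(par):
    a, b, p = np.exp(par)                        # optimise on the log scale
    q = p * t + c / p + b
    ll = gammaln(a + 2) - gammaln(a) + a * np.log(b) - (a + 2) * np.log(q)
    return -np.sum(ll)

fit = minimize(neg_loglik, x0=np.log([1.0, 1.0, 1.5]), method="Nelder-Mead")
a_hat, b_hat, psi_hat = np.exp(fit.x)
var_eq4 = (a_hat + 3) * psi_hat ** 2 / (2 * n * (a_hat + 2))   # inverse of n times Eq. (4)
print("psi_hat:", psi_hat, " approximate variance:", var_eq4)
```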

See Sect. 7.1 for a formulation in terms of unobserved covariates involving a log normal distribution over the \(\gamma _i\).

Equation (4) shows that the gamma random effects formulation is more efficient than the one in which nuisance parameters are treated as arbitrary constants, provided that the random effects specification is reasonable. The modelling assumption is more severe, but the following analysis of the misspecified situation shows that, provided \(\psi \) is bounded away from zero, the corresponding maximum likelihood estimator \({\hat{\psi }}\), obtained by assuming the gamma random effects model of Sect. 3.2, converges almost surely to \(\psi \) as \(n\rightarrow \infty \). Thus \({\hat{\psi }}\) remains consistent in spite of an arbitrary degree of misspecification in the assumed random effects distribution.

Let \(\gamma _i\) (\(i=1,\ldots ,n\)) be independent random variables with an arbitrary density function \(f(\gamma )\). The associated joint distribution of \(T_i\) and \(C_i\) satisfies (see Appendix A.3)

$$\begin{aligned} E\left\{ \frac{T_{i}}{(T_{i}\psi + C_{i}/\psi + \beta )^{j}}\right\} =\frac{1}{\psi ^2} E\left\{ \frac{C_{i}}{(T_{i}\psi + C_{i}/\psi + \beta )^{j}}\right\} \quad (j=1,2,3,\ldots ).\nonumber \\ \end{aligned}$$
(5)

In view of the expressions for the cross partial derivatives of the log likelihood function (Eq. (28) in Appendix A.2), Eq. (5) establishes orthogonality of \(\psi \) to \(\alpha \) and \(\beta \) whatever the random effects distribution. The interpretation of the notional parameters \(\alpha \) and \(\beta \) under model misspecification is discussed below. The orthogonality justifies consideration of the marginal maximum likelihood estimating equation for \(\psi \), i.e.

$$\begin{aligned} 0 = \frac{1}{n}\sum _{i=1}^{n}\ell _{i,\psi }({\hat{\psi }})=\frac{1}{n}\sum _{i=1}^{n}\frac{C_{i}}{{\hat{\psi }}^2(T_{i}{\hat{\psi }} + C_{i}/{\hat{\psi }} + \beta )} - \frac{1}{n}\sum _{i=1}^{n}\frac{T_{i}}{T_{i}{\hat{\psi }} + C_{i}/{\hat{\psi }} + \beta }. \end{aligned}$$
(6)

For any \(\kappa >0\) bounded away from zero, consider

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\frac{C_{i}}{\kappa ^2(T_{i}\kappa + C_{i}/\kappa + \beta )} - \frac{1}{n}\sum _{i=1}^{n}\frac{T_{i}}{T_{i}\kappa + C_{i}/\kappa + \beta }. \end{aligned}$$
(7)

Under the random effects formulation, the summands are independent and identically distributed and a law of large numbers implies convergence of the averages to their expectations. The limiting value of the maximum likelihood estimator, as \(n\rightarrow \infty \), is the value of \(\kappa \) that equalizes the two expectations. Appendix A.4 shows that the expectations exist and the value of \(\kappa \) that equalizes them is \(\psi \). Thus \({\hat{\psi }}\) is consistent despite the misspecification.
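The argument can be illustrated numerically. In the sketch below the pair effects are deliberately log-normal rather than gamma, \(\beta \) is held at an arbitrary positive value, and the root in \(\kappa \) of the empirical version of (7) is nevertheless close to \(\psi \) for large n; the sample size and the log-normal scale are illustrative assumptions.

```python
# Sketch: root of the empirical estimating function (7) under a misspecified
# (log-normal) random effects distribution; illustrative settings only.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n, psi, beta, tau = 200000, 2.0, 1.0, 1.5
gamma = np.exp(tau * rng.standard_normal(n))     # misspecified pair effects
t = rng.exponential(1.0 / (gamma * psi))
c = rng.exponential(psi / gamma)

def estimating_function(kappa):
    d = t * kappa + c / kappa + beta
    return np.mean(c / (kappa ** 2 * d)) - np.mean(t / d)

print("root:", brentq(estimating_function, 0.1, 20.0), " true psi:", psi)
```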

An analysis of efficiency is harder. Let \(g_{\theta ^*}\) denote the density function of the true joint distribution of \((T_{i},C_{i})\), where \(\theta ^* = (\lambda , \psi )\) and \(\lambda \) could be a finite or infinite dimensional nuisance parameter, but the proportional rates model of Sect. 2 is assumed so that \(\psi \) captures the treatment effect. This joint density is determined by the marginal density function of the random effects distribution \(f(\gamma )\) as

$$\begin{aligned} g_{\theta ^*}(t,c)=\int _{0}^{\infty }\gamma ^2 f(\gamma ) \exp \{-\gamma (t\psi + c/\psi )\} d\gamma . \end{aligned}$$
(8)

Thus if f is not parameterized, \(\lambda \) is f itself. Let \(\Theta \) denote the parameter space for the erroneous gamma random effects model and let \(f_{\theta }(x,y)\) denote the misspecified joint density function of each \((T_i, C_i)\) at \((x,y)\), given by Eq. (3). Thus we may define \({\hat{\theta }}=({\hat{\alpha }},{\hat{\beta }},{\hat{\psi }})\) by \(\mathop {\mathrm {argmax}}_{v\in \Theta } \sum _{i=1}^{n}\log f_{v}(T_{i},C_{i})\), which converges almost surely (Appendix A.1) to

$$\begin{aligned} \theta = (\theta _1,\theta _2,\theta _3) = \mathop {\mathrm {argmin}}_{v\in \Theta }\int _{0}^{\infty } \int _{0}^{\infty } \log \frac{g_{\theta ^*}(x,y)}{f_{v}(x,y)}g_{\theta ^*}(x,y)dxdy, \end{aligned}$$
(9)

where, from the previous derivations, \(\theta _3=\psi \), the true treatment effect. Thus \(\alpha =\theta _1\) and \(\beta = \theta _2\) are the values that minimize the Kullback–Leibler divergence between the assumed (erroneous) model and the true model.

By the orthogonality established in (5), a discussion of efficiency requires consideration of the likelihood derivatives only with respect to \(\psi \). In particular, by the established consistency, a mean value expansion and standard arguments, it can be shown that the asymptotic distribution of \({n^{1/2}({\hat{\psi }} - \psi )}\) is Gaussian of zero mean and variance \([E\{\ell _{i,\psi \psi }(\theta )\}]^{-2}E\{\ell _{i,\psi }^{2}(\theta )\}\), leading to a variance of \(R/(R-Q)^{2}\), where R and Q depend in a rather complicated way on the density function \(f(\gamma )\) of the true random effects distribution. Specifically

$$\begin{aligned} R= & {} \frac{1}{\psi ^2}\left\{ \frac{1}{3} - \frac{2\beta }{3}E(\gamma _i) - \frac{\beta ^2}{3}E(\gamma ^2_{i})\right. \nonumber \\&\left. -\beta ^2\int _{0}^{\infty }\gamma ^2 f(\gamma ) e^{\gamma \beta } \text {Ei}(-\gamma \beta )d\gamma -\frac{\beta ^3}{3}\int _{0}^{\infty }\gamma ^3 f(\gamma )e^{\gamma \beta }\text {Ei}(-\gamma \beta )d\gamma \right\} , \nonumber \\ Q= & {} \frac{1}{\psi ^2}\left\{ 1+\beta ^2 \int _{0}^{\infty }\gamma ^2 f(\gamma )e^{\gamma \beta }\text {Ei}(-\gamma \beta )d\gamma - \beta E(\gamma _{i})\right\} . \end{aligned}$$
(10)

Here \(\text {Ei}(x)\) is the exponential integral [17, equation 3.07]; thus, in Eq. (10),

$$\begin{aligned} \text {Ei}(-\gamma \beta ) = -\int _{\gamma \beta }^{\infty } z^{-1}e^{-z}dz, \quad \gamma \beta >0, \end{aligned}$$

and because the \(\gamma _i\) are treated as totally random, \(E(\gamma _i^{\kappa })=\int _{0}^{\infty }\gamma ^\kappa f(\gamma )d\gamma \).

In a correctly specified situation, \(\{E(\ell _{i,\psi \psi })\}^{-2}E(\ell _{i,\psi }^{2})\) is the inverse Fisher information. When the random effects are gamma distributed with shape \(\alpha \) and rate \(\beta \), as assumed, \(Q=2(\alpha +2)^{-1}\psi ^{-2}\) and \(R=2(\alpha +3)^{-1}(\alpha +2)^{-1}\psi ^{-2}\) so that \(R/(R-Q)^2\) is \(2^{-1}(\alpha +2)^{-1}(\alpha +3)\psi ^2\), i.e. the reciprocal of Eq. (4). While formula (10) does not seem amenable to detailed interpretation under misspecification, it serves to illustrate complicated dependence on key aspects of the formulation.

Table 4 of Sect. 6.2 shows that the loss of efficiency in the gamma model for random effects can be severe when the sample size is not large and when the random effects distribution is misspecified. Thus, while the random effects formulation is in principle always feasible for nuisance parameter problems, the adequacy of the choice of random effects distribution, often made on the basis of mathematical convenience, needs consideration. A discussion in the context of the present example is in Sect. 5.

4 Exponential matched pairs with additive rates

When the nuisance parameters \(\rho _i\) of the additive treatment effects model (see Sect. 2) are treated as arbitrary constants, the inference is based on conditioning on the sufficient statistic for the nuisance parameter in each pair [12]. We extend their results slightly by giving explicit expressions for the conditional and unconditional variances of the estimator. The likelihood contribution from the ith pair is

$$\begin{aligned} (\rho _i^2 - \Delta ^{2}) \exp \{-\rho _i (t_{i}+c_{i})\} \exp \{-\Delta (t_{i}-c_i)\}. \end{aligned}$$
(11)

Thus \(T_{i}+C_{i}\) is sufficient for \(\rho _i\) and this leads to inference based on the difference \(T_{i}-C_i\), or equivalently \(T_i\) given the pairwise totals \(T_{i}+C_{i}=S_{i}\), say. The density function of \(S_i\) at s is

$$\begin{aligned} (\rho _{i}^{2}-\Delta ^{2})\{e^{-(\rho _i - \Delta )s} - e^{-(\rho _{i}+\Delta )s}\}/(2\Delta ). \end{aligned}$$
(12)

Some algebra shows that the conditional density function of \(T_{i}\) at t given \(S_{i}=s_i\) is, for \(\Delta > 0\),

$$\begin{aligned} \frac{2\Delta e^{-2\Delta t}}{1-e^{-2\Delta s_i}}. \end{aligned}$$
(13)

Let \({\hat{\Delta }}\) denote the maximum likelihood estimator of \(\Delta \) based on the conditional density function (13). The Fisher information for \(\Delta \), conditional on \(S_{i}=s_i\) \((i=1,\ldots ,n)\), is

$$\begin{aligned} \frac{n}{\Delta ^2} - 4\sum _{i=1}^{n}\frac{s_{i}^{2}e^{-2\Delta s_i}}{(1-e^{-2\Delta s_i})^{2}} = \frac{n}{\Delta ^2} - {\sum _{i=1}^{n}} s_{i}^2\sinh ^{-2}(\Delta s_{i}), \end{aligned}$$
(14)

where \(s \sinh ^{-1}(\Delta s) < \Delta ^{-1}\) for all \(s>0\) and \(\lim _{s\rightarrow 0} \{s \sinh ^{-1}(\Delta s)\}=\Delta ^{-1}\) so that the conditional Fisher information is non-negative. For planning, the unconditional Fisher information is relevant. This is used for determining the sample size required to achieve a pre-specified conditional efficiency with high probability, and is obtained by replacing the ith summand by

$$\begin{aligned}&\frac{(\rho _{i}^{2}-\Delta ^{2})}{2\Delta }\int _{0}^{\infty }\frac{s^{2}(e^{\Delta s} - e^{-\Delta s})\exp (-\rho _{i}s)}{\sinh ^{2}(\Delta s)} ds \nonumber \\&\quad = \frac{(\rho _{i}^{2}-\Delta ^{2})}{\Delta }\int _{0}^{\infty }\frac{s^{2}\exp (-\rho _{i}s)}{\sinh (\Delta s)} ds = \frac{(\rho _{i}^{2}-\Delta ^{2})}{4\Delta ^4}\int _{0}^{\infty }\frac{t^{2}e^{-q t}}{1-e^{-t}} dt, \end{aligned}$$
(15)

where \(q=(\rho _{i}+\Delta )/(2\Delta )\) and in the last line we have changed variables to \(t=2\Delta s\). The integral and summation representations of Riemann’s generalized zeta function are [21, p265–66]

$$\begin{aligned} \zeta (z,q)=\frac{1}{\Gamma (z)}\int _0^\infty \frac{t^{z-1}e^{-qt}}{1-e^{-t}}dt = \sum _{m=0}^{\infty }\frac{1}{(q+m)^z}, \end{aligned}$$

and the unconditional Fisher information is, from (15),

$$\begin{aligned} \frac{n}{\Delta ^2}-\frac{1}{2\Delta ^4}\sum _{i=1}^{n}(\rho _{i}^2 - \Delta ^2)\zeta \{3,(\rho _i + \Delta )/(2\Delta )\}. \end{aligned}$$
(16)

Section 6.1 confirms the above calculations by simulation.
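One part of these calculations can be checked directly: each summand of the conditional information (14) has, under density (12), the expectation appearing in Eq. (16). The sketch below compares a Monte Carlo estimate of that expectation with the Hurwitz zeta expression, for illustrative values of \(\rho \) and \(\Delta \).

```python
# Sketch: Monte Carlo check that E{ s^2 / sinh^2(Delta s) } equals
# (rho^2 - Delta^2) * zeta(3, (rho + Delta) / (2 Delta)) / (2 Delta^4),
# the quantity subtracted in Eq. (16); illustrative values of rho and Delta.
import numpy as np
from scipy.special import zeta                    # zeta(x, q) is the Hurwitz zeta function

rng = np.random.default_rng(3)
rho, Delta, m = 3.0, 2.0, 10 ** 6
s = rng.exponential(1.0 / (rho + Delta), m) + rng.exponential(1.0 / (rho - Delta), m)

mc = np.mean(s ** 2 / np.sinh(Delta * s) ** 2)
closed = (rho ** 2 - Delta ** 2) * zeta(3, (rho + Delta) / (2 * Delta)) / (2 * Delta ** 4)
print("Monte Carlo:", mc, " closed form:", closed)
```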

Among other possibilities, the pair effects might be assumed to have a gamma distribution of shape \(\alpha \) and rate \(\beta \), shifted to start at \(\Delta \), leading to a joint density function of \(T_{i}\) and \(C_{i}\) at \((t,c)\) given by

$$\begin{aligned} \frac{\alpha \beta ^\alpha \exp (-2\Delta t)}{(t+c+\beta )^{\alpha +1}}\left\{ \frac{\alpha +1}{t+c+\beta } + 2\Delta \right\} . \end{aligned}$$
(17)

Standard maximum likelihood theory applies when the random effects distribution is correctly specified. An analysis of misspecification of this model is complicated by the fact that the parameters \(\alpha \) and \(\beta \) are not orthogonal to \(\Delta \) under arbitrary misspecification. Thus a full theoretical analysis of the kind developed in Sect. 3.2 will not be explored for the maximum likelihood estimator \({\tilde{\Delta }}\) based on Eq. (17). However Table 5 of Sect. 6.2 provides numerical evidence that severe loss of efficiency can result, relative to the version that treats the nuisance parameters as arbitrary constants. Consistency of \({\tilde{\Delta }}\) is also suspect. A referee asked whether there is any mathematically convenient distribution for the nuisance parameters that results in orthogonality of the nuisance parameters to the interest parameter \(\Delta \) in the additive rates model in spite of possible misspecification. In principle, if the true distribution of \(T_i\) and \(C_i\) is known and given in terms of parameters \((\Delta , \alpha , \beta )\), say, a reparameterization to \((\Delta , \lambda , \eta )\), say, can always be found such that \(\lambda \) and \(\eta \) are orthogonal to \(\Delta \). This entails solving the pair of differential equations

$$\begin{aligned} i^*_{\alpha \alpha }\frac{\partial \alpha (\Delta , \lambda , \eta )}{\partial \Delta } + i^*_{\beta \alpha } \frac{\partial \beta (\Delta , \lambda , \eta )}{\partial \Delta }= & {} -i^*_{\Delta \alpha } \\ i^*_{\alpha \beta }\frac{\partial \alpha (\Delta , \lambda , \eta )}{\partial \Delta } + i^*_{\beta \beta } \frac{\partial \beta (\Delta , \lambda , \eta )}{\partial \Delta }= & {} -i^*_{\Delta \beta }, \end{aligned}$$

first to determine the dependence of \(\alpha \) and \(\beta \) on \(\Delta \) and then to choose \(\lambda \) and \(\eta \) as detailed by Cox and Reid [9]. However, in the above display \(i^{*}_{\alpha \beta }\), \(i^*_{\Delta \beta }\), etc. are the expectations of the second cross partial derivatives of the assumed log likelihood function, taken with respect to the true model. These expressions differ depending on the form of misspecification. An extension of the ideas of Cox and Reid [9] to accommodate arbitrary misspecification is an important question which demands further study, ideally in full generality.

5 Assessment of model adequacy

In the above two models, exact tests of model adequacy are available. Sufficiency represents a separation of the information in the data into that relevant for estimating the parameters of a given model and that relevant for assessing the adequacy of the model [6, p.29]. Suppose that the proportional treatment effect model of Sect. 3 holds. The likelihood contribution from the ith pair is

$$\begin{aligned} \gamma _i^{2} \exp (-\gamma _i c_{i}/\psi ) \exp (-\gamma _{i}\psi t_{i}). \end{aligned}$$
(18)

From this, for any given \(\psi \), \(C_i/\psi + T_i\psi =S_{i}(\psi )\), say, is sufficient for \(\gamma _i\) and has density function

$$\begin{aligned} f_{S_{i}(\psi )}(s) = \gamma _i^{2} s \exp (-\gamma _i s), \end{aligned}$$
(19)

i.e., \(S_{i}(\psi )\) is gamma distributed with shape parameter 2 and rate parameter \(\gamma _i\).

The model and an arbitrarily specified parameter value \(\psi =\psi _0\) are jointly compatible with the data if the realization of \(\psi T_i\), say, is not extreme relative to its conditional density function given \(S_{i}(\psi )=s_i(\psi )\), assuming \(\psi =\psi _0\). Since \(\psi T_i\) and \(C_i/\psi \) are independent exponential random variables of rate \(\gamma _i\), the conditional density of \(\psi T_i\) at x, given \(S_{i}(\psi )=s_i(\psi )\), is

$$\begin{aligned} \frac{\gamma _i^{2} \exp \{-\gamma _i s_i(\psi )\}}{\gamma _i^{2} s_i(\psi ) \exp \{-\gamma _i s_i(\psi )\}} = \frac{1}{s_i(\psi )}, \end{aligned}$$
(20)

showing that \(\psi T_{i}\mid \{S_{i}(\psi )=s_i(\psi )\}\) is uniformly distributed between 0 and \(s_i(\psi )\).

For any hypothesized value \(\psi _{0}\) of \(\psi \), compatibility of the proportional treatment effects model and \(\psi _0\) with the data corresponds to compatibility of the realizations of \(\psi _0 T_{i}/s_{i}(\psi _0)=U_{i}(\psi _{0})\), say, with a uniform distribution on (0,1) for all \(i=1,\ldots ,n\). This is a basis for checking consistency with the proportional model. More specifically, an \(\alpha \)-level confidence set using Fisher’s [11, section 21.1] test is

$$\begin{aligned}&{\mathcal {C}}(\alpha ) \triangleq \left( \psi _0 \in \Psi : \min \left[ F\left\{ -2{\sum _{i=1}^{n}} \log U_{i}(\psi _0)\right\} , 1-F\left\{ -2{\sum _{i=1}^{n}}\log U_{i}(\psi _0)\right\} \right] \ge \alpha \right) ,\nonumber \\ \end{aligned}$$
(21)

where F is the distribution function of a \(\chi ^{2}\) random variable with 2n degrees of freedom. If the confidence set is non-empty at a specified level, there are at least some values of \(\psi _0\) for which the proportional treatment effects model is compatible with the data at this level.

For sufficiently large sample size, one might treat \({\hat{\psi }}\) as fixed and equal to \(\psi \) under the null hypothesis that the model is true. The adequacy of the model can then be assessed by checking the compatibility of the realizations of \({\hat{\psi }} T_i/s_i({\hat{\psi }})\) for \(i=1,\ldots ,n\) with a uniform distribution on (0, 1).
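A sketch of the scanning version of this assessment is given below: at each candidate \(\psi _0\) the quantities \(U_i(\psi _0)\) are combined through Fisher's statistic and referred to the \(\chi ^2\) distribution with 2n degrees of freedom, retained values forming the confidence set of Eq. (21). The data-generating settings, with \(\psi =1\) anticipating Sect. 6.3, are illustrative assumptions.

```python
# Sketch: joint assessment of the proportional rates model and psi_0 via
# Fisher's combination of the U_i(psi_0), cf. Eq. (21); illustrative settings.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, psi = 100, 1.0
gamma = rng.gamma(1.0, 1.0, size=n)
t = rng.exponential(1.0 / (gamma * psi))
c = rng.exponential(psi / gamma)

alpha = 0.05
accepted = []
for psi0 in np.arange(0.01, 3.0, 0.01):
    s = psi0 * t + c / psi0                      # s_i(psi_0)
    u = psi0 * t / s                             # U_i(psi_0), uniform at the true psi
    stat = -2.0 * np.sum(np.log(u))
    if min(chi2.cdf(stat, 2 * n), chi2.sf(stat, 2 * n)) >= alpha:
        accepted.append(psi0)

if accepted:
    print("confidence set approximately [%.2f, %.2f]" % (accepted[0], accepted[-1]))
else:
    print("model rejected at all values of psi_0 on the grid")
```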

The same ideas allow the adequacy of the random effects model to be checked. In particular, for any given \(\psi \), the collection of weighted sums \(S_{i}(\psi )\) for \(i=1,\ldots ,n\) is sufficient for the nuisance parameters \(\alpha \) and \(\beta \), as can be seen from Eq. (3). One could condition as above.

For sufficiently many pairs, however, a simpler option is available due to the small number of nuisance parameters in the random effects model. The distribution function at s of \(S_{i}=T_{i}+C_{i}\) under the gamma random effects model is given by

$$\begin{aligned} 1- \frac{\beta ^\alpha }{\psi ^2-1}\left\{ \frac{\psi ^2}{(\beta + s/\psi )^{\alpha }} - \frac{1}{(\beta +s\psi )^{\alpha }} \right\} . \end{aligned}$$
(22)

Since the maximum likelihood estimators \({\tilde{\alpha }}\), \({\tilde{\beta }}\) and \({\tilde{\psi }}\) are consistent and completely specify the model, for sufficiently many individuals it may often be a reasonable approach to consider these as fixed and equal to the true values \(\alpha \), \(\beta \) and \(\psi \) under the null hypothesis that the gamma random effects model is correctly specified. Making this replacement in Eq. (22) and evaluating the distribution function at the points \(S_{i}\) for \(i=1,\ldots ,n\) leads to approximately standard uniformly distributed points under the null hypothesis, and Fisher’s [11, section 21.1] test is applicable.
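In outline, the check is a probability integral transform of the observed totals through Eq. (22) at the fitted values, followed by Fisher's combination test. The sketch below assumes fitted values \(({\tilde{\alpha }},{\tilde{\beta }},{\tilde{\psi }})\) are already available, for instance from a maximization of the log of Eq. (3), and that \({\tilde{\psi }}\ne 1\) so that (22) is well defined; the two-sided combination of the tail probabilities is one convenient choice rather than a prescription of the paper.

```python
# Sketch: adequacy check of the gamma random effects model via the fitted
# distribution function (22) of S_i = T_i + C_i and Fisher's combination test.
import numpy as np
from scipy.stats import chi2

def gamma_re_cdf(s, a, b, p):
    # distribution function (22); requires p != 1
    return 1.0 - b ** a / (p ** 2 - 1.0) * (p ** 2 / (b + s / p) ** a - 1.0 / (b + s * p) ** a)

def adequacy_pvalue(t, c, a_hat, b_hat, psi_hat):
    u = gamma_re_cdf(t + c, a_hat, b_hat, psi_hat)   # approximately U(0,1) under the null
    stat = -2.0 * np.sum(np.log(u))
    n = len(t)
    return 2.0 * min(chi2.cdf(stat, 2 * n), chi2.sf(stat, 2 * n))
```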

Similar arguments apply to the additive effects model. Section 4 shows that \(S_i =T_{i}+C_{i}\) is sufficient for the nuisance parameter \(\rho _i\), so that the conditional density of \(T_i\) given \(S_{i}=s_i\) is free of \(\rho _i\) and is given by Eq. (13). In Sect. 4, this justified estimation of the treatment effect \(\Delta \) by maximization of the conditional likelihood based on (13). To assess model adequacy it is necessary to condition on the jointly sufficient statistic for all unknown parameters. Thus, as in the proportional rates formulation, one must fix \(\Delta \) at hypothesized values leading to a joint assessment of the adequacy of the additive effects model at an arbitrary but given value \(\Delta _0\) of the interest parameter. The model and a value \(\Delta _0\) are compatible with the data at a particular level if \(T_1,\ldots , T_n\) are not extreme relative to what would be expected under their joint conditional density assuming \(\Delta =\Delta _0\), i.e.,

$$\begin{aligned} \prod _{i=1}^{n}f_{T_{i}\mid S_{i}=s_i}(t_{i};\Delta _0)=\prod _{i=1}^{n}\frac{2\Delta _0 e^{-2\Delta _0 t_i}}{1-e^{-2\Delta _0 s_i}}. \end{aligned}$$

As in the proportional rates model, for sufficiently large sample size, one might reasonably treat \({\hat{\Delta }}\) as fixed and equal to \(\Delta \) under the null hypothesis that the additive rates model is true and proceed as above using \({\hat{\Delta }}\) in place of \(\Delta _0\) to assess the model.

There are situations where exact tests of model adequacy based on these principles do not seem feasible. One example in the spirit of this work would be an exponential matched pair problem in which \(T_i\) and \(C_i\) have a constant difference in means. In Sect. 7.2, we explain in more general terms how the structure of the inference problem dictates the appropriate strategy.

6 Empirical validation and numerical extensions

6.1 Fixed nuisance parameters

Throughout the following numerical work \(\psi =\Delta =2\). For several different values of n we generate \((\gamma _i)_{i=1}^{n}\) from a gamma distribution of shape \(\alpha =1\) and rate \(\beta =1\), and we define \(\rho _i=\Delta +\gamma _i\) so that \(\rho _i-\Delta >0\). The nuisance parameters \((\gamma _i)_{i=1}^{n}\) and \((\rho _i)_{i=1}^{n}\) are then fixed over Monte Carlo replications.

In each of \(R=1000\) Monte Carlo replications, \(T_{i}^{\text {(PR)}}\) and \(C_{i}^{\text {(PR)}}\) (\(i=1,\ldots , n\)) are generated independently from exponential distributions of rates \(\gamma _i\psi \) and \(\gamma _i/\psi \) respectively, and \(T_{i}^{\text {(AR)}}\) and \(C_{i}^{\text {(AR)}}\) are generated from exponential distributions of rates \(\rho _i+\Delta \) and \(\rho _i-\Delta \). The parameter \(\psi \) in the proportional rates model is estimated by maximum likelihood based on the density function of \(T_{i}^{\text {(PR)}}/C_{i}^{\text {(PR)}}\) of Eq. (1). Let \({\hat{\psi }}_{n}\) denote this estimator.

The sample variance of \({\hat{\psi }}_n\) over the 1000 Monte Carlo replications is reported in the second row of Table 1, with an estimate of its theoretical standard error in the third row. This is based on the \(\chi ^2\) distribution with \(R-1\) degrees of freedom of the sample variance. The theoretical variance of \({\hat{\psi }}_n\) is asymptotically (as \(n\rightarrow \infty \)) the inverse of the Fisher information. Its theoretical value obtained from Eq. (2) is reported below the row of standard errors. The values in the second and the fourth rows agree for large n.
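For definiteness, the standard error attached to a Monte Carlo sample variance under this \(\chi ^2\) approximation may be computed as in the following small sketch; the approximate normality of the replicated estimates implicit in the \(\chi ^2_{R-1}\) calibration is an assumption.

```python
# Sketch: sample variance of Monte Carlo estimates and its standard error.
# If (R-1) * v / sigma^2 is chi-squared with R-1 degrees of freedom, the
# variance of v is 2 * sigma^4 / (R-1), estimated by plugging in v itself.
import numpy as np

def variance_with_se(estimates):
    R = len(estimates)
    v = np.var(estimates, ddof=1)
    return v, v * np.sqrt(2.0 / (R - 1))
```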

We also report the results from fitting a gamma random effects model to \(T_{i}^{\text {(PR)}},C_{i}^{\text {(PR)}}\) for \(i=1,\ldots ,n\). Let \({\tilde{\psi }}_n\) denote the corresponding maximum likelihood estimator of \(\psi \). This model is misspecified but the efficiency of \({\tilde{\psi }}_n\) is high. However the model is not severely misspecified because the \((\gamma _i)_{i=1}^{n}\) are generated from a gamma distribution before being fixed across Monte Carlo replications. In Sect. 6.2, we consider the effect of more severe misspecification of the random effects distribution.

Table 1 The generating process is the proportional rates model with fixed \((\gamma _i)_{i=1}^{n}\)

The parameter \(\Delta \) from the additive rates model is estimated using maximum likelihood based on the conditional density function of \(T_{i}^{\text {(AR)}}\) given the realization of \(T_{i}^{\text {(AR)}}+C_{i}^{\text {(AR)}}\). This is Eq. (13). Let \({\hat{\Delta }}_n\) denote this maximum likelihood estimator. The Monte Carlo variance of \({\hat{\Delta }}_n\) is reported in the second row of Table 2, with its estimated theoretical standard error in the third row. The unconditional variance based on Eq. (16) is reported in the fourth row together with the Monte Carlo average of the conditional variances based on (14) in the fifth row. The two agree to a close approximation and they also agree with the Monte Carlo sample variances for sufficiently large n.

Table 2 The generating process is the additive rates model with fixed \((\rho _i)_{i=1}^n\)

6.2 Randomly generated nuisance parameters

The simulation studies are the same as in Sect. 6.1 except that \((\gamma _i)_{i=1}^{n}\) and the \((\rho _i)_{i=1}^{n}\) are generated anew in each Monte Carlo replication. Thus the models in which these nuisance parameters are treated as arbitrary constants are misspecified. In particular, dependence between both versions of \(T_{i}\) and \(C_{i}\) is induced by the generating mechanism for \(\gamma _i\) and \(\rho _i\).

Table 3 contains analogous information to the top three rows of Table 1 for the misspecified case. The theoretically true variances have not been calculated and so are not reported. However, the sample variances are very close to the theoretical asymptotic variances that would obtain if the nuisance parameters were arbitrary constants (cf. fourth row of Table 1). We also report the Monte Carlo variance of \({\tilde{\psi }}_n\), now under a correctly specified model, and its theoretical asymptotic variance based on Eq. (4). Comparing the fifth and last rows of Table 3, these agree for sufficiently large n.

Table 3 The generating process is the proportional rates model with gamma distributed (\(\alpha =1\), \(\beta =1\)) random effects \((\gamma _i)_{i=1}^{n}\)

To assess the efficiency of \({\tilde{\psi }}_n\) under fairly extreme misspecification of the random effects distribution, we conduct the same experiment but with the \((\gamma _i)_{i=1}^{n}\) drawn from a log normal distribution with scale parameter \(\tau =10\). For comparison, the Monte Carlo variances of \({\hat{\psi }}_n\) are also reported in Table 4. The conclusion from this analysis is that while \({\hat{\psi }}_{n}\), justified under the assumption that the nuisance parameters are arbitrary constants, has a stable variance when the nuisance parameters are drawn from a rather extreme random effects distribution, the variance of \({\tilde{\psi }}_n\) is appreciably larger when the random effects distribution is misspecified in this way.

Table 4 The generating process is the proportional rates model with log normally distributed (\(\tau =10\)) random effects \((\gamma _i)_{i=1}^{n}\)

We now consider the effect of misspecification of the random effects distribution in the additive rates model by comparing the estimator \({\hat{\Delta }}\) of Sect. 4 to the maximum likelihood estimator \({\tilde{\Delta }}\) obtained by erroneously assuming that the joint density function of \(T_{i}\) and \(C_{i}\) is given by Eq. (17). Rather than being a gamma distribution starting at \(\Delta \), the true distribution of the \(\rho _i\) is a log normal distribution of scale parameter \(\tau =10\) starting at \(\Delta \). Although the theoretical variance of \({\hat{\Delta }}\) has not been calculated under the random effects formulation, the ones based on Eqs. (16) and (14) are reported in the fourth and fifth rows of Table 5. As before, the estimated standard errors in the third and eighth rows are based on a \(\chi ^2\) distribution with \(R-1\) degrees of freedom for the sample variance, where R is the number of Monte Carlo replications.
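In outline, the comparison can be reproduced as follows: the conditional estimator based on Eq. (13) is set against the maximum likelihood estimator under the, here misspecified, density (17), with shifted log-normal pair effects. All numerical settings are illustrative, and the behaviour of the misspecified fit may be erratic, which is the point at issue.

```python
# Sketch: conditional estimator of Delta from Eq. (13) versus maximum
# likelihood under the misspecified random effects density (17);
# illustrative settings only.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(5)
n, Delta, tau = 500, 2.0, 10.0
rho = Delta + np.exp(tau * rng.standard_normal(n))   # shifted log-normal pair effects
t = rng.exponential(1.0 / (rho + Delta))
c = rng.exponential(1.0 / (rho - Delta))
s = t + c

def neg_cond_loglik(d):
    # minus the conditional log likelihood from Eq. (13)
    return -np.sum(np.log(2 * d) - 2 * d * t - np.log(-np.expm1(-2 * d * s)))

def neg_re_loglik(par):
    # minus the log likelihood under the assumed density (17)
    a, b, d = np.exp(par)
    q = s + b
    return -np.sum(np.log(a) + a * np.log(b) - 2 * d * t
                   - (a + 1) * np.log(q) + np.log((a + 1) / q + 2 * d))

d_cond = minimize_scalar(neg_cond_loglik, bounds=(1e-3, 20.0), method="bounded").x
d_re = np.exp(minimize(neg_re_loglik, np.log([1.0, 1.0, 1.0]), method="Nelder-Mead").x[2])
print("conditional estimate:", d_cond, " random effects estimate:", d_re)
```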

Table 5 The generating process is the additive rates model with log normally distributed (\(\tau =10\)) random effects \((\rho _i)_{i=1}^{n}\) shifted by \(\Delta \)

6.3 Assessment of model adequacy in the proportional rates model

To illustrate the ideas in Sect. 5 we consider the data generating process corresponding to Table 1 with \(\psi =1\). This is the value of \(\psi \) that equalises the distributions of responses for treated individuals and controls. In each of 1000 Monte Carlo replications we calculate \(\psi _0 T_i/s_i(\psi _0)=U_i(\psi _0)\) for all \(\psi _0\) between zero and three in increments of 0.01 and for \(i=1,\ldots , n\) with the values of n reported in Table 6. We use the composite of these values to produce a confidence set for \(\psi \) as in Eq. (21). Table 6 reports the simulated coverage probabilities of the \(\alpha \)-level confidence sets for \(\alpha \in \{0.01,0.05\}\). While the confidence sets need not be intervals in general, they turned out to be intervals in all our Monte Carlo replications; we therefore report the lower and upper boundaries of these confidence intervals, averaged over Monte Carlo replications.

Table 6 The generating process is the proportional rates model with fixed \((\gamma _i)_{i=1}^{n}\) and \(\psi =1\)

The interpretation of the numbers in Table 6 is that the proportional rates model with fixed nuisance parameters is compatible with the data at level \(\alpha \) for any value of \(\psi _0\) taking values in \({\mathcal {C}}(\alpha )\) defined by Eq. (21).

7 Discussion and open problems

7.1 A synthesis with earlier literature

The choice of random effects distribution in Sect. 3.2 was primarily one of mathematical convenience. It coincides with typical usage in applications and raises conceptual issues: (i) To what extent is the random effects formulation a plausible representation of the data generating mechanism? (ii) Are there statistical advantages of assuming a parametric random effects model even if the formulation is physically implausible? (iii) Are there statistical advantages of treating nuisance parameters as arbitrary constants when there is a probabilistic generating mechanism for them?

Our analysis has shown the need to be wary of assumptions made for mathematical convenience with no substantive basis. The following example shows how a different distribution for the random effects may be more plausible, leading to the situation considered in Table 4. The comparison to Table 1 shows that the approach in which nuisance parameters are treated as arbitrary constants is noticeably preferable to the approach in which the incorrect parametric random effects distribution is used.

Suppose, in the notation of Sect. 3, that one models the nuisance parameters as \(\gamma _i = \exp (x_{i}^{T}\theta )\), where the \(x_i\) are covariates that one could have, but did not, measure. If individuals are sampled completely at random from a larger population, it is not unreasonable to treat the covariates as realizations of random variables \(X_i\), assumed to be i.i.d. copies of X, a p-dimensional normally distributed random vector of mean zero and covariance matrix \(\Sigma =Q\Lambda Q^{T}\), where Q is a matrix whose columns are the unit-length eigenvectors of \(\Sigma \). To derive the induced distribution over the \(\gamma _i\), write \(W\triangleq \theta ^{T}X = \theta ^{T}Q\Lambda ^{1/2}V\), where V is a standard normally distributed random vector. We have \(W=\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}\Vert V\Vert _{2}R\), where R is the cosine of the angle between V and \(\Lambda ^{1/2}Q^{T}\theta \), whose density function is given by (Fisher, 1915)

$$\begin{aligned} f_{R}(r)=\frac{\Gamma (p/2)}{\sqrt{\pi }\Gamma \{(p-1)/2\}}(1-r^2)^{(p-3)/2}, \quad -1<r<1, \end{aligned}$$

and \(\Vert V\Vert _{2}^{2}\) is a Chi squared random variable with p degrees of freedom, so that \(D\triangleq \Vert V\Vert _{2}\) has density function

$$\begin{aligned} f_{D}(\delta ) = \frac{\delta ^{p-1}\exp (-\delta ^2/2)}{2^{(p/2)-1}\Gamma (p/2)}, \quad \delta \ge 0. \end{aligned}$$

The characteristic function of W is

$$\begin{aligned} \phi _{W}(t)=E_{R}\{\phi _{D}(\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}t R)\} = E_{D}\{\phi _{R}(\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}t D)\}, \end{aligned}$$

where for any random variable Y, \(\phi _{Y}(t)=E_{Y}(e^{itY})\). Let \(s=\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}t\). Direct calculation gives

$$\begin{aligned} \phi _{W}(t)= & {} K^{-1}\int _{0}^{\infty }\int _{-1}^{1}\exp \{-\delta ^2/2 + is\delta r\}\delta ^{p-1}(1-r^2)^{(p-3)/2} dr d\delta \\\simeq & {} K^{-1} \int _{0}^{\infty }\exp \{-\delta ^2/2\}\delta ^{p-1}\int _{-1}^{1}\exp \{is\delta r -(p/2) r^{2}\} dr d\delta , \quad p\rightarrow \infty \end{aligned}$$

where \(K=\sqrt{\pi }2^{(p/2)-1}\Gamma \{(p-1)/2\}\). Since \(\int _{-1}^{1}\exp \{-(p/2)r^2\}\sin (s\delta r)dr = 0\),

$$\begin{aligned}&\int _{-1}^{1}\exp \{is\delta r -(p/2) r^2\} dr = \int _{-1}^{1}\exp \{-(p/2) r^2\}\cos (s\delta r) dr \\&\quad = \frac{\exp \{-(s\delta )^2/2p\}\sqrt{2\pi }}{p^{1/2}} - \left( \int _{-\infty }^{-1} + \int _{1}^{\infty }\right) \exp \{-(p/2) r^2\}\cos (s\delta r) dr, \end{aligned}$$

and the remainder terms are ignored for \(p\rightarrow \infty \), leading to

$$\begin{aligned} \phi _{W}(t) \simeq \frac{\{1+(s^2/p)\}^{-p/2}\Gamma (p/2)\sqrt{2} }{\Gamma \{(p-1)/2\} p^{1/2}} , \quad p\rightarrow \infty \end{aligned}$$

Using Stirling’s formula in the form \(\Gamma (k+a)/\Gamma (k)\simeq k^{a}\) for large k,

$$\begin{aligned} \phi _{W}(t) \simeq (1+s^2/p)^{-p/2} \simeq e^{-s^2/2} \quad (p\rightarrow \infty ), \end{aligned}$$

where \(e^{-s^2/2} = e^{-(\Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}^{2}/2)t^2}\) is the characteristic function of a centred normal random variable with standard deviation \(\tau \triangleq \Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}\). Under this generating mechanism for the covariates, the \(\gamma _i\) are thus log-normally distributed, with density function

$$\begin{aligned} (\tau \gamma )^{-1} \phi (\log \gamma /\tau ), \end{aligned}$$
(23)

where \(\phi (\cdot )\) is the standard normal density.
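As a quick numerical check of this limiting form: with X multivariate normal, \(\log \gamma =\theta ^{T}X\) has standard deviation \(\tau =\Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}\), so that \(\gamma \) is log-normal as in (23). The dimension, \(\theta \) and covariance matrix below are arbitrary illustrative choices.

```python
# Sketch: empirical standard deviation of log gamma = theta' X against
# tau = (theta' Sigma theta)^{1/2}; illustrative dimension and covariance.
import numpy as np

rng = np.random.default_rng(6)
p = 200
theta = rng.standard_normal(p) / np.sqrt(p)
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                                  # an arbitrary covariance matrix
x = rng.multivariate_normal(np.zeros(p), Sigma, size=50000)
log_gamma = x @ theta

tau = np.sqrt(theta @ Sigma @ theta)                 # equals ||Lambda^{1/2} Q' theta||_2
print("sd of log gamma:", log_gamma.std(), " tau:", tau)
```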

While this formulation is to some extent physically justifiable, the integral (8) does not appear to have an analytic solution when \(f(\gamma )\) is given by (23). This illustrates that random effects models are likely to be driven by mathematical convenience, highlighting the importance of studies of misspecification.

After completing this paper, we were made aware of a related contribution by Lindsay [16]. The work showed that straight maximum likelihood estimation (without preliminary manoeuvres based on the factorizability of the likelihood function) is consistent in a particular class of incidental parameter models, specifically those for which there is a complete sufficient statistic \(S_{i}(\psi )\) for the nuisance parameter \(\lambda _i\), with \(\psi \) treated as fixed. This situation covers the exponential matched pairs problems with multiplicative treatment effect on the rates (Sect. 3.1) and with additive treatment effect on the rates (Sect. 4) but not the exponential matched pairs problem with additive treatment effect on the means. Despite consistency of the maximum likelihood estimator, the standard estimator of the variance of the maximum likelihood estimator is seriously distorted in these settings, the true variance typically being appreciably larger than that based on the supposed inverse Fisher information.

Lindsay considered estimation of the interest parameter by parametric random effects models and showed that the efficiency achievable by the resulting estimator is higher than straight maximum likelihood provided that a reasonable choice of parametric model for the random effects is used, even if this random effects distribution is misspecified. The appropriate conditions are essentially that the parameters of the parametric random effects distribution be orthogonal in the sense of Cox and Reid [9]. Parameter orthogonality arose in our derivations in Sect. 3.2 via Eq. (5) and its derivation in Appendix A.3. Lindsay [16] does not discuss the potential for appreciable loss of efficiency over conditional or marginal likelihood, as opposed to full maximum likelihood, by erroneously assuming a parametric random effects model. This potential loss of efficiency is illustrated by our Eq. (10). The synthesis of Lindsay’s analysis and ours is that, while a random effects formulation can lead to increased precision over straight maximum likelihood even when the random effects distribution is misspecified, provided that the parameters of the random effects distribution are orthogonal to the interest parameters, there is potential for appreciable loss of efficiency over marginal and conditional likelihood when the corresponding factorizations of the likelihood function are available.

7.2 Open problems

Issues connected with an appreciable number of nuisance parameters are likely to arise whenever a relatively complicated model is needed. In principle, analyses similar to those of Sects. 3–5 could be performed for other distributions. See Cox [8] for a binary responses formulation that parallels the proportional rates model of Sect. 3. Our existing work does not, however, generalize readily and the detailed calculation required for other distributional assumptions is likely to be considerable. Nevertheless, some general principles can be extracted from the previous discussion. Let \(\psi \) be an interest parameter and \(\lambda \) be a nuisance parameter. Either or both may be vectors. One starts from an arbitrary pair of observations \((T,C)\), or more generally an arbitrary partition, and makes a bijective transformation \((T,C)\rightarrow (S,R)\) such that one of factorizations (i)–(v) holds, where:

  1. (i)

    \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \lambda )f_{S}(s;\psi )\),

  2. (ii)

    \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \psi )f_{S}(s;\lambda )\),

  3. (iii)

    \(f_{S,R}(s,r; \psi , \lambda )=f_{R}(r; \lambda )f_{S}(s;\psi )\),

  4. (iv)

    \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \lambda , \psi )f_{S}(s;\psi )\),

  5. (v)

    \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \psi )f_{S}(s;\psi ,\lambda )\).

Factorization (i) requires marginalization with S sufficient for \(\psi \), (ii) requires conditioning on S, which is now the sufficient statistic for \(\lambda \). In (iii) the jointly sufficient statistic consists of two independent components, so that conditioning reduces to marginalization. Marginalization is applicable in (iv), in which R|S is sufficient for \(\lambda \), and conditioning in (v), in which S is sufficient for \(\lambda \), but information on \(\psi \) is lost in either case. The exponential proportional rates model and the exponential additive rates model are examples of factorizations (iv) and (v) respectively.

Our suggestion of Sect. 5 provides a unified approach to assessing the joint compatibility of a model and its parameter values with the data, and is justified in any situation for which one of factorizations (i)–(v) holds exactly. An important open question is the construction of appropriate factorizations, exact or approximate, in greater generality. We conclude by an outline of the considerations involved.

For an arbitrary pair \((t,c)\) of jointly sufficient statistics, write the transformation equations as \(s=s(t,c)\) and \(r=r(t,c)\). The transformation is assumed to be bijective so that \(t=t(s,r)\) and \(c=c(s,r)\). For factorizations (i), (iii) or (iv) to be true, we require that \(f_{S}(s;\psi , \lambda ) = f_{S}(s;\psi )\), and similarly for (ii) and (v).

The general form of a solution to \(f_{S}(s;\psi , \lambda ) = f_{S}(s;\psi )\) is to express the unknown density of S in terms of the known joint density of T and C. For instance,

$$\begin{aligned} f_{S}(s;\psi , \lambda ) = \frac{1}{2\pi i} \int _{\tau - i \infty }^{\tau + i\infty } \exp \{z s(t,c)\} T_{\lambda }(z)dz, \end{aligned}$$

where \(\tau \) is anywhere in the strip of convergence of the moment generating function of S and

$$\begin{aligned} T_{\lambda }(z)=\int _{-\infty }^\infty \int _{-\infty }^\infty \exp \{-z s(x,y)\}f_{T,C}(x,y; \psi , \lambda ) dx dy, \quad z\in {\mathbb {C}}. \end{aligned}$$

The only contribution of \(\lambda \) comes from \(T_{\lambda }\), so it is sufficient to choose the function \(s(t,c)\) to make \(T_{\lambda }\) independent of \(\lambda \), identically in z, \(\psi \) and \(\lambda \). It would be enough that independence be achieved only at points z of singularity, but this is more difficult. There results the following integral equation, to be solved for \(s(t,c)\), identically in z, \(\psi \), and \(\lambda \):

$$\begin{aligned} \int _{-\infty }^\infty \int _{-\infty }^\infty \exp \{-z s(t,c)\} \left\{ \frac{\partial }{\partial \lambda }f_{T,C}(t,c; \psi , \lambda ) \right\} dt dc= 0. \end{aligned}$$
(24)

In the exponential matched pair problem with proportional rates (Sect. 3), Eq. (24) becomes

$$\begin{aligned} 0 = \int _{0}^{\infty }\int _{0}^{\infty } \exp \{-z s(t,c)\} \left\{ 2\lambda -\lambda ^2(\psi t + c/\psi )\right\} \exp (-\lambda \psi t)\exp (-\lambda c/\psi ) dt dc.\nonumber \\ \end{aligned}$$
(25)

While it is simple to show that \(s(t,c)=t/c\) verifies Eq. (25), recovering the strategy of Sect. 3.1, a general theory relies on a solution to the integral Eq. (24) when s(tc) is not known a priori.
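A direct numerical check that \(s(t,c)=t/c\) satisfies Eq. (25) is to evaluate the transform \(T_{\lambda }(z)\) at two different values of \(\lambda \) and observe that it is unchanged; the values of z, \(\psi \) and \(\lambda \) below are illustrative.

```python
# Sketch: the Laplace-type transform of T/C under the joint density
# lambda^2 exp{-lambda(psi t + c/psi)} does not depend on lambda,
# verifying Eq. (25) for s(t,c) = t/c; illustrative values only.
import numpy as np
from scipy.integrate import dblquad

def transform(z, psi, lam):
    f = lambda t, c: np.exp(-z * t / c) * lam ** 2 * np.exp(-lam * (psi * t + c / psi))
    val, _ = dblquad(f, 0.0, np.inf, 0.0, np.inf)
    return val

print(transform(1.0, 2.0, 0.5), transform(1.0, 2.0, 3.0))   # equal up to quadrature error
```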

An alternative general formulation to that based on Laplace transforms uses the joint density function of S and R. Specifically, for factorization (i), (iii) or (iv) consider

$$\begin{aligned} f_{S}(s;\psi , \lambda )= & {} \int _{\mathcal {R}} f_{S,R}(s,r;\psi ,\lambda ) dr \\= & {} \int _{\mathcal {R}} f_{T,C}\left\{ t(s,r),c(s,r);\psi , \lambda \right\} |\text {det}\{J_{(T,C)\rightarrow (S,R)}\}| dr, \end{aligned}$$

where \(J_{(T,C)\rightarrow (S,R)}\) is the Jacobian of the transformation \((T,C)\rightarrow (S,R)\). Thus, for the marginal density to be independent of \(\lambda \), we require the solution in t(sr) and c(sr) of the set of partial integro-differential equations:

$$\begin{aligned}&\int _{\mathcal {R}} \left( \frac{\partial t(s,r)}{\partial s}\frac{\partial c(s,r)}{\partial r} - \frac{\partial t(s,r)}{\partial r}\frac{\partial c(s,r)}{\partial s}\right) \frac{\partial }{\partial \lambda } f_{T,C}\left\{ t(s,r),c(s,r);\psi , \lambda \right\} dr = 0, \nonumber \\&\int _{\mathcal {R}} \left( \frac{\partial t(s,r)}{\partial r}\frac{\partial c(s,r)}{\partial s}-\frac{\partial t(s,r)}{\partial s}\frac{\partial c(s,r)}{\partial r}\right) \frac{\partial }{\partial \lambda } f_{T,C}\left\{ t(s,r),c(s,r);\psi , \lambda \right\} dr = 0,\nonumber \\ \end{aligned}$$
(26)

identically in \(\lambda \) and \(\psi \).

In connection with these ideas there are a number of open problems with a differential geometrical bearing:

  1. 1.

    When there are nuisance parameters two approaches are to transform the data and marginalize or condition based on factorizations (i)–(v) above, or to find an interest-respecting orthogonal transformation as in Cox and Reid [9]. It is natural to expect there to be a connection between the two, and for this to be characterizable geometrically.

  2. 2.

    Is there a geometric representation of conditioning to evade nuisance parameters, and if so, how is this different geometrically to conditioning to ensure relevance [1]?

  3. 3.

    Differential geometric treatments of asymptotic inference (e.g. [1,2,3, 5, 13]) hinge on looking locally in a parameter space of fixed dimension as the amount of information becomes so large that interest is focused on a small region. As such, they do not seem directly applicable when the dimension of the parameter space is itself very large, which is the situation considered in the present paper. Is there an extension of these ideas suitable for the incidental parameter problems of the present paper?

The analysis of Sect. 3.2 also hints at a more general analysis of model misspecification. There are important open questions. For instance: when is inference on an interest parameter relatively unaffected by misspecification of the nuisance part of the model? What type of misspecification is the inference robust to and how does this depend on the structure of the model and the loss function used for estimation? In what sense is the inference robust? For instance consistency may be achievable but efficiency lost.