High dimensional nuisance parameters: an example from parametric survival analysis

Parametric statistical problems involving both large amounts of data and models with many parameters raise issues that are explicitly or implicitly differential geometric. When the number of nuisance parameters is comparable to the sample size, alternative approaches to inference on interest parameters treat the nuisance parameters either as random variables or as arbitrary constants. The two approaches are compared in the context of parametric survival analysis, with emphasis on the effects of misspecification of the random effects distribution. Notably, we derive a detailed expression for the precision of the maximum likelihood estimator of an interest parameter when the assumed random effects model is erroneous, recovering simply derived results based on the Fisher information in the correctly specified situation but otherwise illustrating complex dependence on other aspects of the formulation. Methods of assessing model adequacy are given. The results are directly applicable and also illustrate general principles of inference in the presence of a high-dimensional nuisance parameter. Open problems with an information geometrical bearing are outlined.


Introduction
Statistical analysis when the number of unknown parameters is comparable with the number of independent observations may demand modification of maximum-likelihood-based methods [7]. There are comparable difficulties with Bayesian analyses based on high dimensional "flat" priors. For an extreme example from a different perspective, see Stein [19].
Yates [22,23] has discussed these issues in depth both for factorial experiments and also for variety trials in connection with balanced and partially balanced incomplete block designs. His development, powerful and almost explanation free, hinges, especially for incomplete block designs, on the geometry of least squares and the distinction between error-estimating and effect-estimating subspaces. Qualitatively similar forms of argument implicitly underlie the present paper.
Later discussion of these issues has mostly been either in general terms [6, chapter 2], or has approached them from a more decision-oriented perspective (e.g. [20]). In the present paper we show the considerations involved in the context of parametric analysis of matched pair survival data. Matched pair designs leading to a large number of nuisance parameters have been considered in various contexts, in particular by Cox [8], Anderson [4], Lindsay [15], Kumon and Amari [14] and Kartsonaki and Cox [12]. A key aspect is how the potentially large number of nuisance parameters is represented. One representation is through a parametrically specified probability distribution; a second is as a set of unknown constants; a third is as independent and identically distributed random variables with a totally unknown distribution. The consequences of the last two are essentially identical; note that the second would be converted into the third by reordering the data at random. By contrast, if appropriate, the stronger assumptions involved in the parametric random effects formulation lead to formally more precise conclusions. We illustrate the considerations involved with a theoretical and empirical analysis of the effect of misspecification. Assessment of model adequacy is also discussed. The results aim both to be directly applicable and to illustrate general principles.

Issues of formulation
Consider the comparison of two treatments in a matched pair design. For each of n pairs of individuals, one of the pair is a control and the other receives a treatment, leading to observations of survival times for the ith pair represented by random variables (C_i, T_i). We study analyses based on underlying exponential distributions, that is, the observations are in effect the first point events in individual Poisson processes. Study of the systematic variation between treatment and control is in general complicated by variation between pairs.
There are a number of ways to represent this simple situation. We specify them in terms of the rate parameter of the underlying Poisson processes, that is the reciprocal of the exponential means. The two key components specify the relation between C i and T i and the form of the inter-pair variation.
For a given pair, the Poisson rate under the treatment may be a constant multiple of that under the control. Alternatively the two rates may have a constant difference. There are other possibilities such as that the two mean survival times differ by a constant. The first two representations at least have a clear underlying interpretation in terms of a potential generating process and we largely concentrate on those.
In the formulation in terms of ratios, the rate parameters of C_i and T_i are written γ_i/ψ and γ_iψ, and in the additive formulation are written ρ_i − Δ and ρ_i + Δ. Thus γ_i and ρ_i are responsible for the inter-pair variation whereas ψ and Δ are key parameters of interest for understanding the effect of the treatment. There is a clear constraint on the parameter space in the second model, namely |Δ| < ρ_i for every i, and the two representations are in a formal sense rather similar to logistic and additive models for binary data.
To represent in general terms arbitrary systematic variation between pairs of individuals, we treat γ_i or ρ_i either as unknown constants specific to each pair or as realizations of random variables. The conceptual differences are considerable although the numerical implications are often minor when the sample is large.
An approach sometimes used in observational studies for which there is no natural pairing involves matching individuals based on the combination of a large number of background variables into a one-dimensional propensity score [18]. If background variables are available and not too numerous we favour using them directly for detailed interpretation. By contrast, the present paper focusses on situations in which component variables are not separately observed.

Nuisance parameters as arbitrary constants
For the representation involving ratios of rates let Z_i = T_i/C_i, removing dependence on γ_i. The density function at z is

    f_Z(z; ψ) = ψ^2 (1 + ψ^2 z)^{-2},  z > 0.   (1)

Standard maximum likelihood theory based on the marginal distribution of the Z_i applies. In particular, the maximum likelihood estimator of ψ based on (1) is consistent and asymptotically normally distributed with variance given by the inverse of the Fisher information. The Fisher information per observation is

    i(ψ) = (4/3)ψ^{-2}.   (2)

By eliminating the nuisance parameters in this way by marginalization, some information on the interest parameter is in general lost, because (Z_1, . . . , Z_n) = S, say, is not sufficient for ψ. Further discussion of these issues is given in Sect. 7.2. A smaller variance is achievable at the expense of stronger modelling assumptions, as demonstrated in Sect. 3.2.
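As a concrete check of this marginal analysis, the following sketch (not the authors' code; sample size, seed and the per-pair effects are illustrative) simulates matched pairs, forms Z_i = T_i/C_i and solves the score equation implied by the density ψ^2/(1 + ψ^2 z)^2:

```python
# Sketch: marginal inference on psi from Z_i = T_i / C_i, whose density
# psi^2 / (1 + psi^2 z)^2 is free of the pair effects gamma_i.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n, psi = 4000, 2.0
gam = rng.gamma(1.0, 1.0, size=n)           # arbitrary per-pair nuisance effects
T = rng.exponential(1.0 / (gam * psi))      # treated: rate gamma_i * psi
C = rng.exponential(psi / gam)              # control: rate gamma_i / psi
Z = T / C                                   # nuisance-free statistic

def score(p):
    # derivative of the marginal log likelihood: (2/p) * sum((1 - u)/(1 + u)), u = p^2 z
    u = p * p * Z
    return np.sum((1.0 - u) / (1.0 + u))

psi_hat = brentq(score, 0.05, 50.0)         # the MLE solves score = 0
```

With Fisher information (4/3)ψ^{-2} per pair, the standard error of the estimator at n = 4000 is roughly 0.03, so the estimate should sit close to the true value ψ = 2.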

Nuisance parameters as random variables
Instead of regarding the pair effects as constants we now suppose that they are random variables independently gamma distributed with shape parameter α and rate β. Then the joint density function of T_i and C_i at (t, c) is

    f(t, c; ψ, α, β) = α(α + 1)β^α (β + ψt + c/ψ)^{-(α+2)},  t, c > 0.   (3)

The Fisher information matrix per observation can be shown (see Appendix A.2) to be block diagonal with the relevant entry for inference about ψ equal to

    i_{ψψ} = 2(α + 2)(α + 3)^{-1}ψ^{-2}.   (4)

The two limits of this as α → ∞ and α → 0 are 2ψ^{-2} and (4/3)ψ^{-2}, the latter being (2), the Fisher information per observation obtained by treating the nuisance parameters as arbitrary constants. The variance depends on the relative dispersion of the nuisance parameters through α. See Sect. 7.1 for a formulation in terms of unobserved covariates involving a log normal distribution over the γ_i.
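The density and information formula reconstructed above can be checked numerically: under the correctly specified gamma random effects model, the Monte Carlo mean of the squared score with respect to ψ should match 2(α + 2)(α + 3)^{-1}ψ^{-2}. A minimal sketch with illustrative parameter values:

```python
# Sketch: Monte Carlo check of E[(d log f / d psi)^2] against the closed form
# 2(alpha+2)/{(alpha+3) psi^2} under the gamma random effects model.
import numpy as np

rng = np.random.default_rng(2)
n, psi, alpha, beta = 200_000, 2.0, 1.0, 1.0
gam = rng.gamma(alpha, 1.0 / beta, size=n)   # gamma(shape alpha, rate beta) pair effects
T = rng.exponential(1.0 / (gam * psi))
C = rng.exponential(psi / gam)

# score in psi for f(t,c) = alpha(alpha+1) beta^alpha (beta + psi t + c/psi)^(-(alpha+2))
score = -(alpha + 2.0) * (T - C / psi**2) / (beta + psi * T + C / psi)
info_mc = np.mean(score**2)                      # Monte Carlo Fisher information
info_formula = 2.0 * (alpha + 2.0) / ((alpha + 3.0) * psi**2)   # = 0.375 here
```

The score is bounded by (α + 2)/ψ, so the Monte Carlo error at this sample size is of order 10^{-3}.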
Equation (4) shows that the gamma random effects formulation is more efficient than the one in which the nuisance parameters are treated as arbitrary constants, provided that the random effects specification is reasonable. The modelling assumption is more severe, but the following analysis of the misspecified situation shows that, provided ψ is bounded away from zero, the corresponding maximum likelihood estimator ψ̃, obtained by assuming the gamma random effects model of Sect. 3.2, converges almost surely to ψ as n → ∞. Thus ψ̃ remains consistent in spite of an arbitrary degree of misspecification of the assumed random effects distribution.
Let γ_i (i = 1, . . . , n) be independent random variables with an arbitrary density function f(γ). The associated joint distribution of T_i and C_i satisfies the orthogonality relation (5) derived in Appendix A.3: in view of the expressions for the cross partial derivatives of the log likelihood function (Eq. (28) in Appendix A.2), Eq. (5) establishes orthogonality of ψ to α and β whatever the random effects distribution, in the sense that the expected cross partial derivatives of the log likelihood with respect to ψ and each of α and β vanish. The interpretation of the notional parameters α and β under model misspecification is discussed below. The orthogonality justifies consideration of the marginal maximum likelihood estimating equation for ψ.
For any κ > 0 bounded away from zero, evaluate the score for ψ at κ. Under the random effects formulation the summands are independent and identically distributed, and a law of large numbers implies convergence of the normalized score to its expectation. The limiting value of the maximum likelihood estimator, as n → ∞, is the value of κ at which this expectation, written as the difference of two expectations, vanishes. Appendix A.4 shows that the expectations exist and that they are equalized at κ = ψ. Thus ψ̃ is consistent despite the misspecification. An analysis of efficiency is harder. Let g_{θ*} denote the density function of the true joint distribution of (T_i, C_i), where θ* = (λ, ψ) and λ could be a finite or infinite dimensional nuisance parameter, but the proportional rates model of Sect. 2 is assumed so that ψ captures the treatment effect. This joint density is determined by the marginal density function f(γ) of the random effects distribution as

    g_{θ*}(x, y) = ∫_0^∞ γ^2 e^{-γ(ψx + y/ψ)} f(γ) dγ.   (8)

Thus if f is not parameterized, λ is f itself. Let Θ denote the parameter space for the erroneous gamma random effects model and let f_θ(x, y) denote the misspecified joint density function of each (T_i, C_i) at (x, y), given by Eq. (3). Thus we may define θ† = (θ_1, θ_2, θ_3) as the minimizer over Θ of the Kullback–Leibler divergence between the assumed (erroneous) model and the true model, where, from the previous derivations, θ_3 = ψ, the true treatment effect. Thus α = θ_1 and β = θ_2 are the values that minimize the Kullback–Leibler divergence between the assumed (erroneous) model and the true model. By the orthogonality established in (5), a discussion of efficiency requires consideration of the likelihood derivatives only with respect to ψ. In particular, by the established consistency, a mean value expansion and standard arguments, it can be shown that the asymptotic distribution of n^{1/2}(ψ̃ − ψ) is Gaussian of zero mean and variance [E{ℓ_{i,ψψ}(θ†)}]^{-2} E{ℓ_{i,ψ}^2(θ†)}, leading to a variance of R/(R − Q)^2, where R and Q depend in a rather complicated way on the density function f(γ) of the true random effects distribution.
The quantities Q and R are given explicitly in Eq. (10) in terms of integrals against f(γ) involving the exponential integral Ei(x) [17, equation 3.07]; because the γ_i are treated as totally random, these integrals average over the true random effects distribution. In a correctly specified situation R/(R − Q)^2 reduces to the inverse Fisher information. When the random effects are gamma distributed with shape parameter α and rate β, as assumed, Q = 2(α + 2)^{-1}ψ^{-2} and R = 2(α + 3)^{-1}(α + 2)^{-1}ψ^{-2}, so that R/(R − Q)^2 is 2^{-1}(α + 2)^{-1}(α + 3)ψ^2, i.e. the reciprocal of Eq. (4). While formula (10) does not seem amenable to detailed interpretation under misspecification, it serves to illustrate complicated dependence on key aspects of the formulation. Table 4 of Sect. 6.2 shows that the loss of efficiency in the gamma model for random effects can be severe when the sample size is not large and the random effects distribution is misspecified. Thus, while the random effects formulation is in principle always feasible for nuisance parameter problems, the adequacy of the choice of random effects distribution, often made on the basis of mathematical convenience, needs consideration. A discussion in the context of the present example is in Sect. 5.
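The consistency claim can be illustrated numerically: fitting the (wrong) gamma random effects model when the pair effects are in fact log normal still recovers ψ. A sketch under illustrative settings (log-scale τ = 1 here, milder than the τ = 10 used in the simulations of Sect. 6.2; the fitted density is Eq. (3) as reconstructed above):

```python
# Sketch: the gamma random effects MLE of psi remains consistent when the
# true pair effects are log normal, i.e. the random effects law is misspecified.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, psi, tau = 5000, 2.0, 1.0
gam = np.exp(tau * rng.standard_normal(n))   # log normal effects (misspecified model)
T = rng.exponential(1.0 / (gam * psi))
C = rng.exponential(psi / gam)

def negloglik(theta):
    a, b, p = np.exp(theta)                  # log-parameterization keeps a, b, p > 0
    s = b + p * T + C / p
    return -np.sum(np.log(a) + np.log(a + 1.0) + a * np.log(b) - (a + 2.0) * np.log(s))

fit = minimize(negloglik, x0=np.log([1.0, 1.0, 1.5]), method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-8})
psi_tilde = np.exp(fit.x[2])                 # stays near psi despite misspecification
```

The pseudo-true values of (α, β) minimize the Kullback–Leibler divergence, while the ψ-component remains the true treatment effect by the orthogonality argument.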

Exponential matched pairs with additive rates
When the nuisance parameters ρ_i of the additive treatment effects model (see Sect. 2) are treated as arbitrary constants, the inference is based on conditioning on the sufficient statistic for the nuisance parameter in each pair [12]. We extend their results slightly by giving explicit expressions for the conditional and unconditional variances of the estimator. The likelihood contribution from the ith pair is

    (ρ_i^2 − Δ^2) exp{−ρ_i(t + c) − Δ(t − c)}.   (11)

Thus T_i + C_i is sufficient for ρ_i and this leads to inference based on the difference T_i − C_i, or equivalently T_i given the pairwise total T_i + C_i = S_i, say. The density function of S_i at s is

    f_{S_i}(s; ρ_i, Δ) = (ρ_i^2 − Δ^2) Δ^{-1} e^{−ρ_i s} sinh(Δs),  s > 0.   (12)

Some algebra shows that the conditional density function of T_i at t given S_i = s_i is

    f(t | s_i; Δ) = 2Δ e^{−2Δt}/(1 − e^{−2Δ s_i}),  0 < t < s_i.   (13)

Let Δ̂ denote the maximum likelihood estimator of Δ based on the conditional density function (13). The Fisher information for Δ, conditional on S_i = s_i, is

    i(Δ | s_i) = Δ^{-2}[1 − {Δ s_i/sinh(Δ s_i)}^2],   (14)

where Δs/sinh(Δs) < 1 for all s > 0 and lim_{s→0} Δs/sinh(Δs) = 1, so that the conditional Fisher information is non-negative. For planning, the unconditional Fisher information is relevant. This is used for determining the sample size required to achieve a pre-specified conditional efficiency with high probability, and is obtained by averaging (14) over the distribution (12) of S_i; writing q = (ρ_i + Δ)/(2Δ) and changing variables to t = 2Δs leads to an expression (15) in terms of the integral and summation representations of Riemann's generalized zeta function [21, pp. 265–266], from which the unconditional Fisher information (16) follows. Section 6.1 confirms the above calculations by simulation. Among other possibilities, the pair effects might be assumed to have a gamma distribution with shape parameter α and rate β starting at Δ, leading to a joint density function of T_i and C_i at (t, c) given by

    f(t, c; Δ, α, β) = αβ^α e^{−2Δt}(β + t + c)^{-(α+2)}{(α + 1) + 2Δ(β + t + c)}.   (17)

Standard maximum likelihood theory applies when the random effects distribution is correctly specified. An analysis of misspecification of this model is complicated by the fact that the parameters α and β are not orthogonal to Δ under arbitrary misspecification.
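A sketch of the conditional analysis (not the authors' code; it assumes the conditional density 2Δe^{−2Δt}/(1 − e^{−2Δs}) on 0 < t < s reconstructed above, with illustrative pair effects ρ_i = Δ + G_i, G_i gamma distributed, which respects the constraint ρ_i > Δ):

```python
# Sketch: conditional MLE of Delta in the additive rates model, given S_i = T_i + C_i,
# whose conditional likelihood is free of the pair effects rho_i.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
n, delta = 3000, 2.0
rho = delta + rng.gamma(1.0, 1.0, size=n)    # pair effects, constrained to rho_i > delta
T = rng.exponential(1.0 / (rho + delta))     # treated: rate rho_i + delta
C = rng.exponential(1.0 / (rho - delta))     # control: rate rho_i - delta
S = T + C                                    # sufficient for rho_i

def cond_score(d):
    # derivative in Delta of sum(log(2 d) - 2 d T_i - log(1 - exp(-2 d S_i)))
    with np.errstate(over="ignore"):
        tail = 2.0 * S / np.expm1(2.0 * d * S)   # -> 0 when 2 d S_i overflows
    return np.sum(1.0 / d - 2.0 * T - tail)

delta_hat = brentq(cond_score, 1e-3, 60.0)   # conditional MLE solves the score equation
```

The conditional information (14) is bounded by Δ^{-2} per pair, so at these settings the estimator's standard error is a few hundredths.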
Thus a full theoretical analysis of the kind developed in Sect. 3.2 will not be explored for the maximum likelihood estimator Δ̃ based on Eq. (17). However, Table 5 of Sect. 6.2 provides numerical evidence that severe loss of efficiency can result, relative to the version that treats the nuisance parameters as arbitrary constants. Consistency of Δ̃ is also suspect. A referee asked whether there is any mathematically convenient distribution for the nuisance parameters that results in orthogonality of the nuisance parameters to the interest parameter in the additive rates model in spite of possible misspecification. In principle, if the true distribution of T_i and C_i is known and given in terms of parameters (Δ, α, β), say, a reparameterization to (Δ, λ, η), say, can always be found such that λ and η are orthogonal to Δ. This entails solving a pair of differential equations, initially to determine the dependence of α and β on Δ and ultimately to choose λ and η, as detailed by Cox and Reid [9]. However, the coefficients of these differential equations, i*_{αβ}, i*_{β}, etc., are the expectations of the second cross partial derivatives of the assumed log likelihood function, taken with respect to the true model. These expressions differ depending on the form of misspecification. An extension of the ideas of Cox and Reid [9] to accommodate arbitrary misspecification is an important question which demands further study, ideally in full generality.

Assessment of model adequacy
In the above two models, exact tests of model adequacy are available. Sufficiency represents a separation of the information in the data into that relevant for estimating the parameters of a given model and that relevant for assessing the adequacy of the model [6, p. 29]. Suppose that the proportional treatment effect model of Sect. 3 holds. The likelihood contribution from the ith pair is

    γ_i^2 exp{−γ_i(ψt + c/ψ)}.   (18)

From this, for any given ψ, S_i(ψ) = ψT_i + C_i/ψ, say, is sufficient for γ_i and has density function

    f_{S_i(ψ)}(s) = γ_i^2 s e^{−γ_i s},  s > 0,   (19)

i.e., S_i(ψ) is gamma distributed with shape parameter 2 and rate parameter γ_i. The model and an arbitrarily specified parameter value ψ = ψ_0 are jointly compatible with the data if the realization of T_i, say, is not extreme relative to the conditional density function of T_i given S_i(ψ) = s_i(ψ), assuming ψ = ψ_0. The conditional density of T_i at t_i, given S_i(ψ) = s_i(ψ), is

    f(t_i | s_i(ψ)) = ψ/s_i(ψ),  0 < t_i < s_i(ψ)/ψ,   (20)

showing that U_i(ψ) = ψT_i/S_i(ψ) is uniformly distributed on (0, 1). For any hypothesized value ψ_0 of ψ, compatibility of the proportional treatment effects model and ψ_0 with the data corresponds to compatibility of the realizations of U_i(ψ_0) with a uniform distribution on (0, 1) for all i = 1, . . . , n. This is a basis for checking consistency with the proportional model. More specifically, an α-level confidence set using Fisher's [11, section 21.1] test is

    C(α) = {ψ_0 : F(−2 Σ_{i=1}^n log U_i(ψ_0)) ≤ 1 − α},   (21)

where F is the distribution function of a χ^2 random variable with 2n degrees of freedom. If the confidence set is non-empty at a specified level, there are at least some values of ψ_0 for which the proportional treatment effects model is compatible with the data at this level. For sufficiently large sample size, one might treat ψ̂ as fixed and equal to ψ under the null hypothesis that the model is true. The adequacy of this assumption can then be assessed by checking the compatibility of the realizations of U_i(ψ̂) for i = 1, . . . , n with a uniform distribution on (0, 1).
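The pivotal structure can be checked directly: at the true ψ, the variables ψT_i and C_i/ψ are independent exponentials with common rate γ_i, so U_i(ψ) = ψT_i/S_i(ψ) is exactly uniform on (0, 1) and Fisher's combined statistic is exactly χ^2 with 2n degrees of freedom. A sketch with illustrative settings:

```python
# Sketch: exact uniform pivots for adequacy assessment of the proportional rates model.
import numpy as np

rng = np.random.default_rng(5)
n, psi = 2000, 2.0
gam = rng.gamma(1.0, 1.0, size=n)
T = rng.exponential(1.0 / (gam * psi))
C = rng.exponential(psi / gam)

S = psi * T + C / psi                     # sufficient for gamma_i at the true psi
U = psi * T / S                           # exact Uniform(0,1) pivots
fisher_stat = -2.0 * np.sum(np.log(U))    # ~ chi2(2n): mean 2n, sd 2*sqrt(n)
```

Scanning ψ_0 over a grid and retaining the values at which the combined statistic is not extreme reproduces the confidence set construction of Eq. (21).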
The same ideas allow the adequacy of the gamma random effects model to be checked. In particular, for any given ψ, the collection of weighted sums S_i(ψ) for i = 1, . . . , n is sufficient for the nuisance parameters α and β, as can be seen from Eq. (3). One could condition as above.
For sufficiently many pairs, however, a simpler option is available owing to the small number of nuisance parameters in the random effects model. The distribution function of S_i = T_i + C_i under the gamma random effects model, given in Eq. (22), follows from the joint density (3). Since the maximum likelihood estimators α̂, β̂ and ψ̂ are consistent and completely specify the model, for sufficiently many individuals it may often be a reasonable approach to consider these as fixed and equal to the true values α, β and ψ under the null hypothesis that the gamma random effects model is correctly specified. Making this replacement in Eq. (22) and evaluating the distribution function at the points S_i for i = 1, . . . , n leads to approximately standard uniformly distributed points under the null hypothesis, and Fisher's [11, section 21.1] test is applicable. Similar arguments apply to the additive effects model. Section 4 shows that S_i = T_i + C_i is sufficient for the nuisance parameter ρ_i, so that the conditional density of T_i given S_i = s_i is free of ρ_i and is given by Eq. (13). In Sect. 4, this justified estimation of the treatment effect by maximization of the conditional likelihood based on (13). To assess model adequacy it is necessary to condition on the jointly sufficient statistic for all unknown parameters. Thus, as in the proportional rates formulation, one must fix Δ at hypothesized values, leading to a joint assessment of the adequacy of the additive effects model at an arbitrary but given value Δ_0 of the interest parameter. The model and a value Δ_0 are compatible with the data at a particular level if T_1, . . . , T_n are not extreme relative to what would be expected under their joint conditional density assuming Δ = Δ_0. As in the proportional rates model, the conditional distribution functions evaluated at the observed T_i provide uniformly distributed quantities on which Fisher's test can be based. For sufficiently large sample size, one might reasonably treat Δ̂ as fixed and equal to Δ under the null hypothesis that the additive rates model is true and proceed as above using Δ̂ in place of Δ_0 to assess the model.
There are situations where exact tests of model adequacy based on these principles do not seem feasible. One example in the spirit of this work would be an exponential matched pair problem in which the mean survival times of T_i and C_i differ by a constant. In Sect. 7.2, we explain in more general terms how the structure of the inference problem dictates the appropriate strategy.

Fixed nuisance parameters
Throughout the following numerical work ψ = Δ = 2. For several different values of n we generate (γ_i)_{i=1}^n from a gamma distribution with shape α = 1 and rate β = 1, hold them fixed across replications, and generate T_i and C_i as exponential with rates γ_iψ and γ_i/ψ respectively. The sample variance of the marginal maximum likelihood estimator ψ̂_n over the 1000 Monte Carlo replications is reported in the second row of Table 1, with an estimate of its theoretical standard error in the third row. This is based on the χ^2 distribution, with R − 1 degrees of freedom, of the sample variance, where R is the number of Monte Carlo replications. The theoretical variance of ψ̂_n is asymptotically (as n → ∞) the inverse of the Fisher information. Its theoretical value obtained from Eq. (2) is reported below the row of standard errors. The values in the second and fourth rows agree for large n.
We also report the results from fitting a gamma random effects model to (T_i, C_i) for i = 1, . . . , n. Let ψ̃_n denote the corresponding maximum likelihood estimator of ψ. This model is misspecified, but the efficiency of ψ̃_n is high; the misspecification is not severe because the (γ_i)_{i=1}^n are generated from a gamma distribution before being fixed across Monte Carlo replications. In Sect. 6.2 we consider the effect of more severe misspecification of the random effects distribution.
The parameter Δ of the additive rates model is estimated by maximum likelihood based on the conditional density function of T_i^{(AR)} given the realization of T_i^{(AR)} + C_i^{(AR)}; this is Eq. (13). Let Δ̂_n denote this maximum likelihood estimator. The Monte Carlo variance of Δ̂_n is reported in the second row of Table 2, with its estimated theoretical standard error in the third row. The unconditional variance based on Eq. (16) is reported in the fourth row, together with the Monte Carlo average of the conditional variances based on (14) in the fifth row. The two agree to a close approximation and they also agree with the Monte Carlo sample variances for sufficiently large n.

Table 1 The generating process is the proportional rates model with fixed nuisance parameters. Simulated variance of the marginal maximum likelihood estimator and its estimated standard error, the associated inverse Fisher information, and the simulated variance of the maximum likelihood estimator obtained by erroneously assuming gamma random effects.

Table 2 The generating process is the additive rates model with fixed nuisance parameters.

Randomly generated nuisance parameters
The simulation studies are the same as in Sect. 6.1 except that the (γ_i)_{i=1}^n and the (ρ_i)_{i=1}^n are generated anew in each Monte Carlo replication. Thus the models in which these nuisance parameters are treated as arbitrary constants are misspecified. In particular, dependence between both versions of T_i and C_i is induced by the generating mechanism for γ_i and ρ_i. Table 3 contains analogous information to the top three rows of Table 1 for the misspecified case. The theoretically true variances have not been calculated and so are not reported. However, the sample variances are very close to the theoretical asymptotic variances that would obtain if the nuisance parameters were arbitrary constants (cf. the fourth row of Table 1). We also report the Monte Carlo variance of ψ̃_n, now under a correctly specified model, and its theoretical asymptotic variance based on Eq. (4). Comparing the fifth and last rows of Table 3, these agree for sufficiently large n.
To assess the efficiency of ψ̃_n under fairly extreme misspecification of the random effects distribution, we conduct the same experiment but with the (γ_i)_{i=1}^n drawn from a log normal distribution with scale parameter τ = 10. For comparison, the Monte Carlo variances of ψ̂_n are also reported in Table 4. The conclusion from this analysis is that while ψ̂_n, justified under the assumption that the nuisance parameters are arbitrary constants, has a stable variance when the nuisance parameters are drawn from a rather extreme random effects distribution, the variance of ψ̃_n is appreciably larger when the random effects distribution is misspecified in this way.
We now consider the effect of misspecification of the random effects distribution in the additive rates model by comparing the estimator Δ̂ of Sect. 4 to the maximum likelihood estimator Δ̃ obtained by erroneously assuming that the joint density function of T_i and C_i is given by Eq. (17). Rather than being a gamma distribution starting at Δ, the true distribution of the ρ_i is a log normal distribution with scale parameter τ = 10 starting at Δ. Although the theoretical variance of Δ̂ has not been calculated under the random effects formulation, the ones based on Eqs. (16) and (14) are reported in the fourth and fifth rows of Table 5. As before, the estimated standard errors in the third and eighth rows are based on a χ^2 distribution with R − 1 degrees of freedom for the sample variance, where R is the number of Monte Carlo replications.

Assessment of model adequacy in the proportional rates model
To illustrate the ideas in Sect. 5 we consider the data generating process corresponding to Table 1 with ψ = 1. This is the value of ψ that equalizes the distributions of responses for treated individuals and controls. In each of 1000 Monte Carlo replications we calculate U_i(ψ_0) for all ψ_0 between zero and three in increments of 0.01 and for i = 1, . . . , n, with the values of n reported in Table 6. We use the composite of these values to produce a confidence set for ψ as in Eq. (21). Table 6 reports the simulated coverage probabilities of the α-level confidence sets for α ∈ {0.01, 0.05}. While the confidence sets need not be intervals in general, they turned out to be intervals in all our Monte Carlo replications; we therefore report the mean lower and upper boundaries of these confidence intervals, averaged over Monte Carlo replications.

Table 3 The generating process is the proportional rates model with gamma distributed (α = 1, β = 1) random effects.

Table 4 The generating process is the proportional rates model with log normally distributed (τ = 10) random effects.

Table 5 The generating process is the additive rates model with log normally distributed (τ = 10) random effects. The fourth and fifth rows report the theoretical variances based on Eqs. (16) and (14). The seventh row presents the simulated variance of Δ̃, the maximum likelihood estimator of Δ obtained by erroneously assuming a gamma distribution starting at Δ for the random effects.

Table 6 The generating process is the proportional rates model with fixed nuisance parameters.

The interpretation of the numbers in Table 6 is that the proportional rates model with fixed nuisance parameters is compatible with the data at level α for any value of ψ_0 in the confidence set C(α) defined by Eq. (21).

A synthesis with earlier literature
The choice of random effects distribution in Sect. 3.2 was primarily one of mathematical convenience. It coincides with typical usage in applications and raises conceptual issues: (i) To what extent is the random effects formulation a plausible representation of the data generating mechanism? (ii) Are there statistical advantages of assuming a parametric random effects model even if the formulation is physically implausible? (iii) Are there statistical advantages of treating nuisance parameters as arbitrary constants when there is a probabilistic generating mechanism for them?
Our analysis has shown the need to be wary of assumptions made for mathematical convenience with no substantive basis. The following example shows how a different distribution for the random effects may be more plausible, leading to the situation considered in Table 4. The comparison to Table 1 shows that the approach in which nuisance parameters are treated as arbitrary constants is noticeably preferable to the approach in which the incorrect parametric random effects distribution is used. Suppose, in the notation of Sect. 3, that one models the nuisance parameters as log γ_i = θᵀx_i, where the x_i are covariates that one could have, but did not, measure. If individuals are sampled completely at random from a larger population, it is not unreasonable to treat the covariates as realizations of random variables X_i, assumed to be i.i.d. copies of X, a p-dimensional normally distributed random vector of mean zero and covariance matrix Σ = QΛQᵀ, where Q is a matrix whose columns are the unit-length eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues. To derive the induced distribution over the γ_i, write W = θᵀX = θᵀQΛ^{1/2}V, where V is a standard normally distributed random vector. We have W = ‖Λ^{1/2}Qᵀθ‖₂ ‖V‖₂ R, where R is the cosine of the angle between V and Λ^{1/2}Qᵀθ, whose density function is given by Fisher (1915), and ‖V‖₂² is a chi squared random variable with p degrees of freedom, so that D = ‖V‖₂ has the corresponding chi density function. Expanding the characteristic function of W and ignoring the remainder terms as p → ∞ leads to e^{−s²t²/2}, the characteristic function of a centred normal random variable with standard deviation τ = s = ‖Λ^{1/2}Qᵀθ‖₂. Under this generating mechanism for the covariates, the γ_i are thus log-normally distributed, with density function

    f(γ) = (γτ)^{-1} φ{(log γ)/τ},  γ > 0,   (23)

where φ(·) is the standard normal density. While this formulation is to some extent physically justifiable, the integral (8) does not appear to have an analytic solution when f(γ) is given by (23).
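The limiting argument can be checked in simulation: for Gaussian covariates the induced distribution of log γ_i is in fact exactly normal with standard deviation (θᵀΣθ)^{1/2} = ‖Λ^{1/2}Qᵀθ‖₂. A sketch with an arbitrary illustrative Σ and θ:

```python
# Sketch: unobserved Gaussian covariates x_i with log gamma_i = theta' x_i induce
# a log normal distribution over the pair effects gamma_i.
import numpy as np

rng = np.random.default_rng(6)
p, n = 20, 100_000
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                        # an arbitrary (illustrative) covariance matrix
theta = rng.standard_normal(p) / np.sqrt(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
log_gam = X @ theta                        # W = theta' X, normal with variance theta' Sigma theta
tau = np.sqrt(theta @ Sigma @ theta)       # implied log-scale of the log normal law
```

The high-dimensional expansion in the text is needed for non-Gaussian covariates; here normality of W holds exactly, so the empirical mean and standard deviation of log γ should match 0 and τ.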
This illustrates that choices of random effects model are likely to be driven by mathematical convenience, highlighting the importance of studies of misspecification.
After completing this paper, we were made aware of a related contribution by Lindsay [16]. That work showed that straight maximum likelihood estimation (without preliminary manoeuvres based on the factorizability of the likelihood function) is consistent in a particular class of incidental parameter models, namely those for which there is a complete sufficient statistic S_i(ψ) for the nuisance parameter λ_i, with ψ treated as fixed. This situation covers the exponential matched pairs problems with multiplicative treatment effect on the rates (Sect. 3.1) and with additive treatment effect on the rates (Sect. 4), but not the exponential matched pairs problem with additive treatment effect on the means. Despite consistency of the maximum likelihood estimator, the standard estimator of its variance is seriously distorted in these settings, the true variance typically being appreciably larger than that based on the supposed inverse Fisher information.
Lindsay considered estimation of the interest parameter by parametric random effects models and showed that the efficiency achievable by the resulting estimator is higher than straight maximum likelihood provided that a reasonable choice of parametric model for the random effects is used, even if this random effects distribution is misspecified. The appropriate conditions are essentially that the parameters of the parametric random effects distribution be orthogonal in the sense of Cox and Reid [9]. Parameter orthogonality arose in our derivations in Sect. 3.2 via Eq. (5) and its derivation in Appendix A.3. Lindsay [16] does not discuss the potential for appreciable loss of efficiency over conditional or marginal likelihood, as opposed to full maximum likelihood, by erroneously assuming a parametric random effects model. This potential loss of efficiency is illustrated by our Eq. (10). The synthesis of Lindsay's analysis and ours is that, while a random effects formulation can lead to increased precision over straight maximum likelihood even when the random effects distribution is misspecified, provided that the parameters of the random effects distribution are orthogonal to the interest parameters, there is potential for appreciable loss of efficiency over marginal and conditional likelihood when the corresponding factorizations of the likelihood function are available.

Open problems
Issues connected with an appreciable number of nuisance parameters are likely to arise whenever a relatively complicated model is needed. In principle, analyses similar to those of Sects. 3–5 could be performed for other distributions. See Cox [8] for a binary responses formulation that parallels the proportional rates model of Sect. 3. Our existing work does not, however, generalize readily, and the detailed calculation required for other distributional assumptions is likely to be considerable. Nevertheless, some general principles can be extracted from the previous discussion. Let ψ be an interest parameter and λ be a nuisance parameter. Either or both may be vectors. One starts from an arbitrary pair of observations (T, C), or more generally an arbitrary partition, and makes a bijective transformation (T, C) → (S, R) such that one of the following factorizations of the joint density holds:

(i) f_S(s; ψ) f_{R|S}(r | s; λ);
(ii) f_S(s; λ) f_{R|S}(r | s; ψ);
(iii) f_S(s; ψ) f_R(r; λ);
(iv) f_S(s; ψ) f_{R|S}(r | s; ψ, λ);
(v) f_S(s; ψ, λ) f_{R|S}(r | s; ψ).

Factorization (i) requires marginalization, with S sufficient for ψ; (ii) requires conditioning on S, which is now the sufficient statistic for λ. In (iii) the jointly sufficient statistic splits into two independent sufficient statistics, so that conditioning reduces to marginalization. Marginalization is applicable in (iv), in which R given S is sufficient for λ, and conditioning in (v), in which S is sufficient for λ, but information on ψ is lost in either case. The exponential proportional rates model and the exponential additive rates model are examples of factorizations (iv) and (v) respectively.
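Factorization (iv) for the proportional rates model can be illustrated numerically: the marginal distribution of Z = T/C is free of the pair effect γ, with distribution function ψ²z/(1 + ψ²z). A sketch (settings illustrative) comparing empirical quantiles across two very different values of γ:

```python
# Sketch: the marginal law of Z = T/C in the proportional rates model does not
# depend on the pair effect gamma, only on psi.
import numpy as np

rng = np.random.default_rng(7)
n, psi = 100_000, 2.0
qs = np.array([0.25, 0.5, 0.75])

def z_quantiles(gamma):
    T = rng.exponential(1.0 / (gamma * psi), size=n)   # rate gamma * psi
    C = rng.exponential(psi / gamma, size=n)           # rate gamma / psi
    return np.quantile(T / C, qs)

q_small, q_large = z_quantiles(0.1), z_quantiles(10.0)
# theoretical quantiles of Z: q_p = p / {psi^2 (1 - p)} from F(z) = psi^2 z/(1 + psi^2 z)
q_theory = qs / (psi**2 * (1.0 - qs))
```

The two empirical quantile vectors agree with each other and with the γ-free theoretical quantiles, which is the content of the first factor f_S(s; ψ) in (iv).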
Our suggestion of Sect. 5 provides a unified approach to assessing the joint compatibility of a model and its parameter values with the data, and is justified in any situation for which one of factorizations (i)-(v) holds exactly. An important open question is the construction of appropriate factorizations, exact or approximate, in greater generality. We conclude with an outline of the considerations involved.
For an arbitrary pair (t, c) of jointly sufficient statistics, write the transformation equations as s = s(t, c) and r = r(t, c). The transformation is assumed to be bijective, so that t = t(s, r) and c = c(s, r). For factorizations (i), (iii) or (iv) to be true, we require that f_S(s; ψ, λ) = f_S(s; ψ), and similarly for (ii) and (v).
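The requirement f_S(s; ψ, λ) = f_S(s; ψ) can be checked numerically in simple cases. In the following sketch (assuming, as in the example above, the exponential matched-pair model T ~ Exp(γψ), C ~ Exp(γ/ψ)), the statistic S = T/C has distribution function ψ²r/(1 + ψ²r), whatever the value of the nuisance rate γ:

```python
import random

def ratio_sample(psi, gamma, n, seed):
    # One draw of R = T/C per replicate, with T ~ Exp(gamma*psi), C ~ Exp(gamma/psi)
    rng = random.Random(seed)
    return [rng.expovariate(gamma * psi) / rng.expovariate(gamma / psi)
            for _ in range(n)]

def max_cdf_gap(sample, psi):
    # Kolmogorov-type distance between the empirical CDF of R and the
    # gamma-free distribution function F(r) = psi^2 r / (1 + psi^2 r)
    xs = sorted(sample)
    n = len(xs)
    return max(abs((i + 1) / n - psi**2 * x / (1.0 + psi**2 * x))
               for i, x in enumerate(xs))

psi = 1.5
# Two wildly different nuisance values give the same ratio distribution
gap_small = max_cdf_gap(ratio_sample(psi, 0.01, 50000, seed=2), psi)
gap_large = max_cdf_gap(ratio_sample(psi, 100.0, 50000, seed=3), psi)
```

Both empirical distribution functions stay uniformly close to the same γ-free limit, which is the defining property of a statistic suitable for marginal inference on ψ.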
The general form of a solution to f_S(s; ψ, λ) = f_S(s; ψ) is to express the unknown density of S in terms of the known joint density of T and C, for instance by inversion of the moment generating function of S, with τ anywhere in its strip of convergence. The only contribution of λ comes from the factor T_λ, so it is sufficient to choose the function s(t, c) to make T_λ independent of λ, identically in z, ψ and λ. It would be enough for independence to be achieved only at points z of singularity, but this is more difficult. There results an integral equation, to be solved for s(t, c), identically in z, ψ and λ. In the exponential matched pair problem with proportional rates the resulting choice of s(t, c) satisfies this equation identically in λ and ψ.
In connection with these ideas there are a number of open problems with a differential geometrical bearing:
1. When there are nuisance parameters, two approaches are to transform the data and marginalize or condition based on factorizations (i)-(v) above, or to find an interest-respecting orthogonal transformation as in Cox and Reid [9]. It is natural to expect a connection between the two, and for this connection to be characterizable geometrically.
2. Is there a geometric representation of conditioning to evade nuisance parameters, and if so, how does it differ geometrically from conditioning to ensure relevance [1]?
3. Differential geometric treatments of asymptotic inference (e.g. [1-3, 5, 13]) hinge on looking locally in a parameter space of fixed dimension as the amount of information becomes so large that interest is focused on a small region. As such they do not seem directly applicable when the dimension of the parameter space is itself very large, which is the situation considered in the present paper. Is there an extension of these ideas suitable for the incidental parameter problems of the present paper?
The analysis of Sect. 3.2 also hints at a more general analysis of model misspecification.
There are important open questions here. When is inference on an interest parameter relatively unaffected by misspecification of the nuisance part of the model? What type of misspecification is the inference robust to, and how does this depend on the structure of the model and on the loss function used for estimation? In what sense is the inference robust? Consistency, for example, may be achievable while efficiency is lost.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

A.1 Derivation of Eq. (9)
The argmax is unchanged by rescaling and by subtraction of constants. Dividing by $n$ and subtracting $n^{-1}\sum_{i=1}^{n}\log g_{\theta^*}(T_i, C_i)$ shows that the criterion function is an average whose summands are identically distributed and of finite expectation; by the strong law of large numbers, θ̂ therefore converges almost surely to the limit given in Eq. (9).
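The structure of this argument can be illustrated with a deliberately misspecified fit; the following sketch is our own toy example, not a model from the paper. An exponential working model is fitted to Gamma(2, 1) data: the normalized log likelihood is an average of identically distributed summands, and its argmax θ̂ = 1/mean converges almost surely to 1/E[X] = 1/2, the Kullback-Leibler projection of the true distribution onto the working family:

```python
import random

rng = random.Random(7)
# Data truly Gamma(shape=2, scale=1); the working model Exp(rate theta) is misspecified
xs = [rng.gammavariate(2.0, 1.0) for _ in range(200000)]

# The exponential log likelihood n^{-1} sum(log theta - theta * x_i) is maximized
# at theta = 1/mean; by the strong law this tends a.s. to 1/E[X] = 1/2
theta_hat = len(xs) / sum(xs)
```

Rescaling by n and centring changed nothing about the maximizer, which is exactly the step exploited in the derivation above.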

A.2 Derivation of Eq. (4)
The second derivative of the log likelihood for the ith pair with respect to ψ is given in (27), and the two cross-partial derivatives with respect to ψ, ℓ_{i,ψα} and ℓ_{i,ψβ}, are given in (28). The expectations of both terms in (28) are zero because, for any κ, changing variables to z = tψ and z = c/ψ in (29) and (30) shows that the two integrals are equal, so that the terms cancel when taking expectations in (28). It follows that the Fisher information matrix per observation is block diagonal, with the relevant block equal to the negative expectation of (27); this is (4).
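The block diagonality can be checked by Monte Carlo. The sketch below assumes, consistently with the gamma random effects density used in Appendix A.3, that γ ~ Gamma(α, β), T ~ Exp(γψ) and C ~ Exp(γ/ψ); integrating out γ then gives a marginal log density of the form const − (α + 2) log(tψ + c/ψ + β), whose cross-partial derivatives with respect to (ψ, α) and (ψ, β) should average to zero:

```python
import random

def simulate(alpha, beta, psi, n, seed):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        g = rng.gammavariate(alpha, 1.0 / beta)  # Gamma(alpha, rate beta) random effect
        out.append((rng.expovariate(g * psi), rng.expovariate(g / psi)))
    return out

alpha, beta, psi = 2.0, 1.0, 1.5
pairs = simulate(alpha, beta, psi, 200000, seed=11)

# Cross-partials of the marginal log density const - (alpha+2)*log(t*psi + c/psi + beta):
# d2/dpsi dalpha = -(t - c/psi^2)/v and d2/dpsi dbeta = (alpha+2)(t - c/psi^2)/v^2,
# where v = t*psi + c/psi + beta; both have expectation zero by symmetry of t*psi, c/psi
s_pa = s_pb = 0.0
for t, c in pairs:
    v = t * psi + c / psi + beta
    d = t - c / psi**2
    s_pa += -d / v
    s_pb += (alpha + 2.0) * d / v**2
m_pa = s_pa / len(pairs)
m_pb = s_pb / len(pairs)
```

Both sample averages are indistinguishable from zero at Monte Carlo precision, in line with the block-diagonal Fisher information.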

A.3 Derivation of Eq. (5)
Consider j = 1 and let K be the normalizing constant for the joint density of T_i and C_i. Direct calculation shows that the inner integral can be evaluated in closed form, after which the change of variables z = γ(c/ψ + β) gives the required expression. For the companion quantity the inner integral is evaluated similarly; integrating with respect to t and changing variables to z = γ(tψ + β) in the second term gives the corresponding result. The demonstration is analogous for other j ∈ ℕ, the integrals being identical up to the ψ² term that arises from the same changes of variables used above.
If both these integrals exist, the limit of ψ̂ is the unique value of κ that sets I_T = I_C, i.e. κ = ψ. Since the exponential integral Ei(−x) is negative for x > 0, I_T and I_C are both bounded above by (κK)^{-1} ∫₀^∞ f(γ) dγ = (κK)^{-1} < ∞ for all κ bounded away from zero. This justifies the earlier use of the strong law of large numbers. Thus ψ̂ converges almost surely to ψ.
The term e^{γtψ} cancels with e^{−γtψ}, so that the relevant integrals with respect to t can be evaluated directly. Using the previous expression for the integrals of the Ei(z) functions, we obtain T₁ = T₂ = T₃ and E{ℓ²_{i,ψ}} = T₁. In the correctly specified case this is the Fisher information. On replacing f(γ) by β^α γ^{α−1} e^{−γβ}/Γ(α) in the expression for T₁ we obtain the result from Sect. 3.2, namely 2(α + 2)/{ψ²(α + 3)}.
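The closed-form value 2(α + 2)/{ψ²(α + 3)} can be checked by Monte Carlo. Assuming γ ~ Gamma(α, β) with T ~ Exp(γψ) and C ~ Exp(γ/ψ) (our reading of the model; the gamma density appears explicitly in the replacement above), the marginal log density is const − (α + 2) log(tψ + c/ψ + β), giving score ℓ_ψ = −(α + 2)(t − c/ψ²)/(tψ + c/ψ + β), and the empirical mean of ℓ²_ψ should approach the stated expression:

```python
import random

alpha, beta, psi = 2.0, 1.0, 1.5
rng = random.Random(13)
n = 400000
acc = 0.0
for _ in range(n):
    g = rng.gammavariate(alpha, 1.0 / beta)   # Gamma(alpha, rate beta) random effect
    t = rng.expovariate(g * psi)              # T ~ Exp(gamma * psi)
    c = rng.expovariate(g / psi)              # C ~ Exp(gamma / psi)
    # Score of the marginal log density with respect to psi
    score = -(alpha + 2.0) * (t - c / psi**2) / (t * psi + c / psi + beta)
    acc += score * score
mc_info = acc / n
exact_info = 2.0 * (alpha + 2.0) / (psi**2 * (alpha + 3.0))
```

The Monte Carlo average agrees with 2(α + 2)/{ψ²(α + 3)} to within sampling error, corroborating the analytical derivation.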