Abstract
Parametric statistical problems involving both large amounts of data and models with many parameters raise issues that are explicitly or implicitly differential geometric. When the number of nuisance parameters is comparable to the sample size, alternative approaches to inference on interest parameters treat the nuisance parameters either as random variables or as arbitrary constants. The two approaches are compared in the context of parametric survival analysis, with emphasis on the effects of misspecification of the random effects distribution. Notably, we derive a detailed expression for the precision of the maximum likelihood estimator of an interest parameter when the assumed random effects model is erroneous, recovering simply derived results based on the Fisher information in the correctly specified situation but otherwise illustrating complex dependence on other aspects of the formulation. Methods of assessing model adequacy are given. The results are both directly applicable and illustrate general principles of inference when there is a high-dimensional nuisance parameter. Open problems with an information geometrical bearing are outlined.
1 Introduction
Statistical analysis when the number of unknown parameters is comparable with the number of independent observations may demand modification of maximum-likelihood-based methods [7]. There are comparable difficulties with Bayesian analyses based on high dimensional “flat” priors. For an extreme example from a different perspective, see Stein [19].
Yates [22, 23] has discussed these issues in depth both for factorial experiments and also for variety trials in connection with balanced and partially balanced incomplete block designs. His development, powerful and almost explanation free, hinges, especially for incomplete block designs, on the geometry of least squares and the distinction between error-estimating and effect-estimating subspaces. Qualitatively similar forms of argument implicitly underlie the present paper.
Later discussion of these issues has mostly been either in general terms [6, chapter 2], or has approached them from a more decision-oriented perspective (e.g. [20]). In the present paper we show the considerations involved in the context of parametric analysis of matched pair survival data. Matched pair designs leading to a large number of nuisance parameters have been considered in various contexts, in particular by Cox [8], Anderson [4], Lindsay [15], Kumon and Amari [14] and Kartsonaki and Cox [12]. A key aspect is how the potentially large number of nuisance parameters is represented. One possibility is as realizations of a parametrically specified probability distribution. The second is as a set of unknown constants and the third is as independent and identically distributed random variables with totally unknown distribution. The consequences of the last two are essentially identical; note that the second would be converted into the third by reordering the data at random. By contrast, if appropriate, the stronger assumptions involved in the parametric random effects formulation lead to formally more precise conclusions. We illustrate the considerations involved with a theoretical and empirical analysis of the effect of misspecification. Assessment of model adequacy is also discussed. The results aim both to be directly applicable and to illustrate general principles.
2 Issues of formulation
Consider the comparison of two treatments in a matched pair design. For each of n pairs of individuals, one of the pair is a control and the other receives a treatment, leading to observations of survival times for the ith pair represented by random variables \(C_{i}, T_{i}\). We study analyses based on underlying exponential distributions, that is, the observations are in effect the first point events in individual Poisson processes. Study of the systematic variation between treatment and control is in general complicated by variation between pairs.
There are a number of ways to represent this simple situation. We specify them in terms of the rate parameter of the underlying Poisson processes, that is the reciprocal of the exponential means. The two key components specify the relation between \(C_i\) and \(T_i\) and the form of the inter-pair variation.
For a given pair, the Poisson rate under the treatment may be a constant multiple of that under the control. Alternatively the two rates may have a constant difference. There are other possibilities such as that the two mean survival times differ by a constant. The first two representations at least have a clear underlying interpretation in terms of a potential generating process and we largely concentrate on those.
In the formulation in terms of ratios, the rate parameters of \(C_{i}\) and \(T_{i}\) are written \(\gamma _i/\psi \) and \(\gamma _i \psi \), and in the additive formulation are written \(\rho _i - \Delta \) and \(\rho _i + \Delta \). Thus \(\gamma _i\) and \(\rho _i\) are responsible for the inter-pair variation whereas \(\psi \) and \(\Delta \) are key parameters of interest for understanding the effect of the treatment. There is a clear constraint on the parameter space in the second model and the two representations are in a formal sense rather similar to logistic and additive models for binary data.
To represent in general terms arbitrary systematic variation between pairs of individuals we either treat \(\gamma _i\) or \(\rho _i\) as constants, unknown parameters specific to each pair, or as realizations of random variables. The conceptual differences are considerable although the numerical implications are often minor when the sample is large.
An approach sometimes used in observational studies for which there is no natural pairing involves matching individuals based on the combination of a large number of background variables into a one-dimensional propensity score [18]. If background variables are available and not too numerous we favour using them directly for detailed interpretation. By contrast, the present paper focusses on situations in which component variables are not separately observed.
3 Exponential matched pairs with proportional rates
3.1 Nuisance parameters as arbitrary constants
For the representation involving ratios of rates let \(Z_{i}= T_{i}/C_i\), removing dependence on \(\gamma _i\). The density function at z is
\( f_{Z}(z;\psi )=(\psi z+\psi ^{-1})^{-2}, \qquad z>0. \)  (1)
Standard maximum likelihood theory based on the marginal distribution of the \(Z_i\) applies. In particular, the maximum likelihood estimator of \(\psi \) based on (1) is consistent and asymptotically normally distributed with variance given by the inverse of the Fisher information. The Fisher information per observation is
\( i(\psi ) = \tfrac{4}{3}\,\psi ^{-2}. \)  (2)
By eliminating the nuisance parameters in this way by marginalization, some information on the interest parameter is in general lost, because \( (Z_{1},\ldots , Z_{n}) = S\), say, is not sufficient for \(\psi \). Further discussion of these issues is given in Sect. 7.2. A smaller variance is achievable at the expense of stronger modelling assumptions, as demonstrated in Sect. 3.2.
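Both facts are easy to confirm numerically: the ratio density \((\psi z+\psi ^{-1})^{-2}\) is free of \(\gamma _i\), and its information per observation is \(4\psi ^{-2}/3\). The sketch below (illustrative \(\psi =2\); standard-library Simpson quadrature with a change of variable to handle the infinite range) checks that the density integrates to one and recovers the information:

```python
import math

def simpson(g, a, b, n=4000):
    # Composite Simpson's rule; n must be even.
    h = (b - a) / n
    s = g(a) + g(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * g(a + k * h)
    return s * h / 3.0

psi = 2.0  # illustrative value, not from the paper

def f(z):
    # Density of Z = T/C when T, C are exponential with rates gamma*psi, gamma/psi.
    return (psi * z + 1.0 / psi) ** -2

def score_sq_f(z):
    u = -2.0 * (z - psi ** -2) / (psi * z + 1.0 / psi)  # d log f / d psi
    return u * u * f(z)

def on_unit(g):
    # Map z in (0, inf) to x in (0, 1) via z = x/(1-x).
    return lambda x: g(x / (1.0 - x)) / (1.0 - x) ** 2

total = simpson(on_unit(f), 1e-9, 1.0 - 1e-9)        # should be 1
info = simpson(on_unit(score_sq_f), 1e-9, 1.0 - 1e-9)  # should be 4/(3 psi^2)
```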
3.2 Nuisance parameters as random variables
Instead of regarding the pair effects as constants we now suppose that they are random variables independently gamma distributed of shape parameter \(\alpha \) and rate \(\beta \). Then the joint density function of \(T_{i}\) and \(C_{i}\) at (t, c) is
\( f_{T,C}(t,c)=\alpha (\alpha +1)\beta ^{\alpha }(\beta +\psi t+c\psi ^{-1})^{-(\alpha +2)}, \qquad t,c>0. \)  (3)
The Fisher information matrix per observation can be shown (see Appendix A.2) to be block diagonal with the relevant entry for inference about \(\psi \) equal to
\( \frac{2(\alpha +2)}{(\alpha +3)\psi ^{2}}. \)  (4)
The two limits of this as \(\alpha \rightarrow \infty \) and \(\alpha \rightarrow 0\) are \(2\psi ^{-2}\) and \((4/3)\psi ^{-2}\), the latter being (2), the Fisher information per observation obtained by treating the nuisance parameters as arbitrary constants. The variance depends on the relative dispersion of the nuisance parameters through \(\alpha \).
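Both the closed form of Eq. (3) and the two limits of Eq. (4) can be cross-checked numerically. The sketch below (parameter values illustrative) integrates the exponential pair density against the gamma weight and compares with \(\alpha (\alpha +1)\beta ^{\alpha }(\beta +\psi t+c/\psi )^{-(\alpha +2)}\), then checks that the information entry \(2(\alpha +2)\{(\alpha +3)\psi ^{2}\}^{-1}\) interpolates between the quoted limits:

```python
import math

alpha, beta, psi = 1.5, 1.0, 2.0  # illustrative values
t, c = 0.3, 0.8
rate = beta + psi * t + c / psi

def integrand(g):
    # Exponential pair density at (t, c) given g, weighted by the Gamma(alpha, beta) density.
    return g ** (alpha + 1) * math.exp(-g * rate) * beta ** alpha / math.gamma(alpha)

def simpson(f, a, b, n=4000):
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * h)
    return s * h / 3.0

numeric = simpson(integrand, 0.0, 60.0 / rate)
closed = alpha * (alpha + 1) * beta ** alpha * rate ** (-(alpha + 2))

def info(a):
    # Candidate closed form for the psi-entry of the Fisher information, Eq. (4).
    return 2 * (a + 2) / ((a + 3) * psi ** 2)
```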
See Sect. 7.1 for a formulation in terms of unobserved covariates involving a log normal distribution over the \(\gamma _i\).
Equation (4) shows that the gamma random effects formulation is more efficient than the one in which nuisance parameters are treated as arbitrary constants, provided that the random effects specification is reasonable. The modelling assumption is more severe, but the following analysis of the misspecified situation shows that, provided \(\psi \) is bounded away from zero, the corresponding maximum likelihood estimator \({\hat{\psi }}\) obtained by assuming the gamma random effects model of Sect. 3.2, converges almost surely to \(\psi \) as \(n\rightarrow \infty \). Thus \({\hat{\psi }}\) remains consistent in spite of an arbitrary degree of misspecification in the assumed random effects distribution.
Let \(\gamma _i\) (\(i=1,\ldots ,n\)) be independent random variables with an arbitrary density function \(f(\gamma )\). The associated joint distribution of \(T_i\) and \(C_i\) satisfies (see Appendix A.3)
In view of the expressions for the cross partial derivatives of the log likelihood function (Eq. (28) in Appendix A.2), Eq. (5) establishes orthogonality of \(\psi \) to \(\alpha \) and \(\beta \) whatever the random effects distribution. The interpretation of the notional parameters \(\alpha \) and \(\beta \) under model misspecification is discussed below. The orthogonality justifies consideration of the marginal maximum likelihood estimating equation for \(\psi \), i.e.
For any \(\kappa >0\) bounded away from zero, consider
Under the random effects formulation, the summands are independent and identically distributed and a law of large numbers implies convergence of the averages to their expectations. The limiting value of the maximum likelihood estimator, as \(n\rightarrow \infty \), is the value of \(\kappa \) that equalizes the two expectations. Appendix A.4 shows that the expectations exist and the value of \(\kappa \) that equalizes them is \(\psi \). Thus \({\hat{\psi }}\) is consistent despite the misspecification.
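The consistency claim can be illustrated directly. The seeded sketch below (settings illustrative: log normal effects with \(\tau =2\), \(\beta \) fixed at 1) solves the \(\psi \)-component of the gamma-model score with \(\alpha \) and \(\beta \) held fixed; the factor \(\alpha +2\) cancels, and at \(\kappa =\psi \) each summand has zero mean because, given \(\gamma _i\), the pair \((T_{i}, C_{i}/\psi ^{2})\) is exchangeable while the denominator is symmetric in it, whatever the true effects distribution:

```python
import math, random

random.seed(1)
psi_true, beta, tau = 2.0, 1.0, 2.0  # tau controls misspecification severity (illustrative)
n = 50000

pairs = []
for _ in range(n):
    g = math.exp(tau * random.gauss(0.0, 1.0))  # log normal random effect, NOT gamma
    t = random.expovariate(g * psi_true)        # treated: rate gamma * psi
    c = random.expovariate(g / psi_true)        # control: rate gamma / psi
    pairs.append((t, c))

def score(kappa):
    # psi-component of the (misspecified) gamma-model score, alpha and beta held fixed;
    # its root still converges to psi despite the misspecification.
    return sum((t - c / kappa ** 2) / (beta + kappa * t + c / kappa) for t, c in pairs)

lo, hi = 0.2, 20.0  # score is negative at lo, positive at hi
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if score(mid) < 0.0:
        lo = mid
    else:
        hi = mid
psi_hat = 0.5 * (lo + hi)
```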
An analysis of efficiency is harder. Let \(g_{\theta ^*}\) denote the density function of the true joint distribution of \((T_{i},C_{i})\), where \(\theta ^* = (\lambda , \psi )\) and \(\lambda \) could be a finite or infinite dimensional nuisance parameter, but the proportional rates model of Sect. 2 is assumed so that \(\psi \) captures the treatment effect. This joint density is determined by the marginal density function of the random effects distribution \(f(\gamma )\) as
Thus if f is not parameterized, \(\lambda \) is f itself. Let \(\Theta \) denote the parameter space for the erroneous gamma random effects model and let \(f_{\theta }(x,y)\) denote the misspecified joint density function of each \((T_i, C_i)\) at (x, y), given by Eq. (3). Thus we may define \({\hat{\theta }}=({\hat{\alpha }},{\hat{\beta }},{\hat{\psi }})\) by \(\mathop {\mathrm {argmax}}_{v\in \Theta } \sum _{i=1}^{n}\log f_{v}(T_{i},C_{i})\), which converges almost surely (Appendix A.1) to
where, from the previous derivations, \(\theta _3=\psi \), the true treatment effect. Thus \(\alpha =\theta _1\) and \(\beta = \theta _2\) are the values that minimize the Kullback–Leibler divergence between the assumed (erroneous) model and the true model.
By the orthogonality established in (5), a discussion of efficiency requires consideration of the likelihood derivatives only with respect to \(\psi \). In particular, by the established consistency, a mean value expansion and standard arguments, it can be shown that the asymptotic distribution of \({n^{1/2}({\hat{\psi }} - \psi )}\) is Gaussian of zero mean and variance \([E\{\ell _{i,\psi \psi }(\theta )\}]^{-2}E\{\ell _{i,\psi }^{2}(\theta )\}\), leading to a variance of \(R/(R-Q)^{2}\), where R and Q depend in a rather complicated way on the density function \(f(\gamma )\) of the true random effects distribution. Specifically
Here \(\text {Ei}(x)\) is the exponential integral [17, equation 3.07]; thus, in Eq. (10),
and because the \(\gamma _i\) are treated as totally random, \(E(\gamma _i^{\kappa })=\int _{0}^{\infty }\gamma ^\kappa f(\gamma )d\gamma \).
In a correctly specified situation, \(\{E(\ell _{i,\psi \psi })\}^{-2}E(\ell _{i,\psi }^{2})\) is the inverse Fisher information. When the random effects are gamma distributed of parameter \(\alpha \) and rate \(\beta \), as assumed, \(Q=2(\alpha +2)^{-1}\psi ^{-2}\) and \(R=2(\alpha +3)^{-1}(\alpha +2)^{-1}\psi ^{-2}\) so that \(R/(R-Q)^2\) is \(2^{-1}(\alpha +2)^{-1}(\alpha +3)\psi ^2\), i.e. the reciprocal of Eq. (4). While formula (10) does not seem amenable to detailed interpretation under misspecification, it serves to illustrate complicated dependence on key aspects of the formulation.
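The reduction quoted here is a short calculation worth recording. With \(Q=2(\alpha +2)^{-1}\psi ^{-2}\) and \(R=2(\alpha +3)^{-1}(\alpha +2)^{-1}\psi ^{-2}\),

```latex
\begin{aligned}
R - Q &= \frac{2}{(\alpha+2)\psi^{2}}\left(\frac{1}{\alpha+3}-1\right)
       = -\frac{2}{(\alpha+3)\psi^{2}},\\
\frac{R}{(R-Q)^{2}} &= \frac{2}{(\alpha+3)(\alpha+2)\psi^{2}}
       \cdot \frac{(\alpha+3)^{2}\psi^{4}}{4}
       = \frac{(\alpha+3)\psi^{2}}{2(\alpha+2)},
\end{aligned}
```

the reciprocal of the information in Eq. (4), confirming that the sandwich variance reduces to the inverse Fisher information in the correctly specified case.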
Table 4 of Sect. 6.2 shows that the loss of efficiency in the gamma model for random effects can be severe when the sample size is not large and when the random effects distribution is misspecified. Thus, while the random effects formulation is in principle always feasible for nuisance parameter problems, the adequacy of the choice of random effects distribution, often made on the basis of mathematical convenience, needs consideration. A discussion in the context of the present example is in Sect. 5.
4 Exponential matched pairs with additive rates
When the nuisance parameters \(\rho _i\) of the additive treatment effects model (see Sect. 2) are treated as arbitrary constants, the inference is based on conditioning on the sufficient statistic for the nuisance parameter in each pair [12]. We extend their results slightly by giving explicit expressions for the conditional and unconditional variances of the estimator. The likelihood contribution from the ith pair is
\( (\rho _{i}+\Delta )(\rho _{i}-\Delta )\exp \{-(\rho _{i}+\Delta )t_{i}-(\rho _{i}-\Delta )c_{i}\}. \)
Thus \(T_{i}+C_{i}\) is sufficient for \(\rho _i\) and this leads to inference based on the difference \(T_{i}-C_i\), or equivalently \(T_i\) given the pairwise totals \(T_{i}+C_{i}=S_{i}\), say. The density function of \(S_i\) at s is
\( f_{S_{i}}(s)=\frac{(\rho _{i}+\Delta )(\rho _{i}-\Delta )}{2\Delta }\{e^{-(\rho _{i}-\Delta )s}-e^{-(\rho _{i}+\Delta )s}\}, \qquad s>0. \)
Some algebra shows that the conditional density function of \(T_{i}\) at t given \(S_{i}=s_i\) is, for \(\Delta > 0\),
\( f_{T_{i}\mid S_{i}}(t\mid s_{i})=\frac{2\Delta e^{-2\Delta t}}{1-e^{-2\Delta s_{i}}}, \qquad 0<t<s_{i}. \)  (13)
Let \({\hat{\Delta }}\) denote the maximum likelihood estimator of \(\Delta \) based on the conditional density function (13). The Fisher information for \(\Delta \), conditional on \(S_{i}=s_i\), is
\( \sum _{i=1}^{n}\{\Delta ^{-2}-s_{i}^{2}/\sinh ^{2}(\Delta s_{i})\}, \)  (14)
where \(s/\sinh (\Delta s) < \Delta ^{-1}\) for all \(s>0\) and \(\lim _{s\rightarrow 0} \{s/\sinh (\Delta s)\}=\Delta ^{-1}\) so that the conditional Fisher information is non-negative. For planning, the unconditional Fisher information is relevant. This is used for determining the sample size required to achieve a pre-specified conditional efficiency with high probability, and is obtained by replacing the ith summand by
where \(q=(\rho _{i}+\Delta )/(2\Delta )\) and in the last line we have changed variables to \(t=2\Delta s\). The integral and summation representations of Riemann’s generalized zeta function are [21, p265–66]
and the unconditional Fisher information is, from (15),
Section 6.1 confirms the above calculations by simulation.
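The conditional information can also be checked numerically. Under Eq. (13), T given \(S=s\) is an exponential of rate \(2\Delta \) truncated to (0, s), so the conditional information equals the variance of the score, \(4\,\text {var}(T\mid S=s)\), which evaluates to \(\Delta ^{-2}-s^{2}/\sinh ^{2}(\Delta s)\), consistent with the bound quoted above. A numerical sketch (values illustrative):

```python
import math

delta, s = 2.0, 1.5   # illustrative values
lam = 2.0 * delta     # T | S = s has density lam*exp(-lam*t)/(1-exp(-lam*s)) on (0, s)

def simpson(f, a, b, n=2000):
    h = (b - a) / n
    acc = f(a) + f(b)
    for k in range(1, n):
        acc += (4 if k % 2 else 2) * f(a + k * h)
    return acc * h / 3.0

norm = 1.0 - math.exp(-lam * s)
dens = lambda t: lam * math.exp(-lam * t) / norm
m1 = simpson(lambda t: t * dens(t), 0.0, s)
m2 = simpson(lambda t: t * t * dens(t), 0.0, s)
var_t = m2 - m1 * m1

# Conditional Fisher information: Var(score) = 4 Var(T | S = s).
info_numeric = 4.0 * var_t
info_closed = delta ** -2 - (s / math.sinh(delta * s)) ** 2
```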
Among other possibilities, the pair effects might be assumed to have a gamma distribution of parameter \(\alpha \) and rate \(\beta \) starting at \(\Delta \), leading to a joint density function of \(T_{i}\) and \(C_{i}\) at (t, c) given by
Standard maximum likelihood theory applies when the random effects distribution is correctly specified. An analysis of misspecification of this model is complicated by the fact that the parameters \(\alpha \) and \(\beta \) are not orthogonal to \(\Delta \) under arbitrary misspecification. Thus a full theoretical analysis of the kind developed in Sect. 3.2 will not be explored for the maximum likelihood estimator \({\tilde{\Delta }}\) based on Eq. (17). However Table 5 of Sect. 6.2 provides numerical evidence that severe loss of efficiency can result, relative to the version that treats the nuisance parameters as arbitrary constants. Consistency of \({\tilde{\Delta }}\) is also suspect. A referee asked whether there is any mathematically convenient distribution for the nuisance parameters that results in orthogonality of the nuisance parameters to the interest parameter \(\Delta \) in the additive rates model in spite of possible misspecification. In principle, if the true distribution of \(T_i\) and \(C_i\) is known and given in terms of parameters \((\Delta , \alpha , \beta )\), say, a reparameterization to \((\Delta , \lambda , \eta )\), say, can always be found such that \(\lambda \) and \(\eta \) are orthogonal to \(\Delta \). This entails solving the pair of differential equations
\( i^{*}_{\alpha \alpha }\frac{\partial \alpha }{\partial \Delta }+i^{*}_{\alpha \beta }\frac{\partial \beta }{\partial \Delta }=-i^{*}_{\Delta \alpha }, \qquad i^{*}_{\alpha \beta }\frac{\partial \alpha }{\partial \Delta }+i^{*}_{\beta \beta }\frac{\partial \beta }{\partial \Delta }=-i^{*}_{\Delta \beta }, \)
initially to determine the dependence of \(\alpha \) and \(\beta \) on \(\Delta \) and ultimately choosing \(\lambda \) and \(\eta \) as detailed by Cox and Reid [9]. However, in the above display \(i^{*}_{\alpha \beta }\), \(i^*_{\Delta \beta }\), etc. are the expectations of the second cross partial derivatives of the assumed log likelihood function, taken with respect to the true model. These expressions differ depending on the form of misspecification. An extension of the ideas of Cox and Reid [9] to accommodate arbitrary misspecification is an important question which demands further study, ideally in full generality.
5 Assessment of model adequacy
In the above two models, exact tests of model adequacy are available. Sufficiency represents a separation of the information in the data into that relevant for estimating the parameters of a given model and that relevant for assessing the adequacy of the model [6, p.29]. Suppose that the proportional treatment effect model of Sect. 3 holds. The likelihood contribution from the ith pair is
\( \gamma _{i}^{2}\exp \{-\gamma _{i}(\psi t_{i}+c_{i}\psi ^{-1})\}. \)
From this, for any given \(\psi \), \(C_i/\psi + T_i\psi =S_{i}(\psi )\), say, is sufficient for \(\gamma _i\) and has density function
\( f_{S_{i}(\psi )}(s)=\gamma _{i}^{2}\,s\,e^{-\gamma _{i}s}, \qquad s>0, \)
i.e., \(S_{i}(\psi )\) is gamma distributed with shape parameter 2 and rate parameter \(\gamma _i\).
The model and an arbitrarily specified parameter value \(\psi =\psi _0\) are jointly compatible with the data if the realization of \(T_i\), say, is not extreme relative to the conditional density function of \(T_i\) given \(S_{i}(\psi )=s_i(\psi )\), assuming \(\psi =\psi _0\). The conditional density of \(T_i\) at \(t_i\), given \(S_{i}(\psi )=s_i(\psi )\), is
\( \psi /s_{i}(\psi ), \qquad 0<t_{i}<s_{i}(\psi )/\psi , \)
showing that \(\psi T_{i}\mid \{S_{i}(\psi )=s_i(\psi )\}\) is uniformly distributed between 0 and \(s_i(\psi )\).
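This uniformity is distribution-free over the pair effects: given \(\gamma _i\), the variables \(\psi T_i\) and \(C_i/\psi \) are independent exponentials with common rate \(\gamma _i\), so the ratio of one component to the sum, \(\psi T_{i}/S_{i}(\psi )\), is uniform on (0, 1) whatever the distribution of \(\gamma _i\). A seeded simulation sketch (the choice of pair-effect distribution is arbitrary):

```python
import random

random.seed(2)
psi = 2.0
n = 200000
u = []
for _ in range(n):
    g = random.expovariate(1.0)        # pair effect; any positive draw works here
    t = random.expovariate(g * psi)
    c = random.expovariate(g / psi)
    s = psi * t + c / psi              # S_i(psi), Gamma(2, g) distributed
    u.append(psi * t / s)              # should be uniform on (0, 1)

mean_u = sum(u) / n
var_u = sum((x - mean_u) ** 2 for x in u) / n  # should be near 1/12
```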
For any hypothesized value \(\psi _{0}\) of \(\psi \), compatibility of the proportional treatment effects model and \(\psi _0\) with the data corresponds to compatibility of the realizations of \(\psi _{0}T_{i}/s_{i}(\psi _0)=U_{i}(\psi _{0})\), say, with a uniform distribution on (0,1) for all \(i=1,\ldots ,n\). This is a basis for checking consistency with the proportional model. More specifically, an \(\alpha \)-level confidence set using Fisher's [11, section 21.1] test is
where F is the distribution function of a \(\chi ^{2}\) random variable with 2n degrees of freedom. If the confidence set is non-empty at a specified level, there are at least some values of \(\psi _0\) for which the proportional treatment effects model is compatible with the data at this level.
For sufficiently large sample size, one might treat \({\hat{\psi }}\) as fixed and equal to \(\psi \) under the null hypothesis that the model is true. The adequacy of this assumption can then be assessed by checking the compatibility of the realizations of \({\hat{\psi }}T_i/s_i({\hat{\psi }})\) for \(i=1,\ldots ,n\) with a uniform distribution on (0, 1).
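Fisher's combination test is easy to code directly: for even degrees of freedom the \(\chi ^2\) distribution function has a closed form, so no special libraries are needed. The sketch below uses a one-sided p-value; the precise acceptance convention for the confidence set should follow Eq. (21):

```python
import math

def chi2_cdf_even(x, df):
    # Closed-form chi-squared CDF for even df: 1 - exp(-x/2) * sum_{k<df/2} (x/2)^k / k!
    m = df // 2
    term, acc = 1.0, 0.0
    for k in range(m):
        if k > 0:
            term *= (x / 2.0) / k
        acc += term
    return 1.0 - math.exp(-x / 2.0) * acc

def fisher_pvalue(us):
    # Fisher's combination: -2 * sum(log u_i) has a chi-squared distribution
    # with 2n degrees of freedom when the u_i are independent uniforms.
    stat = -2.0 * sum(math.log(u) for u in us)
    return 1.0 - chi2_cdf_even(stat, 2 * len(us))
```

For instance, `fisher_pvalue([u_1, ..., u_n])` applied to the \(U_i(\psi _0)\) gives the combined evidence against joint compatibility of the model and \(\psi _0\).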
The same ideas allow the adequacy of a random effects model to be checked. In particular, for any given \(\psi \), the collection of weighted sums \(S_{i}(\psi )\) for \(i=1,\ldots ,n\) is sufficient for the nuisance parameters \(\alpha \) and \(\beta \), as can be seen from Eq. (3). One could condition as above.
For sufficiently many pairs, however, a simpler option is available due to the small number of nuisance parameters in the random effects model. The distribution function at s of \(S_{i}=T_{i}+C_{i}\) under the gamma random effects model is given by
\( F_{S_{i}}(s)=1-\frac{\psi ^{2}}{\psi ^{2}-1}\Big (\frac{\beta }{\beta +s/\psi }\Big )^{\alpha }+\frac{1}{\psi ^{2}-1}\Big (\frac{\beta }{\beta +\psi s}\Big )^{\alpha } \qquad (\psi \ne 1). \)  (22)
Since the maximum likelihood estimators \({\tilde{\alpha }}\), \({\tilde{\beta }}\) and \({\tilde{\psi }}\) are consistent and completely specify the model, for sufficiently many individuals it may often be a reasonable approach to consider these as fixed and equal to the true values \(\alpha \), \(\beta \) and \(\psi \) under the null hypothesis that the gamma random effects model is correctly specified. Making this replacement in Eq. (22) and evaluating the distribution function at the points \(S_{i}\) for \(i=1,\ldots ,n\) leads to approximately standard uniformly distributed points under the null hypothesis, and Fisher's [11, section 21.1] test is applicable.
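The distribution function of \(S_i\) can be cross-checked by simulation. Mixing the distribution function of a sum of independent exponentials with rates \(\gamma \psi \) and \(\gamma /\psi \) over \(\gamma \sim \text {Gamma}(\alpha ,\beta )\) yields, for \(\psi \ne 1\), the closed form used below (our derivation; parameter values illustrative):

```python
import random

random.seed(3)
alpha, beta, psi = 2.0, 1.5, 2.0  # illustrative values; psi != 1 required

def cdf_s(s):
    # P(T + C <= s) under gamma(shape alpha, rate beta) random effects:
    # mix 1 - [psi^2 e^{-g s/psi} - e^{-g psi s}]/(psi^2 - 1) over g,
    # using E[e^{-g u}] = (beta/(beta+u))^alpha.
    a = (beta / (beta + s / psi)) ** alpha
    b = (beta / (beta + psi * s)) ** alpha
    return 1.0 - (psi ** 2 * a - b) / (psi ** 2 - 1.0)

n = 200000
s0 = 1.2
hits = 0
for _ in range(n):
    g = random.gammavariate(alpha, 1.0 / beta)  # gammavariate takes shape and SCALE
    if random.expovariate(g * psi) + random.expovariate(g / psi) <= s0:
        hits += 1
emp = hits / n  # empirical CDF at s0, to compare with cdf_s(s0)
```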
Similar arguments apply to the additive effects model. Section 4 shows that \(S_i =T_{i}+C_{i}\) is sufficient for the nuisance parameter \(\rho _i\), so that the conditional density of \(T_i\) given \(S_{i}=s_i\) is free of \(\rho _i\) and is given by Eq. (13). In Sect. 4, this justified estimation of the treatment effect \(\Delta \) by maximization of the conditional likelihood based on (13). To assess model adequacy it is necessary to condition on the jointly sufficient statistic for all unknown parameters. Thus, as in the proportional rates formulation, one must fix \(\Delta \) at hypothesized values leading to a joint assessment of the adequacy of the additive effects model at an arbitrary but given value \(\Delta _0\) of the interest parameter. The model and a value \(\Delta _0\) are compatible with the data at a particular level if \(T_1,\ldots , T_n\) are not extreme relative to what would be expected under their joint conditional density assuming \(\Delta =\Delta _0\), i.e.,
\( \prod _{i=1}^{n}\frac{2\Delta _{0}e^{-2\Delta _{0}t_{i}}}{1-e^{-2\Delta _{0}s_{i}}}. \)
As in the proportional rates model, for sufficiently large sample size, one might reasonably treat \({\hat{\Delta }}\) as fixed and equal to \(\Delta \) under the null hypothesis that the additive rates model is true and proceed as above using \({\hat{\Delta }}\) in place of \(\Delta _0\) to assess the model.
There are situations where exact tests of model adequacy based on these principles do not seem feasible. One example in the spirit of this work would be an exponential matched pair problem in which the means of \(T_i\) and \(C_i\) differ by a constant. In Sect. 7.2, we explain in more general terms how the structure of the inference problem dictates the appropriate strategy.
6 Empirical validation and numerical extensions
6.1 Fixed nuisance parameters
Throughout the following numerical work \(\psi =\Delta =2\). For several different values of n we generate \((\gamma _i)_{i=1}^{n}\) from a gamma distribution of shape \(\alpha =1\) and rate \(\beta =1\), and we define \(\rho _i=\Delta +\gamma _i\) so that \(\rho _i-\Delta >0\). The nuisance parameters \((\gamma _i)_{i=1}^{n}\) and \((\rho _i)_{i=1}^{n}\) are then fixed over Monte Carlo replications.
In each of \(R=1000\) Monte Carlo replications, \(T_{i}^{\text {(PR)}}\) and \(C_{i}^{\text {(PR)}}\) (\(i=1,\ldots , n\)) are generated independently from exponential distributions of rates \(\gamma _i\psi \) and \(\gamma _i/\psi \) respectively, and \(T_{i}^{\text {(AR)}}\) and \(C_{i}^{\text {(AR)}}\) are generated from exponential distributions of rates \(\rho _i+\Delta \) and \(\rho _i-\Delta \). The parameter \(\psi \) in the proportional rates model is estimated by maximum likelihood based on the density function of \(T_{i}^{\text {(PR)}}/C_{i}^{\text {(PR)}}\) of Eq. (1). Let \({\hat{\psi }}_{n}\) denote this estimator.
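The experiment for the proportional rates model can be sketched in a few lines (a miniature of the set-up above: R and n reduced for speed, and only \({\hat{\psi }}_{n}\) shown; the asymptotic variance \(3\psi ^{2}/(4n)\) is the inverse of n times the information \(4\psi ^{-2}/3\) of Eq. (2)):

```python
import random

random.seed(4)
psi_true, n, R = 2.0, 200, 300  # reduced from the paper's settings for speed

def mle_psi(z):
    # Solve the score equation of the ratio density (psi*z + 1/psi)^(-2) by bisection;
    # the score sum is negative for small psi and positive for large psi.
    def score(k):
        return sum((zi - k ** -2) / (k * zi + 1.0 / k) for zi in z)
    lo, hi = 0.1, 40.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gammas = [random.expovariate(1.0) for _ in range(n)]  # fixed over replications, as in Sect. 6.1
ests = []
for _ in range(R):
    z = [random.expovariate(g * psi_true) / random.expovariate(g / psi_true) for g in gammas]
    ests.append(mle_psi(z))

mean_est = sum(ests) / R
var_est = sum((e - mean_est) ** 2 for e in ests) / (R - 1)
asymp_var = 3.0 * psi_true ** 2 / (4.0 * n)  # inverse information divided by n
```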
The sample variance of \({\hat{\psi }}_n\) over the 1000 Monte Carlo replications is reported in the second row of Table 1, with an estimate of its theoretical standard error in the third row. This is based on the fact that the sample variance has, after scaling, a \(\chi ^2\) distribution with \(R-1\) degrees of freedom. The theoretical variance of \({\hat{\psi }}_n\) is asymptotically (as \(n\rightarrow \infty \)) the inverse of the Fisher information. Its theoretical value obtained from Eq. (2) is reported below the row of standard errors. The values in the second and the fourth rows agree for large n.
We also report the results from fitting a gamma random effects model to \(T_{i}^{\text {(PR)}},C_{i}^{\text {(PR)}}\) for \(i=1,\ldots ,n\). Let \({\tilde{\psi }}_n\) denote the corresponding maximum likelihood estimator of \(\psi \). This model is misspecified, yet the efficiency of \({\tilde{\psi }}_n\) is high; the misspecification is mild because the \((\gamma _i)_{i=1}^{n}\) are generated from a gamma distribution before being fixed across Monte Carlo replications. In Sect. 6.2, we consider the effect of more severe misspecification of the random effects distribution.
The parameter \(\Delta \) from the additive rates model is estimated using maximum likelihood based on the conditional density function of \(T_{i}^{\text {(AR)}}\) given the realization of \(T_{i}^{\text {(AR)}}+C_{i}^{\text {(AR)}}\). This is Eq. (13). Let \({\hat{\Delta }}_n\) denote this maximum likelihood estimator. The Monte Carlo variance of \({\hat{\Delta }}_n\) is reported in the second row of Table 2, with its estimated theoretical standard error in the third row. The unconditional variance based on Eq. (16) is reported in the fourth row together with the Monte Carlo average of the conditional variances based on (14) in the fifth row. The two agree to a close approximation and they also agree with the Monte Carlo sample variances for sufficiently large n.
6.2 Randomly generated nuisance parameters
The simulation studies are the same as in Sect. 6.1 except that \((\gamma _i)_{i=1}^{n}\) and the \((\rho _i)_{i=1}^{n}\) are generated anew in each Monte Carlo replication. Thus the models in which these nuisance parameters are treated as arbitrary constants are misspecified. In particular, dependence between both versions of \(T_{i}\) and \(C_{i}\) is induced by the generating mechanism for \(\gamma _i\) and \(\rho _i\).
Table 3 contains analogous information to the top three rows of Table 1 for the misspecified case. The theoretically true variances have not been calculated and so are not reported. However, the sample variances are very close to the theoretical asymptotic variances that would obtain if the nuisance parameters were arbitrary constants (cf. fourth row of Table 1). We also report the Monte Carlo variance of \({\tilde{\psi }}_n\), now under a correctly specified model, and its theoretical asymptotic variance based on Eq. (4). Comparing the fifth and last rows of Table 3, these agree for sufficiently large n.
To assess the efficiency of \({\tilde{\psi }}_n\) under fairly extreme misspecification of the random effects distribution, we conduct the same experiment but with the \((\gamma _i)_{i=1}^{n}\) drawn from a log normal distribution with scale parameter \(\tau =10\). For comparison, the Monte Carlo variances of \({\hat{\psi }}_n\) are also reported in Table 4. The conclusion from this analysis is that while \({\hat{\psi }}_{n}\), justified under the assumption that the nuisance parameters are arbitrary constants, has a stable variance when the nuisance parameters are drawn from a rather extreme random effects distribution, the variance of \({\tilde{\psi }}_n\) is appreciably larger when the random effects distribution is misspecified in this way.
We now consider the effect of misspecification of the random effects distribution in the additive rates model by comparing the estimator \({\hat{\Delta }}\) of Sect. 4 to the maximum likelihood estimator \({\tilde{\Delta }}\) obtained by erroneously assuming that the joint density function of \(T_{i}\) and \(C_{i}\) is given by Eq. (17). Rather than being a gamma distribution starting at \(\Delta \), the true distribution of the \(\rho _i\) is a log normal distribution of scale parameter \(\tau =10\) starting at \(\Delta \). Although the theoretical variance of \({\hat{\Delta }}\) has not been calculated under the random effects formulation, the ones based on Eqs. (16) and (14) are reported in the fourth and fifth rows of Table 5. As before, the estimated standard errors in the third and eighth rows are based on a \(\chi ^2\) distribution with \(R-1\) degrees of freedom for the sample variance, where R is the number of Monte Carlo replications.
6.3 Assessment of model adequacy in the proportional rates model
To illustrate the ideas in Sect. 5 we consider the data generating process corresponding to Table 1 with \(\psi =1\). This is the value of \(\psi \) that equalises the distributions of responses for treated individuals and controls. In each of 1000 Monte Carlo replications we calculate \(\psi _0T_i/s_i(\psi _0)=U_i(\psi _0)\) for all \(\psi _0\) between zero and three in increments of 0.01 and for \(i=1,\ldots , n\) with the values of n reported in Table 6. We use the composite of these values to produce a confidence set for \(\psi \) as in Eq. (21). Table 6 reports the simulated coverage probabilities of the \(\alpha \)-level confidence sets for \(\alpha \in \{0.01,0.05\}\). While the confidence sets need not be intervals in general, they turned out to be intervals in all our Monte Carlo replications, thus we report the mean lower and upper boundaries of these confidence intervals, averaged over Monte Carlo replications.
The interpretation of the numbers in Table 6 is that the proportional rates model with fixed nuisance parameters is compatible with the data at level \(\alpha \) for any value of \(\psi _0\) taking values in \({\mathcal {C}}(\alpha )\) defined by Eq. (21).
7 Discussion and open problems
7.1 A synthesis with earlier literature
The choice of random effects distribution in Sect. 3.2 was primarily one of mathematical convenience. It coincides with typical usage in applications and raises conceptual issues: (i) To what extent is the random effects formulation a plausible representation of the data generating mechanism? (ii) Are there statistical advantages of assuming a parametric random effects model even if the formulation is physically implausible? (iii) Are there statistical advantages of treating nuisance parameters as arbitrary constants when there is a probabilistic generating mechanism for them?
Our analysis has shown the need to be wary of assumptions made for mathematical convenience with no substantive basis. The following example shows how a different distribution for the random effects may be more plausible, leading to the situation considered in Table 4. The comparison to Table 1 shows that the approach in which nuisance parameters are treated as arbitrary constants is noticeably preferable to the approach in which the incorrect parametric random effects distribution is used.
Suppose, in the notation of Sect. 3, that one models the nuisance parameters as \(\gamma _i = \exp (x_{i}^{T}\theta )\), where the \(x_i\) are covariates that one could have, but did not, measure. If individuals are sampled completely at random from a larger population, it is not unreasonable to treat the covariates as realizations of random variables \(X_i\), assumed to be i.i.d. copies of X, a p-dimensional normally distributed random vector of mean zero and covariance matrix \(\Sigma =Q\Lambda Q^{T}\), where Q is a matrix whose columns are the unit-length eigenvectors of \(\Sigma \). To derive the induced distribution over the \(\gamma _i\), write \(W\triangleq \theta ^{T}X = \theta ^{T}Q\Lambda ^{1/2}V\), where V is a standard normally distributed random vector. We have \(W=\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}\Vert V\Vert _{2}R\), where R is the cosine of the angle between V and \(\Lambda ^{1/2}Q^{T}\theta \), whose density function is given by (Fisher, 1915)
\( f_{R}(r)=\frac{\Gamma (p/2)}{\sqrt{\pi }\,\Gamma \{(p-1)/2\}}(1-r^{2})^{(p-3)/2}, \qquad -1<r<1, \)
and \(\Vert V\Vert _{2}^{2}\) is a Chi squared random variable with p degrees of freedom, so that \(D\triangleq \Vert V\Vert _{2}\) has density function
\( f_{D}(d)=\frac{2^{1-p/2}}{\Gamma (p/2)}\,d^{p-1}e^{-d^{2}/2}, \qquad d>0. \)
The characteristic function of W is
where for any random variable Y, \(\phi _{Y}(t)=E_{Y}(e^{itY})\). Let \(s=\Vert \theta ^{T}Q\Lambda ^{1/2}\Vert _{2}t\). Direct calculation gives
where \(K=\sqrt{\pi }2^{(p/2)-1}\Gamma \{(p-1)/2\}\). Since \(\int _{-1}^{1}\exp \{-(p/2)r\}\sin (s\delta r)dr = 0\),
the remainder terms being negligible as \(p\rightarrow \infty \), leading to
Using Stirling’s formula in the form \(\Gamma (k+a)/\Gamma (k)\simeq k^{a}\) for large k,
where \(e^{-s^2/2} = e^{-(\Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}^{2}/2)t^2}\) is the characteristic function of a centred normal random variable with standard deviation \(\tau \triangleq \Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}\). Under this generating mechanism for the covariates, the \(\gamma _i\) are thus log-normally distributed, with density function
where \(\phi (\cdot )\) is the standard normal density.
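The conclusion of this derivation can be checked numerically. The following sketch, with an arbitrarily chosen dimension p, eigenvalues \(\Lambda \), orthogonal matrix Q and coefficient vector \(\theta \) (none taken from the paper), confirms that \(W=\theta ^{T}X\) is centred normal with standard deviation \(\tau = \Vert \Lambda ^{1/2}Q^{T}\theta \Vert _{2}\), so that \(\gamma = e^{W}\) is log-normal with log-scale \(\tau \):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20

# Arbitrary covariance Sigma = Q Lambda Q^T and coefficient vector theta.
A = rng.standard_normal((p, p))
Q, _ = np.linalg.qr(A)               # orthonormal eigenvector matrix
lam = rng.uniform(0.5, 2.0, size=p)  # eigenvalues of Sigma
theta = rng.standard_normal(p)

tau = np.linalg.norm(np.diag(np.sqrt(lam)) @ Q.T @ theta)

# Draw X = Q Lambda^{1/2} V with V standard normal, and form W = theta^T X.
n = 400_000
V = rng.standard_normal((n, p))
X = (V * np.sqrt(lam)) @ Q.T         # rows are X_i^T = V_i^T Lambda^{1/2} Q^T
W = X @ theta

# W should be centred normal with standard deviation tau,
# hence gamma = exp(W) is log-normal with log-scale tau.
print(abs(W.std() - tau) / tau)      # small relative error
print(abs(W.mean()) / tau)           # mean near zero
```

The check relies only on exact normality of a linear functional of a Gaussian vector, so agreement holds for any p, not only in the large-p limit traced through the characteristic function argument above.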
While this formulation is to some extent physically justifiable, the integral (8) does not appear to admit a closed-form evaluation when \(f(\gamma )\) is given by (23). This illustrates that random effects specifications are often driven by mathematical convenience, underscoring the importance of studies of misspecification.
After completing this paper, we were made aware of a related contribution by Lindsay [16]. That work showed that straight maximum likelihood estimation (without preliminary manoeuvres based on the factorizability of the likelihood function) is consistent in a particular class of incidental parameter models, namely those for which there is a complete sufficient statistic \(S_{i}(\psi )\) for the nuisance parameter \(\lambda _i\), with \(\psi \) treated as fixed. This situation covers the exponential matched pairs problems with multiplicative treatment effect on the rates (Sect. 3.1) and with additive treatment effect on the rates (Sect. 4), but not the exponential matched pairs problem with additive treatment effect on the means. Despite consistency of the maximum likelihood estimator, the standard estimator of its variance is seriously distorted in these settings, the true variance typically being appreciably larger than that based on the supposed inverse Fisher information.
Lindsay [16] considered estimation of the interest parameter through parametric random effects models and showed that the resulting estimator is more efficient than straight maximum likelihood provided that a reasonable parametric model for the random effects is used, even if this random effects distribution is misspecified. The appropriate conditions are essentially that the parameters of the parametric random effects distribution be orthogonal to the interest parameter in the sense of Cox and Reid [9]. Parameter orthogonality arose in our derivations in Sect. 3.2 via Eq. (5) and its derivation in Appendix A.3. Lindsay [16] does not, however, discuss the potential for appreciable loss of efficiency, relative to conditional or marginal likelihood as opposed to full maximum likelihood, from erroneously assuming a parametric random effects model; this loss is illustrated by our Eq. (10). The synthesis of Lindsay’s analysis and ours is as follows. A random effects formulation can increase precision over straight maximum likelihood even when the random effects distribution is misspecified, provided that the parameters of that distribution are orthogonal to the interest parameters. Nevertheless, there is potential for appreciable loss of efficiency relative to marginal and conditional likelihood when the corresponding factorizations of the likelihood function are available.
7.2 Open problems
Issues connected with an appreciable number of nuisance parameters are likely to arise whenever a relatively complicated model is needed. In principle, analyses similar to those of Sects. 3–5 could be performed for other distributions. See Cox [8] for a binary responses formulation that parallels the proportional rates model of Sect. 3. Our existing work does not, however, generalize readily and the detailed calculation required for other distributional assumptions is likely to be considerable. Nevertheless, some general principles can be extracted from the previous discussion. Let \(\psi \) be an interest parameter and \(\lambda \) be a nuisance parameter. Either or both may be vectors. One starts from an arbitrary pair of observations (T, C), or more generally an arbitrary partition, and makes a bijective transformation \((T,C)\rightarrow (S,R)\) such that one of factorizations (i)–(v) holds, where:
(i) \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \lambda )f_{S}(s;\psi )\),
(ii) \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \psi )f_{S}(s;\lambda )\),
(iii) \(f_{S,R}(s,r; \psi , \lambda )=f_{R}(r; \lambda )f_{S}(s;\psi )\),
(iv) \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \lambda , \psi )f_{S}(s;\psi )\),
(v) \(f_{S,R}(s,r; \psi , \lambda )=f_{R|S}(r|s; \psi )f_{S}(s;\psi ,\lambda )\).
Factorization (i) requires marginalization, with S sufficient for \(\psi \); (ii) requires conditioning on S, which is now the sufficient statistic for \(\lambda \). In (iii) the jointly sufficient statistic comprises two independent components, separately sufficient for \(\psi \) and \(\lambda \), so that conditioning reduces to marginalization. Marginalization is applicable in (iv), in which R|S is sufficient for \(\lambda \), and conditioning in (v), in which S is sufficient for \(\lambda \), but information on \(\psi \) is lost in either case. The exponential proportional rates model and the exponential additive rates model are examples of factorizations (iv) and (v) respectively.
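Factorization (iv) in the exponential proportional rates model can be illustrated by simulation. Assuming the parameterization used in Appendix A.4, \(T \sim \text {Exp}(\gamma \psi )\) and \(C \sim \text {Exp}(\gamma /\psi )\) independently, the statistic \(S=T/C\) satisfies \(\psi ^2 T/C = E_1/E_2\) with \(E_1, E_2\) unit exponentials, so its distribution is free of \(\gamma \); in particular its median is \(\psi ^{-2}\). A minimal sketch with illustrative values of \(\psi \) and \(\gamma \) (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
psi = 1.7          # illustrative value of the interest parameter
n = 200_000

# With T ~ Exp(rate gamma*psi) and C ~ Exp(rate gamma/psi) independently,
# psi^2 * T/C = E1/E2 with E1, E2 unit exponentials -- free of gamma.
meds = {}
for gamma in (0.5, 5.0):
    t = rng.exponential(1.0 / (gamma * psi), size=n)
    c = rng.exponential(psi / gamma, size=n)
    # Median of E1/E2 is 1, so median(psi^2 * T/C) should be 1 for any gamma.
    meds[gamma] = np.median(t / c) * psi**2
    print(gamma, meds[gamma])
```

The same empirical medians obtained for widely separated values of \(\gamma \) reflect the fact that the marginal density of S depends on \(\psi \) alone, which is what makes inference based on S free of the nuisance parameter.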
Our suggestion of Sect. 5 provides a unified approach to assessing the joint compatibility of a model and its parameter values with the data, and is justified in any situation for which one of factorizations (i)–(v) holds exactly. An important open question is the construction of appropriate factorizations, exact or approximate, in greater generality. We conclude with an outline of the considerations involved.
For an arbitrary pair (t, c) of jointly sufficient statistics, write the transformation equations as \(s=s(t,c)\) and \(r=r(t,c)\). The transformation is assumed to be bijective so that \(t=t(s,r)\) and \(c=c(s,r)\). For factorizations (i), (iii) or (iv) to hold, we require that \(f_{S}(s;\psi , \lambda ) = f_{S}(s;\psi )\), and similarly for (ii) and (v).
The general form of a solution to \(f_{S}(s;\psi , \lambda ) = f_{S}(s;\psi )\) is to express the unknown density of S in terms of the known joint density of T and C. For instance,
where \(\tau \) is anywhere in the strip of convergence of the moment generating function of S and
The only contribution of \(\lambda \) comes from \(T_{\lambda }\), so it is sufficient to choose the function s(t, c) to make \(T_{\lambda }\) independent of \(\lambda \), identically in z, \(\psi \) and \(\lambda \). It would be enough that independence be achieved only at points z of singularity, but this is more difficult. There results the following integral equation, to be solved for s(t, c), identically in z, \(\psi \), and \(\lambda \):
In the exponential matched pair problem with proportional rates (Sect. 3), Eq. (24) becomes
While it is simple to show that \(s(t,c)=t/c\) satisfies Eq. (25), recovering the strategy of Sect. 3.1, a general theory relies on a solution to the integral equation (24) when s(t, c) is not known a priori.
An alternative general formulation to that based on Laplace transforms uses the joint density function of S and R. Specifically, for factorization (i), (iii) or (iv) consider
where \(J_{(T,C)\rightarrow (S,R)}\) is the Jacobian of the transformation \((T,C)\rightarrow (S,R)\). Thus, for the marginal density to be independent of \(\lambda \), we require the solution in t(s, r) and c(s, r) of the set of partial integro-differential equations:
identically in \(\lambda \) and \(\psi \).
In connection with these ideas there are a number of open problems with a differential geometrical bearing:
1. When there are nuisance parameters, two approaches are to transform the data and marginalize or condition based on factorizations (i)–(v) above, or to find an interest-respecting orthogonal transformation as in Cox and Reid [9]. It is natural to expect there to be a connection between the two, and for this to be characterizable geometrically.
2. Is there a geometric representation of conditioning to evade nuisance parameters and, if so, how does it differ geometrically from conditioning to ensure relevance [1]?
3. Differential geometric treatments of asymptotic inference (e.g. [1,2,3, 5, 13]) hinge on looking locally in a parameter space of fixed dimension as the amount of information becomes so large that interest is focused on a small region. As such, they do not seem directly applicable when the dimension of the parameter space is itself very large, which is the situation considered in the present paper. Is there an extension of these ideas suitable for the incidental parameter problems of the present paper?
The analysis of Sect. 3.2 also hints at a more general analysis of model misspecification, for which there are important open questions. When is inference on an interest parameter relatively unaffected by misspecification of the nuisance part of the model? To what type of misspecification is the inference robust, and how does this depend on the structure of the model and the loss function used for estimation? In what sense is the inference robust? For instance, consistency may be achievable but efficiency lost.
References
Amari, S.-I.: Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69, 1–17 (1982)
Amari, S.-I., Kumon, M.: Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Stat. Math. 35, 1–24 (1983)
Amari, S.-I.: Differential Geometry of Statistical Inference. Springer, Berlin (1983)
Andersen, E.B.: Asymptotic properties of conditional maximum likelihood estimators. J. R. Stat. Soc. B 32, 283–301 (1970)
Barndorff-Nielsen, O.E., Cox, D.R., Reid, N.M.: The role of differential geometry in statistical theory. Int. Stat. Rev. 54, 83–96 (1986)
Barndorff-Nielsen, O.E., Cox, D.R.: Inference and Asymptotics. Chapman and Hall, London (1994)
Bartlett, M.S.: Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A 160, 268–282 (1937)
Cox, D.R.: Two further applications of a model for binary regression. Biometrika 45, 562–565 (1958)
Cox, D.R., Reid, N.M.: Parameter orthogonality and approximate conditional inference (with discussion). J. R. Stat. Soc. B 49, 1–39 (1987)
Fisher, R.A.: Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–521 (1915)
Fisher, R.A.: Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh (1925)
Kartsonaki, C., Cox, D.R.: Some matched comparisons of two distributions of survival time. Biometrika 103, 219–224 (2016)
Kumon, M., Amari, S.-I.: Geometrical theory of higher-order asymptotics of test, interval estimator and conditional inference. Proc. R. Soc. Lond. Ser. A 387, 429–458 (1983)
Kumon, M., Amari, S.-I.: Estimation of a structural parameter in the presence of a large number of nuisance parameters. Biometrika 71, 445–459 (1984)
Lindsay, B.G.: Nuisance parameters, mixture models, and the efficiency of partial likelihood estimators. Philos. Trans. R. Soc. Lond. 296, 639–665 (1980)
Lindsay, B.G.: Using empirical partially Bayes inference for increased efficiency. Ann. Stat. 13, 914–931 (1985)
Olver, F.W.J.: Introduction to Asymptotics and Special Functions. Academic Press, New York (1974)
Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)
Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Berkeley Symp. Math. Stat. Probab. 1, 197–206 (1956)
Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996)
Whittaker, E.T., Watson, G.N.: A Course of Modern Analysis (1965 reprint of the fourth edition). Cambridge University Press, London (1927)
Yates, F.: Complex experiments (with discussion). J. R. Stat. Soc. B 2, 181–223 (1935)
Yates, F.: A new method of arranging variety trials involving a large number of varieties. J. Agric. Sci. 26, 424–455 (1936)
Acknowledgements
The work was supported by a UK Engineering and Physical Sciences Research Council Fellowship to HSB.
A Derivations of key results
A.1 Derivation of Eq. (9)
The argmax is unchanged by rescaling and subtraction of constants. Dividing by n and subtracting \(n^{-1}\sum _{i=1}^{n}\log g_{\theta ^*}(T_i,C_i)\) shows that
The summands are identically distributed and have finite expectations, and therefore \({\hat{\theta }}\) converges almost surely to
A.2 Derivation of Eq. (4)
The second derivative of the log likelihood for the ith pair with respect to \(\psi \) is
and the two cross-partial derivatives with respect to \(\psi \) are
The expectations of both terms in (28) are zero because, for any \(\kappa \),
Changing variables to \(z=t\psi \) and \(z=c/\psi \) in (29) and (30) shows that both integrals are equal to
so that terms cancel when taking expectations in (28). It follows that the Fisher information matrix per observation is block diagonal with the relevant block equal to the negative expectation of (27), specifically
This is (4).
A.3 Derivation of Eq. (5)
Consider \(j=1\) and let K be the normalizing constant for the joint density of \(T_{i}\) and \(C_i\). Then
Direct calculation shows that the inner integral is
so that, changing variables to \(z=\gamma (c/\psi + \beta )\) gives
Now consider
The inner integral is
Integrating with respect to t and changing variables to \(z=\gamma (t\psi + \beta )\) in the second term gives
The demonstration is analogous for other \(j\in {\mathbb {N}}\), the integrals being identical up to the \(\psi ^2\) term that arises from the same changes of variables used above.
A.4 Proof of consistency of the maximum likelihood estimator
By the argument following Eq. (7), it is required to show that the \(\kappa \) that equates the expectations of \(C_{i}/\{\kappa ^2(T_{i}\kappa + C_{i}/\kappa + \beta )\}\) and \(T_{i}/\{T_{i}\kappa + C_{i}/\kappa + \beta \}\) is \(\kappa =\psi \), and that these expectations exist for any \(\kappa \) and \(\psi \) bounded away from zero.
Consider
where, as before, K is the normalizing constant for the joint density function of \(T_i\) and \(C_i\). Direct calculation shows that the innermost integral is
Similarly,
and the innermost integral is
Changing variables to \(z=(c/\kappa + \beta )\) in (31) and \(s=(\kappa t + \beta )\) in (32) shows that
If both these integrals exist, the limit of \({\hat{\psi }}\) is the unique value of \(\kappa \) that sets \(\text {I}_{T}=\text {I}_{C}\), i.e. \(\kappa = \psi \). Since the exponential integral \(\text {Ei}(-x)\) is negative for \(x>0\), \(\text {I}_{T}\) and \(\text {I}_{C}\) are both bounded above by \((\kappa K)^{-1}\int _{0}^{\infty }f(\gamma )d\gamma = (\kappa K)^{-1}<\infty \) for all \(\kappa \) bounded away from zero. This justifies the previous use of the strong law of large numbers. Thus \({\hat{\psi }}\) converges almost surely to \(\psi \).
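The consistency argument can be illustrated numerically. Under the gamma random effects model, with \(T_i \sim \text {Exp}(\gamma _i\psi )\), \(C_i \sim \text {Exp}(\gamma _i/\psi )\) and \(\gamma _i \sim \text {Gamma}(\alpha , \beta )\), the marginal log-likelihood contribution in a trial value \(\kappa \) is, up to constants, \(-(\alpha +2)\log (\kappa t + c/\kappa + \beta )\), and the root in \(\kappa \) of the averaged score should approach \(\psi \). A sketch with illustrative parameter values (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
psi, alpha, beta = 1.5, 2.0, 1.0
n = 200_000

# Simulate matched pairs with gamma-distributed random effects.
gamma = rng.gamma(alpha, 1.0 / beta, size=n)
t = rng.exponential(1.0 / (gamma * psi))
c = rng.exponential(psi / gamma)

def score(kappa):
    # Averaged derivative in kappa of -(alpha+2)*log(kappa*t + c/kappa + beta),
    # up to the factor -(alpha+2): the estimating equation of Appendix A.4.
    return np.mean((t - c / kappa**2) / (t * kappa + c / kappa + beta))

# Bisection: the score is negative for small kappa and positive for large kappa.
lo, hi = 0.2 * psi, 5.0 * psi
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if score(mid) < 0:
        lo = mid
    else:
        hi = mid
psi_hat = 0.5 * (lo + hi)
print(psi_hat)   # close to psi = 1.5
```

The root of the empirical score equation settles near the true \(\psi \) as n grows, in line with the almost sure convergence established above.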
A.5 Derivation of Eq. (10)
The squared derivative with respect to \(\psi \) of the log likelihood contribution from the ith pair is
Taking expectations, \(E(\ell _{i,\psi }^{2})=\text {T}_{1}+\text {T}_{2}-\text {T}_{3}\), where
Consider \(\text {T}_1\). A change of variables to \(z=(t\psi + c/\psi + \beta )\) leads to
The term \(e^{\gamma t\psi }\) cancels with \(e^{-\gamma t\psi }\) so that the relevant integrals with respect to t are
Integration by parts shows that \(\int z^{b-1}\text {Ei}(-z)dz = b^{-1}\{z^b \text {Ei}(-z) + \Gamma (b,z)\}\) and there is the recursive formula \(\Gamma (b+1,z)=b\Gamma (b,z) + z^b e^{-z}\) so that \(\Gamma (1,\gamma \beta )=e^{-\gamma \beta }\) and
We thus obtain
and an analogous calculation shows that \(\text {T}_{2}=\text {T}_1\).
Consider \(\text {T}_{3}\). The inner integrals are
and
Using the previous expression for the integrals of the \(\text {Ei}(z)\) functions, we obtain
Thus \(\text {T}_{1}=\text {T}_{2}=\text {T}_{3}\) and \(E\{\ell _{i,\psi }^{2}\}=\text {T}_{1}\). In the correctly specified case this is the Fisher information. On replacing \(f(\gamma )\) by \(\beta ^{\alpha }\gamma ^{\alpha -1}e^{-\gamma \beta }/\Gamma (\alpha )\) in the expression for \(\text {T}_{1}\) we obtain the result from Sect. 3.2, namely \(2(\alpha +2)/\{\psi ^2(\alpha +3)\}\).
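The gamma-case value can be checked by simulation. The marginal log-likelihood contribution is, up to constants, \(\ell _i = -(\alpha +2)\log (\psi t + c/\psi + \beta )\), and the sample mean of \(\ell _{i,\psi }^{2}\) should match \(2(\alpha +2)/\{\psi ^2(\alpha +3)\}\). A sketch with illustrative parameter values (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
psi, alpha, beta = 1.5, 2.0, 1.0
n = 400_000

# Simulate matched pairs with gamma-distributed random effects.
gamma = rng.gamma(alpha, 1.0 / beta, size=n)
t = rng.exponential(1.0 / (gamma * psi))
c = rng.exponential(psi / gamma)

# Score: d/dpsi of -(alpha+2)*log(psi*t + c/psi + beta).
l_psi = -(alpha + 2) * (t - c / psi**2) / (t * psi + c / psi + beta)

exact = 2 * (alpha + 2) / (psi**2 * (alpha + 3))
print(np.mean(l_psi**2))   # Monte Carlo estimate of E(l_psi^2)
print(exact)               # 2(alpha+2)/{psi^2(alpha+3)}
```

Because each squared score term is bounded by \((\alpha +2)^2/\psi ^2\), the Monte Carlo average converges quickly and agrees with the closed-form Fisher information to within sampling error.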
For the calculation of \(E(\ell _{i,\psi \psi })\), it is required to calculate the expectations of the terms \(\text {I}_{1}\)–\(\text {I}_{4}\) in Eq. (27) under misspecification. It is clear from their expressions that these expectations are related to the above calculations in the following way: \(E(\text {I}_{4})=(\alpha +2)^{-2}\text {T}_{1}\), \(E(\text {I}_{2})=(\alpha +2)^{-2}\text {T}_{2}\) and \(E(\text {I}_{1})=(\alpha +2)^{-2}\text {T}_{3}\), but \(\text {T}_{1}=\text {T}_{2}=\text {T}_{3}\) so that
The missing expectation is
On writing \(R=(\alpha +2)^{-2}\text {T}_{1}\), it follows that
Under the correct specification of the gamma random effects model we also obtain
as expected.
A.6 Derivation of Eq. (12)
The Laplace transform of the density of \(S_i\) at z is
and the density function of each \(S_{i}\) at s is
where, for a function g(z) of a complex variable \(z\in {\mathbb {C}}\), \(\text {Res}\{g,a\}\) denotes the residue of g at \(z=a\).
Battey, H.S., Cox, D.R. High dimensional nuisance parameters: an example from parametric survival analysis. Info. Geo. 3, 119–148 (2020). https://doi.org/10.1007/s41884-020-00030-6