Abstract
The use of historical, i.e., already existing, estimates in current studies is common in a wide variety of application areas. Nevertheless, despite their routine use, the uncertainty associated with historical estimates is rarely properly accounted for in the analysis. In this communication, we review common practices and then provide a mathematical formulation and a principled frequentist methodology for addressing the problem of drawing inferences in the presence of historical estimates. Three distinct variants are investigated in detail; the corresponding limiting distributions are found and compared. The design of future studies, given historical data, is also explored and relations with a variety of other well-studied statistical problems discussed.
1 Introduction
There are many circumstances in which a statistical analysis either requires, or can greatly benefit from, the use of historical, that is existing, information. In this paper we focus on the situation where the historical information consists of parameter estimates. These may be essential for model fitting but impossible, and/or very expensive, to collect in the context of the current study. Although related, we will not explicitly discuss the large literature on data combination schemes or other two-stage plug-in methods. An example of the former is Ridder and Moffitt's (2007) comprehensive review of methodologies for data combination common in econometrics, whereas an example of the latter is Genest et al.'s (1995) seminal paper on two-step semi-parametric inference in copula models. These and similar problems have received considerable attention and have a long history in Statistics, see, e.g., Cochran (1954).
Historical estimates are used in a variety of applications in the social, physical, and biomedical sciences. For example, some models for the spread of infectious diseases, such as the SIR model (Becker 2017) popularized in connection with COVID-19, e.g., Cooper et al. (2020), require the input of age-specific transmission parameters which can be estimated from social contact networks (Edmunds et al. 1997; Wallinga et al. 2006) and then used to fit epidemic models (Mossong et al. 2008; Goeyvaerts et al. 2010; Yaari et al. 2018). Another interesting application is the optimization of cancer treatment where Kronik et al. (2010) develop a framework for predicting the outcome of prostate cancer immunotherapy by fitting personalized mathematical models. Their model consists of a set of differential equations whose behavior is governed by a collection of parameters, some of which are global parameters while others are subject specific. The values of the global parameters were obtained from at least ten different published studies, see their Table 2, whereas the subject level parameters were estimated by fitting a model to each participant assuming that the global parameters were estimated without error. See Kogan et al. (2012) and Kozłowska et al. (2018) for similar applications. It is worth noting that the applications above may be viewed as models for situations in which knowledge collected in one setting, experimental or observational, is used to estimate quantities arising in a different experiment; such situations are quite common in the biomedical sciences. See Lee and Zelen (1998) and Davidov and Zelen (2004) for a similar structure arising in the planning of early detection programs.
Another very important application in which historical estimates are used is clinical trials. Consider, for example, the situation in which the effect of a combination of treatments is assessed (e.g., Tamma et al. 2012; Kanda et al. 2016). In such cases there exists a collection of therapies which have been independently proven to be somewhat successful at treating a medical condition. The objective of a new study may then be to assess whether a combination of these therapies provides an even better outcome. In the simplest case, one may view this problem as a three-armed clinical trial comparing treatments \(\varvec{A}\), \(\varvec{B}\) and \(\varvec{A+B}\) in which historical estimates on the efficacy of treatments \(\varvec{A}\) and \(\varvec{B}\) already exist. An important example of such situations is the Food and Drug Administration (2006) guidelines for submitting applications for approval of fixed dose combinations, i.e., co-packaged drug products, of previously approved antiretrovirals for the treatment of HIV. In particular, Attachment A of the aforementioned document considers the scenario in which a non-innovator, i.e., a generic drug company, wants to obtain approval for a combination of already approved ingredients. In this case, only efficacy data for the combination needs to be submitted. We will revisit and thoroughly analyze two forms of this example later on. More broadly, the use of historical data in the context of clinical trials has been investigated by numerous researchers using multiple perspectives, cf., Pocock (1976), Peto et al. (1976), Neuenschwander et al. (2010), Viele et al. (2014), and Piantadosi (2017) among many others. As noted by a referee, a particularly relevant class of designs are platform trials, which allow adding new treatments to the experiment so that controls may become non-concurrent, cf. Lee and Wason (2020) and Roig et al. (2022).
The use of historical estimates is also widespread in the social sciences. For example, in the fitting of some econometric models researchers may use values estimated from previously collected survey data. We point to the paper of Newey et al. (2005) which focuses on the asymptotic bias of the estimated parameters. The complexity of using historical estimates in the social sciences is further illustrated by the work of Tasseva (2019). In a microsimulation study investigating the effect of the recent expansion in higher education in Great Britain on household inequalities, previously obtained estimates from the Family Resources Survey for Great Britain (GOV.UK 2019) were used. While sampling variability could be taken into account using bootstrap methods, as noted by the author, measurement error, inevitably present in income information collected in surveys, see, e.g., Moore et al. (2000), could not be accounted for using this method. Similarly, Douidich et al. (2016) describe an imputation-related method for incorporating estimates obtained in labor force surveys (which are easily and cheaply conducted) into household expenditure surveys (which are much more time consuming and expensive) in order to estimate poverty rates in Morocco. Likewise, demographic model fitting and projections rely on historical data. The standard method of population projections (see United Nations 2014) is based on the combination of cohort survival rates, i.e., historical data, with current data on cohort sizes. Raftery et al. (2014) proposed a Bayesian approach to take the uncertainty associated with historical data into account. It is worth noting that in this case the uncertainty accounted for by Bayesian modeling did not come from observational errors but rather from the fact that the true population figures may have changed over time.
Researchers often do not adequately account for the variability of historical estimates when incorporating them into a current analysis. In fact, the practice of plugging in the estimated values for certain parameters is widespread. However, this practice is often not disclosed, as many practitioners view this strategy as a natural way of “doing things”. Consequently, the objectives and contributions of this communication are twofold: first, we draw attention to current practice, and second, and more importantly, we provide a principled methodology for incorporating historical estimates into a current analysis. Surprisingly, despite the ubiquity of historical data and estimates and the many papers that touch on various aspects thereof, a general methodology for the use of historical estimates, as given here, has thus far been lacking. In particular, we consider two broad settings in which historical estimates are employed; three estimators are presented in Sect. 3 and one in Sect. 4.1. Their limiting distributions are found and a theoretical analysis comparing their precision is conducted. When comparable, a preference order among the different approaches is established. Our paper goes beyond the existing literature by providing an inventory of ways in which historical estimates can be used, and by quantifying the properties of the resulting estimators. We also demonstrate how these results may be used in the design of experiments.
The paper is organized as follows. Our notation and formulation are outlined in Sect. 2. Section 3 provides our main theoretical findings which include the limiting distributions of the estimates in the presence of historical estimates and a comparison thereof. In Sect. 4 two applications are described in conjunction with accompanying numerical experiments. The first application addresses the two-way analysis of variance (ANOVA) problem introduced in Sect. 2. The second, related application, deals with a drug interaction study within the framework of Bliss-independence (Bliss 1939), an old concept which has garnered much recent attention. We conclude with a discussion in Sect. 5. All proofs are collected in an Appendix.
2 Notation and formulation
Consider a designed experiment or observational study, denoted by \(\mathcal {S}\), in which data \(\mathcal {D}\) consisting of n observations are collected. Usually, the observations are independent and identically distributed. Suppose further that the model describing the distribution of \(\mathcal {D}\) is indexed by \(\varvec{\omega }^T=(\varvec{\theta }^T,\varvec{\eta }^T)\) where \(\varvec{\theta }\in \mathbb {R}^{p}\) and \(\varvec{\eta }\in \mathbb {R}^{q}\) is the concatenation of \(\varvec{\eta }_{1}\in \mathbb {R}^{q_1},\ldots ,\varvec{\eta }_{K}\in \mathbb {R}^{q_K}\) with \(q=q_1+\cdots +q_K\). Let \(\varvec{\varPhi }(\varvec{\omega })\) be some function of the model parameters which is of interest to the researchers. Clearly, \(\varvec{\varPhi }(\varvec{\omega })\) may be a function of \(\varvec{\theta }\) alone, \(\varvec{\eta }\) alone or of both \(\varvec{\theta }\) and \(\varvec{\eta }\). The primary goal of the study \(\mathcal {S}\), which we refer to as the current study, is inference on \(\varvec{\varPhi }(\varvec{\omega })\) in the presence of historical data, which we view as a collection of K independent estimates \(\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K}\) obtained from historical studies \(\mathcal {S}_1,\ldots ,\mathcal {S}_K\) of sizes \(m_1,\ldots ,m_K\); here \(m= m_1+\cdots +m_K\) denotes the total sample size in the historical studies.
In some circumstances, it may not be possible to estimate \(\varvec{\omega }\) using the data \(\mathcal {D}\). However, if \(\varvec{\eta }\) were known in advance then it would be possible to estimate \(\varvec{\theta }\). As an example, such a situation would arise if the model \(f(\cdot ;\varvec{\theta },\varvec{\eta })\) is not identifiable whereas the model \(f(\cdot ;\varvec{\theta },\varvec{\eta }_0)\) is identifiable for every fixed value of \(\varvec{\eta }_0\). In other circumstances, given the data \(\mathcal {D}\), both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable (e.g., Peddada et al. 2007). Thus, in this communication we consider two distinct settings, the second of which has two variants. In the first setting, referred to as a Type I Problem, only the parameter \(\varvec{\theta }\) is estimable using the data \(\mathcal {D}\), while \((\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K})\) are fixed at their historical estimated values \((\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K})\). In the second setting, referred to as a Type II Problem, both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable using \(\mathcal {D}\) and a two-step procedure is utilized to estimate \(\varvec{\theta }\) while updating the estimates for \((\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K})\). It may also happen that the available data correspond to a Type II Problem and, although a Type II analysis would be possible, the researcher may decide to conduct a Type I analysis, i.e., estimate \(\varvec{\theta }\) as if the data came from a Type I Problem. One of our results shows that this is an inferior strategy, i.e., if the data \(\mathcal {D}\) identifies \(\varvec{\omega }\) it is always advisable to re-estimate \(\varvec{\eta }\); the loss of precision is quantified in terms of a simple decomposition of the variance matrices of the resulting estimators.
It is also important to emphasize that there are situations in which the investigator, by means of the design of the study \(\mathcal {S}\), may control whether the problem is of Type I or Type II.
To fix ideas consider the two-way ANOVA model in which the expected value of an outcome Y is given by

\( \mathbb {E}(Y) = \eta _{0} + \eta _{1}T_{1} + \eta _{2}T_{2} + \theta T_{1}T_{2}, \)   (1)
where for \(i=1,2\), \(T_{i}\in \{0,1\}\) indicates whether treatment i is administered. Here \(\eta _{0}\) denotes the mean of Y when neither treatment is administered, \(\eta _{i}\) models the marginal increase in the expectation of Y when treatment i is administered and \(\theta \) models the interaction \(T_{1} \times T_{2}\). Suppose now that the historical data consists of two studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) of sizes \(m_1\) and \(m_2\), respectively, where in the study \(\mathcal {S}_i\) treatment i was compared with a control. Clearly the historical data provides no information on \(\theta \). Thus inference on \(\theta \) would require a new study \(\mathcal {S}\) in which \(T_{1} = T_{2} = 1\) for some subset of the observations. For simplicity, interchangeability is assumed, i.e., all experimental units, in \(\mathcal {S}_1\) and \(\mathcal {S}_2\) as well as \(\mathcal {S}\), are assumed to be drawn from the same population, e.g., Peddada et al. (2007), and therefore any change in the mean response may be attributed solely to the treatment combination received. The assumption of interchangeability may be relaxed as discussed in Sect. 5.
One objective of this communication is to provide a methodology for effective design and analysis of a new study \(\mathcal {S}\) of size n which allows the estimation of \(\theta \) and utilizes the historical estimates of \((\eta _{0},\eta _{1},\eta _{2})\) obtained from \(\mathcal {S}_1\) and \(\mathcal {S}_2\). Depending on its objectives, the study \(\mathcal {S}\) may be of various forms. For example, one may choose to allocate all n observations to receive both treatments, i.e., \(T_1=T_2=1\). In this case, the data \(\mathcal {D}\) is an IID sample of observations with mean \(\eta _0+\eta _1+\eta _2+\theta \) and variance \(\sigma ^2\). Although the parameter \(\theta \) is not identifiable from \(\mathcal {D}\) alone it is estimable given the historical data, so this is clearly a Type I Problem. Alternatively, if \(\mathcal {S}\) allocates observations to all treatment combinations then \(\theta \) as well as \((\eta _0,\eta _1,\eta _2)\) are estimable from \(\mathcal {S}\) and this falls within the framework of a Type II Problem. This example will be further analyzed in Sect. 4.1.
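The Type I design just described can be illustrated with a minimal simulation sketch. All parameter values and study sizes below are hypothetical; the historical studies each compare one treatment with a control, and the current study allocates every unit to the combination \(T_1=T_2=1\), so that \(\theta \) is recovered only by subtracting the historical estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
eta0, eta1, eta2, theta, sigma = 1.0, 0.5, 0.8, -0.3, 1.0  # hypothetical values
m1 = m2 = 400   # historical study sizes (per arm)
n = 400         # current study size

# Historical study S_i: treatment i vs. control, m units per arm.
def historical(eta_i, m):
    control = rng.normal(eta0, sigma, m)
    treated = rng.normal(eta0 + eta_i, sigma, m)
    return control.mean(), treated.mean() - control.mean()

eta0_hat_1, eta1_hat = historical(eta1, m1)
eta0_hat_2, eta2_hat = historical(eta2, m2)
eta0_hat = (eta0_hat_1 + eta0_hat_2) / 2   # pool the two control arms

# Type I design: all n current units receive both treatments, so theta is
# identified only through the historical estimates of (eta0, eta1, eta2).
y = rng.normal(eta0 + eta1 + eta2 + theta, sigma, n)
theta_bar_A = y.mean() - eta0_hat - eta1_hat - eta2_hat
print(theta_bar_A)   # consistent for theta, but inherits the historical noise
```

The variance of `theta_bar_A` reflects both the current sample and the three plugged-in historical estimates, which is precisely the penalty quantified in Sect. 3.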
3 Results
Our main theoretical findings, i.e., Theorems 3.1, 3.2, and 3.4 describe the limiting distributions of estimators for \(\varvec{\omega }\) which are then compared in Theorems 3.3, 3.5 and 3.6. Remark 7 provides a brief summary of the results of this Section.
3.1 Type I problems
Suppose first that we are in the setting of a Type I Problem. Recall that in such circumstances only \(\varvec{\theta }\) is estimated while \((\varvec{\eta }_1,\ldots ,\varvec{\eta }_K)\) are fixed at their historical values. Thus, let \(\bar{\varvec{\theta }}_{A}\) solve

\( \varvec{\varPsi }(\varvec{\theta },\widehat{\varvec{\eta }})=\varvec{0}, \)   (2)

where \(\widehat{\varvec{\eta }}=(\widehat{\varvec{\eta }}_{1}^{T},\ldots ,\widehat{\varvec{\eta }}_{K}^{T})^{T}\). The estimating Eq. (2) may be a score equation motivated by likelihood theory, a generalized estimating equation derived by quasi-likelihood or any other statistical estimation framework. Observe that the solution \(\bar{\varvec{\theta }}_{A}\) of (2) is obtained by plugging in the sample values of the K independent estimators \(\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K}\). For simplicity we may further assume that the data \(\mathcal {D}\) is a random sample \(\varvec{Y}_{1},\ldots ,\varvec{Y}_{n}\) and (2) is of the form

\( n^{-1}\sum _{i=1}^{n}\varvec{\psi }(\varvec{\theta },\widehat{\varvec{\eta }},\varvec{Y}_{i})=\varvec{0}. \)
The function \(\varvec{\psi }\) is assumed to be: (i) continuously differentiable with respect to both \(\varvec{\theta }\) and \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\); it is further assumed to satisfy (ii) \(\mathbb {E}_{0}(\varvec{\psi }) = \varvec{0}\); (iii) \(\mathbb {E}_{0}(\varvec{\psi }\varvec{\psi }^T) < \infty \); (iv) the matrix \(\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\eta })\) exists; and (v) the matrix \(\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\theta })\) exists and is invertible. Here \(\mathbb {E}_{0}(\cdot )\) denotes the expectation taken at \(\varvec{\omega }_{0}=(\varvec{\theta }_0,\varvec{\eta }_{0})=(\varvec{\theta }_0,\varvec{\eta }_{1,0},\ldots ,\varvec{\eta }_{K,0})\), the true value of all parameters. Conditions \((i)-(v)\) are all standard regularity conditions often imposed in the literature (cf., Heyde 2008, Van der Vaart 2000). We now have the following:
Theorem 3.1
Let \(\bar{\varvec{\theta }}_{A}\) be a solution to (2) and set \(\bar{\varvec{\eta }}_{A}=\widehat{\varvec{\eta }}\). Assume that: (i) \(\bar{\varvec{\theta }}_{A}\) is consistent at \(\varvec{\omega }_{0}\); (ii) the estimating function \(\varvec{\psi }\) satisfies the regularity conditions listed above; and (iii) the historical estimates satisfy \(\sqrt{m_j}(\widehat{\varvec{\eta }}_{j}-\varvec{\eta }_{j,0}) \Rightarrow \mathcal {N}_{q_{j}}(\varvec{0},\varvec{\varSigma }_{j})\) and are independent of each other and of the current study. Then if \((m/m_j)\rightarrow \kappa _j < \infty \) for all \(j=1,\ldots ,K\) as \(m_j \rightarrow \infty \) and \(n/m \rightarrow \rho \in (0,\infty )\) as \(n \rightarrow \infty \) we have

\( \sqrt{n}\left( (\bar{\varvec{\theta }}_{A}-\varvec{\theta }_{0})^{T},(\bar{\varvec{\eta }}_{A}-\varvec{\eta }_{0})^{T}\right) ^{T}\Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{A}), \)

where

\( \varvec{A}= \begin{pmatrix} \varvec{A}_{\varvec{\theta \theta }} &{} \varvec{A}_{\varvec{\theta \eta }} \\ \varvec{A}_{\varvec{\eta \theta }} &{} \varvec{A}_{\varvec{\eta \eta }} \end{pmatrix} \)

with

\( \varvec{A}_{\varvec{\theta \theta }}=\varvec{D}^{-1}_{{\varvec{\theta }_0}}\left( \varvec{\varSigma }_{\varvec{\psi }}+\rho \varvec{D}_{{\varvec{\eta }_0}}\varvec{\varSigma }\varvec{D}^{T}_{{\varvec{\eta }_0}}\right) (\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T},\quad \varvec{A}_{\varvec{\theta \eta }}=-\rho \varvec{D}^{-1}_{{\varvec{\theta }_0}}\varvec{D}_{{\varvec{\eta }_0}}\varvec{\varSigma },\quad \varvec{A}_{\varvec{\eta \eta }}=\rho \varvec{\varSigma }, \)
where \(\varvec{D}_{{\varvec{\theta }_0}}=\mathbb {E}_{0}(\partial \varvec{\psi }/\partial {\varvec{\theta })}\), \(\varvec{D}_{{\varvec{\eta }_0}}=\mathbb {E}_{0}(\partial \varvec{\psi }/\partial {\varvec{\eta })}\), \(\varvec{\varSigma }_{\varvec{\psi }}=\mathbb {E}_{0}(\varvec{\psi }\varvec{\psi }^T)\) and \(\varvec{\varSigma } = \text {BlockDiag}(\kappa _{1}\varvec{\varSigma }_{1},\ldots ,\kappa _{K}\varvec{\varSigma }_{K})\).
Remark 1
Clearly, \(\varvec{A}_{\varvec{\theta \theta }}\) is the \(p\times p\) asymptotic variance matrix of \(\bar{\varvec{\theta }}_{A}\), \(\varvec{A}_{\varvec{\eta \eta }}\) is the \(q\times q\) asymptotic variance matrix of \(\bar{\varvec{\eta }}_{A}\) and \(\varvec{A}_{\varvec{\theta \eta }} = \varvec{A}_{\varvec{\eta \theta }}^T\) is their \(p\times q\) asymptotic covariance matrix.
Remark 2
As pointed out by a referee the application of Theorem 3.1 is predicated on the fact that the historical studies report the estimates of the variance matrices \(\varvec{\varSigma }_{1},\ldots ,\varvec{\varSigma }_{K}\).
The proof of Theorem 3.1 is a straightforward, but somewhat involved, application of the delta method. In contrast with Randles (1982) and Pierce (1982), which describe the limiting distribution of statistics that are explicit functions of estimated parameters, the estimator \(\bar{\varvec{\theta }}_{A}\) is an implicit function of \(\bar{\varvec{\eta }}_{A}\). For a related but less general result see Benichou and Gail (1989). Further note that \((\varvec{D}^{-1}_{{\varvec{\theta }_0}})\varvec{\varSigma }_{\varvec{\psi }}(\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T}\) is the asymptotic variance of \(\bar{\varvec{\theta }}_{A}\) when the true values of \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) are known in advance. Thus the term

\( \rho \varvec{D}^{-1}_{{\varvec{\theta }_0}}\varvec{D}_{{\varvec{\eta }_0}}\varvec{\varSigma }\varvec{D}^{T}_{{\varvec{\eta }_0}}(\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T} \)

may be viewed as the penalty for substituting estimates for the true values of the parameters. The penalty may also be rewritten as \(\rho \varvec{D}^{-1}_{{\varvec{\theta }_0}}\big (\sum _{j=1}^{K} \kappa _{j} \varvec{D}_{j} \varvec{\varSigma }_{j} \varvec{D}^{T}_{j}\big )(\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T}\) where \(\varvec{D}_{j} = \mathbb {E}_{0}(\partial \varvec{\psi }/\partial {\varvec{\eta }_{j})}\), which expresses its dependence on the relative sample sizes, the asymptotic variances of the historical estimators and the sensitivity of the estimation procedure with respect to the historical estimates, embodied in the matrices \(\varvec{D}_1,\ldots ,\varvec{D}_K\).
Remark 3
Note that if \(\rho \) is very small which occurs when \(m \gg n\), then the penalty is inconsequential, i.e., the asymptotic variance of \(\bar{\varvec{\theta }}_{A}\) is close to its variance when \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) are fully known.
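The role of the penalty and of \(\rho \) can be checked numerically in a toy scalar location model (a sketch under hypothetical values): take \(\psi (\theta ,\eta ,Y)=Y-\eta -\theta \), so that \(\bar{\theta }_{A}=\bar{Y}-\widehat{\eta }\) and Theorem 3.1 predicts \(n\,\mathrm {Var}(\bar{\theta }_{A})\approx \varSigma _{\psi }+\rho \varSigma =1+n/m\):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, eta0 = 2.0, 1.0
n, m = 200, 2000        # rho = n/m = 0.1 (hypothetical sizes)
reps = 4000

# Draw directly from the limiting normals: sqrt(m)(eta_hat - eta0) => N(0,1)
# and sqrt(n)(Ybar - eta0 - theta0) => N(0,1).
eta_hat = eta0 + rng.normal(0, 1, reps) / np.sqrt(m)
ybar = theta0 + eta0 + rng.normal(0, 1, reps) / np.sqrt(n)

# psi(theta, eta, Y) = Y - eta - theta, so the Type I plug-in estimator is:
theta_bar_A = ybar - eta_hat

# Theorem 3.1 (scalar case): n * Var(theta_bar_A) ≈ 1 + rho = 1.1;
# as m grows relative to n (rho -> 0) the penalty disappears.
print(n * theta_bar_A.var())
```

Repeating the experiment with a much larger `m` drives the empirical value toward 1, illustrating Remark 3.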
3.2 Type II problems
Next, consider the case where both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable using the data \(\mathcal {D}\) observed in the current study \(\mathcal {S}\). Further assume that two estimating functions \(\varvec{\varPsi }\) and \(\varvec{\varGamma }\) are available to us; \(\varvec{\varPsi }\) is an estimating function for \(\varvec{\theta }\) given a known fixed value of \(\varvec{\eta }\), as in Type I Problems, whereas \(\varvec{\varGamma }\) is an estimating function for \(\varvec{\eta }\) given a known fixed value of \(\varvec{\theta }\). For example, within the likelihood framework \(\varvec{\varPsi }\) is the score with respect to \(\varvec{\theta }\) while \(\varvec{\varGamma }\) is the score with respect to \(\varvec{\eta }\).
We propose estimating \(\varvec{\omega }\) using a two step procedure. In the first step the data \(\mathcal {D}\) is used to obtain the pair \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})^T\) which simultaneously solve

\( \varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=\varvec{0} \quad \text {and} \quad \varvec{\varGamma }(\varvec{\theta },\varvec{\eta })=\varvec{0}. \)   (3)
Under standard regularity conditions, cf., the conditions listed just before the statement of Theorem 3.1, the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})^T\) satisfy

\( \sqrt{n}\left( (\tilde{\varvec{\theta }}-\varvec{\theta }_{0})^{T},(\tilde{\varvec{\eta }}-\varvec{\eta }_{0})^{T}\right) ^{T}\Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{\varUpsilon }), \)   (4)

where \(\varvec{\varUpsilon }\) is assumed to be a non-singular variance matrix which can be consistently estimated from the data by, say, \(\tilde{\varvec{\varUpsilon }}\), the standard sandwich estimator (Van der Vaart 2000). For convenience, we may partition \(\varvec{\varUpsilon }\) as

\( \varvec{\varUpsilon }= \begin{pmatrix} \varvec{\varUpsilon }_{\varvec{\theta }\varvec{\theta }} &{} \varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }} \\ \varvec{\varUpsilon }_{\varvec{\eta }\varvec{\theta }} &{} \varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }} \end{pmatrix}, \)   (5)
where \(\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\theta }}\) and \(\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}\) denote the marginal asymptotic variances of \(\tilde{\varvec{\theta }}\) and \(\tilde{\varvec{\eta }}\), respectively, and \(\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }}\) is their asymptotic covariance. Naturally, a similar partition holds for \(\tilde{\varvec{\varUpsilon }}\). Furthermore, as in Sect. 3.1, at our disposal are K independent historical estimates of \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) obtained using studies of sizes \(m_1,\ldots ,m_K\) which satisfy \(\sqrt{m_{j}}(\widehat{\varvec{\eta }}_{j}-\varvec{\eta }_{j,0})\Rightarrow \mathcal {N}_{q_{j}}(\varvec{0},\varvec{\varSigma }_{j})\), where, again, it is assumed that \(\varvec{\varSigma }_{j}\) are non-singular and can be consistently estimated for all \(j=1,\ldots ,K\). Thus

\( \sqrt{m}(\widehat{\varvec{\eta }}-\varvec{\eta }_{0})\Rightarrow \mathcal {N}_{q}(\varvec{0},\varvec{\varSigma }), \)   (6)

where \(\varvec{\varSigma }\) is given in the statement of Theorem 3.1. Let \(\widehat{\varvec{\varSigma }}\) be a consistent estimator of \(\varvec{\varSigma }\).
The historic and current estimates of \(\varvec{\eta }\) can be aggregated, or combined, in many ways. Lemma 2, appearing in the Appendix, suggests using the estimator

\( \bar{\varvec{\eta }}=\big (n\tilde{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}+m\widehat{\varvec{\varSigma }}^{-1}\big )^{-1}\big (n\tilde{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}\tilde{\varvec{\eta }}+m\widehat{\varvec{\varSigma }}^{-1}\widehat{\varvec{\eta }}\big ), \)   (7)

which is the MLE under normality assuming that the matrices \({\varvec{\varUpsilon }}_{\varvec{\eta \eta }}\) and \(\varvec{\varSigma }\) are known. Note that
where the weights \(\varvec{W}_1\) and \(\varvec{W}_2\) are the symmetric matrices
which satisfy \(\varvec{I} =\varvec{W}_1+\varvec{W}_2\) with \(\gamma = \lim (n/(n+m))\). Thus (7) differs from the best linear unbiased estimator by at most an \(o_{p}(1)\) term.
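The precision-weighted combination (7) is easy to compute directly; the following sketch uses hypothetical two-dimensional values and forms the matrix weights from the sample-size-weighted precisions:

```python
import numpy as np

# Combine a current estimate eta_tilde (asymptotic variance Upsilon_ee / n)
# with an independent historical estimate eta_hat (asymptotic variance
# Sigma / m) by precision weighting; all numbers are hypothetical.
n, m = 100, 400
eta_tilde = np.array([1.10, 0.45])
eta_hat = np.array([0.95, 0.52])
Upsilon_ee = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma = np.array([[0.8, 0.0], [0.0, 0.6]])

P1 = n * np.linalg.inv(Upsilon_ee)   # precision of eta_tilde
P2 = m * np.linalg.inv(Sigma)        # precision of eta_hat
eta_bar = np.linalg.solve(P1 + P2, P1 @ eta_tilde + P2 @ eta_hat)

W1 = np.linalg.solve(P1 + P2, P1)    # weight on the current estimate
W2 = np.linalg.solve(P1 + P2, P2)    # weight on the historical estimate
assert np.allclose(W1 + W2, np.eye(2))   # the weights sum to the identity
print(eta_bar)
```

Note that the combined precision `P1 + P2` dominates each individual precision, so the combined variance is never larger than that of either input.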
In the second step we find \(\bar{\varvec{\theta }}_{B}\) by solving

\( \varvec{\varPsi }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}, \)   (9)

where \(\bar{\varvec{\eta }}\) is given by (7). We now have:
Theorem 3.2
Let \(\bar{\varvec{\theta }}_{B}\) be a solution to (9) where \(\bar{\varvec{\eta }}_{B}=\bar{\varvec{\eta }}\) is given in (7). Assume that the regularity conditions of Theorem 3.1 hold. Then

\( \sqrt{n}\left( (\bar{\varvec{\theta }}_{B}-\varvec{\theta }_{0})^{T},(\bar{\varvec{\eta }}_{B}-\varvec{\eta }_{0})^{T}\right) ^{T}\Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{B}), \)   (10)

where

\( \varvec{B}= \begin{pmatrix} \varvec{B}_{\varvec{\theta \theta }} &{} \varvec{B}_{\varvec{\theta \eta }} \\ \varvec{B}_{\varvec{\eta \theta }} &{} \varvec{B}_{\varvec{\eta \eta }} \end{pmatrix} \)

with

\( \varvec{B}_{\varvec{\theta \theta }}=\varvec{D}^{-1}_{{\varvec{\theta }_0}}\left( \varvec{\varSigma }_{\varvec{\psi }}+\varvec{D}_{{\varvec{\eta }_0}}\varvec{B}_{\varvec{\eta \eta }}\varvec{D}^{T}_{{\varvec{\eta }_0}}\right) (\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T},\quad \varvec{B}_{\varvec{\theta \eta }}=-\varvec{D}^{-1}_{{\varvec{\theta }_0}}\varvec{D}_{{\varvec{\eta }_0}}\varvec{B}_{\varvec{\eta \eta }},\quad \varvec{B}_{\varvec{\eta \eta }}=\big (\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1}\big )^{-1}. \)
Although the mechanics are slightly more involved, the proof of Theorem 3.2 builds on the proof of Theorem 3.1. Moreover, the structures of the asymptotic variance matrices \(\varvec{A}\) and \(\varvec{B}\) are analogous with the exception that the variance matrix \(\rho \varvec{\varSigma }\) appearing in \(\varvec{A}\) is replaced with \((\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\) in \(\varvec{B}\).
Remark 4
Observe that \((\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1} \rightarrow \varvec{0}\) as \(\rho \rightarrow 0\) so the conclusions of Remark 3 hold here as well.
It is clear that whenever the model for \(\mathcal {D}\) identifies \(\varvec{\omega }\), both \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \((\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) can be computed. Next, using the concept of the Loewner order we show the former is superior to the latter. Recall that the matrix \(\varvec{V}_1\) is said to be smaller in the Loewner order compared with the matrix \(\varvec{V}_2\) if \(\varvec{V}_2-\varvec{V}_1\) is non-negative definite (Pukelsheim 2006). This relationship is denoted by \(\varvec{V}_1\preceq \varvec{V}_2\). Suppose now that \(\varvec{V}_1\) and \(\varvec{V}_2\) are the variances of two (asymptotically) unbiased estimators. Then \(\varvec{V}_1\preceq \varvec{V}_2\) implies that the estimator associated with \(\varvec{V}_1\) is more efficient than the estimator associated with \(\varvec{V}_2\). This means, for example, that the confidence ellipsoid associated with \(\varvec{V}_1\) lies within the confidence ellipsoid associated with \(\varvec{V}_2\).
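The Loewner comparison is straightforward to check numerically: \(\varvec{V}_1\preceq \varvec{V}_2\) holds exactly when all eigenvalues of the symmetric difference \(\varvec{V}_2-\varvec{V}_1\) are non-negative. A minimal sketch with hypothetical variance matrices:

```python
import numpy as np

# V1 ⪯ V2 in the Loewner order iff V2 - V1 is non-negative definite,
# i.e., the eigenvalues of the symmetric difference are all >= 0.
def loewner_leq(V1, V2, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(V2 - V1) >= -tol))

V1 = np.array([[1.0, 0.4], [0.4, 1.0]])   # hypothetical variance matrices
V2 = np.array([[2.0, 0.5], [0.5, 1.5]])

# True here: the confidence ellipsoid of V1 lies inside that of V2.
print(loewner_leq(V1, V2))
```

In practice such a check is useful for verifying, on estimated matrices, comparisons like those in Theorems 3.3 and 3.5.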
Theorem 3.3
Whenever the data \(\mathcal {D}\) identifies \(\varvec{\omega }\) we have

\( \varvec{B} \preceq \varvec{A}. \)   (11)
Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{A}}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{A}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) and \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) respectively.
Theorem 3.3 indicates that, if possible, it is always asymptotically beneficial to estimate both \(\varvec{\theta }\) and \(\varvec{\eta }\) using the data \(\mathcal {D}\) collected in the study \(\mathcal {S}\). Moreover, Theorem 3.3 holds also when only a sub-vector of \(\varvec{\eta }\) is identified by the data \(\mathcal {D}\).
Remark 5
As noted by a referee, an alternative approach to Type II Problems would be to combine the first and second estimation steps. This can be done by simultaneously solving the estimating equations

\( \varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=\varvec{0} \quad \text {and} \quad \varvec{\varGamma }(\varvec{\theta },\varvec{\eta })+m\widehat{\varvec{\varSigma }}^{-1}(\widehat{\varvec{\eta }}-\varvec{\eta })=\varvec{0}. \)   (12)
The system (12) is obtained from (3) by augmenting the estimating function \(\varvec{\varGamma }\) with the term \(m\widehat{\varvec{\varSigma }}^{-1}(\widehat{\varvec{\eta }}-\varvec{\eta })\). The latter is a pseudo-score equation which follows directly from (6). By appropriately modifying the proof of Theorem 3.2 it can be shown that \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) have the same limiting distribution as the solution of (12). It thus follows that the estimator (7) is asymptotically efficient up to the first order.
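The agreement between the two-step and the augmented one-step approach can be seen in a toy Gaussian two-arm example (all values hypothetical): the current study has a control arm with mean \(\eta \) and a treated arm with mean \(\eta +\theta \), unit variances, and an external historical estimate \(\widehat{\eta }\). In this simple case the two estimators coincide exactly, illustrating the first-order equivalence noted above:

```python
import numpy as np

rng = np.random.default_rng(3)
eta0, theta0, sigma2_hist = 1.0, 0.5, 1.0   # hypothetical values
n_c = n_t = 100      # current study: control and treated arm sizes
m = 500              # historical sample size behind eta_hat

y_c = rng.normal(eta0, 1.0, n_c)             # current control arm
y_t = rng.normal(eta0 + theta0, 1.0, n_t)    # current treated arm
eta_hat = rng.normal(eta0, np.sqrt(sigma2_hist / m))

# Two-step: combine eta_tilde = mean(y_c) with eta_hat by precision
# weighting, then solve Psi(theta, eta_bar) = mean(y_t) - eta - theta = 0.
p_cur, p_hist = n_c / 1.0, m / sigma2_hist
eta_bar = (p_cur * y_c.mean() + p_hist * eta_hat) / (p_cur + p_hist)
theta_B = y_t.mean() - eta_bar

# One-step: solve the augmented system
#   sum(y_t - eta - theta) = 0
#   sum(y_c - eta) + sum(y_t - eta - theta) + m (eta_hat - eta) / sigma2 = 0.
# Substituting the first equation into the second yields the same eta_bar.
eta_one = (n_c * y_c.mean() + (m / sigma2_hist) * eta_hat) / (n_c + m / sigma2_hist)
theta_one = y_t.mean() - eta_one
print(abs(theta_B - theta_one))   # agreement up to floating-point error
```

In richer models the two solutions need not coincide in finite samples, but by Remark 5 they share the same limiting distribution.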
Another variant of Type II Problems occurs when the data \(\mathcal {D}\) is not available, but nevertheless the estimates \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) from the current study, as well as their estimated variance \(\tilde{\varvec{\varUpsilon }}\), are given. The objective is then to combine the current estimators (4) with the historical estimators (6). To this end we propose estimating \(\varvec{\theta }\) by

\( \bar{\varvec{\theta }}_{C}=\tilde{\varvec{\theta }}-\tilde{\varvec{R}}(\tilde{\varvec{\eta }}-\bar{\varvec{\eta }}_{C}), \)   (13)

where \(\tilde{\varvec{R}}=\tilde{\varvec{\varUpsilon }}_{\varvec{\theta }\varvec{\eta }}\tilde{\varvec{\varUpsilon }}_{\varvec{\eta }\varvec{\eta }}^{-1}\) and \(\bar{\varvec{\eta }}_{C}=\bar{\varvec{\eta }}\) is given by (7). The estimators (7) as well as (13) are motivated by Lemma 2 and Remark 8 appearing in the Appendix.
Theorem 3.4
Let \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})^T\) be defined by (7) and (13). Suppose further that (4) and (6) hold and both \(\varvec{\varUpsilon }\) and \(\varvec{\varSigma }\) can be consistently estimated. Then as \(n \rightarrow \infty \) we have

\( \sqrt{n}\left( (\bar{\varvec{\theta }}_{C}-\varvec{\theta }_{0})^{T},(\bar{\varvec{\eta }}_{C}-\varvec{\eta }_{0})^{T}\right) ^{T}\Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{C}), \)

where \(\varvec{C}=\varvec{M}\varvec{V}\varvec{M}^T\) with

\( \varvec{M}= \begin{pmatrix} \varvec{I} &{} -\varvec{R}\varvec{W}_2 &{} \varvec{R}\varvec{W}_2 \\ \varvec{0} &{} \varvec{W}_1 &{} \varvec{W}_2 \end{pmatrix} \quad \text {and} \quad \varvec{V}= \begin{pmatrix} \varvec{\varUpsilon } &{} \varvec{0} \\ \varvec{0} &{} \rho \varvec{\varSigma } \end{pmatrix}. \)

The matrices \(\varvec{W}_1\) and \(\varvec{W}_2\) are defined in (8) and \(\varvec{R}=\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }}\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}^{-1}\). Moreover, we have:

\( \varvec{C}_{\varvec{\theta \theta }}=\varvec{\varUpsilon }_{\varvec{\theta \theta }}-\varvec{R}\varvec{W}_2\varvec{\varUpsilon }_{\varvec{\eta \theta }}. \)
Theorem 3.4 describes the large sample behavior of the estimators (7) and (13). Further insight is facilitated by considering the simplest possible situation, i.e., when \((\theta ,\eta )\in \mathbb {R}^2\), in which case \(\sqrt{m}(\widehat{\eta }-\eta _{0})\Rightarrow \mathcal {N}(0,\sigma ^2)\) for the historical data, whereas for the current study \(\sqrt{n}(\tilde{\theta }-\theta _{0},\tilde{\eta }-\eta _{0})^{T} \Rightarrow \mathcal {N}(0,\varvec{\varUpsilon })\) where

\( \varvec{\varUpsilon }= \begin{pmatrix} \upsilon _{\theta \theta }^{2} &{} \upsilon _{\theta \eta } \\ \upsilon _{\theta \eta } &{} \upsilon _{\eta \eta }^{2} \end{pmatrix}. \)   (14)

It is not hard to see that (13) reduces to \(\bar{\theta }=\tilde{\theta }-(\tilde{\upsilon }_{\theta \eta }/\tilde{\upsilon }_{\eta \eta }^2)(\tilde{\eta }-\bar{\eta })\) where \(\bar{\eta } = w_{1}^{*}\tilde{\eta }+w_{2}^{*}\widehat{\eta }\) with

\( w_{1}^{*}=\frac{n/\tilde{\upsilon }_{\eta \eta }^{2}}{n/\tilde{\upsilon }_{\eta \eta }^{2}+m/\widehat{\sigma }^{2}} \quad \text {and} \quad w_{2}^{*}=\frac{m/\widehat{\sigma }^{2}}{n/\tilde{\upsilon }_{\eta \eta }^{2}+m/\widehat{\sigma }^{2}}. \)
Furthermore \(\varvec{C}_{\theta \theta }\) simplifies to

\( \varvec{C}_{\theta \theta }=\upsilon _{\theta \theta }^{2}\big (1-w_{2}r^{2}\big ), \)

where \(r=\upsilon _{\theta \eta }/(\upsilon _{\theta \theta }\upsilon _{\eta \eta })\) is the asymptotic correlation between \(\tilde{\theta }\) and \(\tilde{\eta }\) and

\( w_{2}=\frac{(1-\gamma )\upsilon _{\eta \eta }^{2}}{\gamma \sigma ^{2}+(1-\gamma )\upsilon _{\eta \eta }^{2}} \)   (15)

is the limiting value of \(w_{2}^{*}\) as \(n/(n+m) \rightarrow \gamma \). It follows that the asymptotic relative efficiency of \(\bar{\theta }\) to \(\tilde{\theta }\) is \(1-w_{2}r^{2}\), which is at most unity (attained when \(\upsilon _{\theta \eta }=0\)) and no less than \(1-r^{2}\) (approached when \(\gamma \) is close to 0). Clearly, the historical estimates are useful only if the covariance \(\upsilon _{\theta \eta }\) is non-zero, and highly useful whenever \(w_2\) is close to unity. A similar but more involved analysis applies when the parameters are multidimensional.
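The scalar analysis above can be checked by direct simulation. The sketch below uses hypothetical values, writes the entries of \(\varvec{\varUpsilon }\) as variances rather than standard deviations for convenience, draws from the limiting normal distributions, and compares the empirical variance ratio with \(1-w_{2}r^{2}\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 800                    # gamma = n/(n+m) = 0.2 (hypothetical)
v_tt, v_te, v_ee = 1.0, 0.6, 1.0   # entries of Upsilon, as variances/covariance
sigma2 = 1.0                       # historical asymptotic variance
reps = 20000

# Draw the limiting joint normal of (theta_tilde, eta_tilde) and eta_hat.
cov = np.array([[v_tt, v_te], [v_te, v_ee]]) / n
cur = rng.multivariate_normal([0.0, 0.0], cov, reps)
theta_tilde, eta_tilde = cur[:, 0], cur[:, 1]
eta_hat = rng.normal(0.0, np.sqrt(sigma2 / m), reps)

w2 = (m / sigma2) / (n / v_ee + m / sigma2)   # weight on the historical estimate
eta_bar = (1 - w2) * eta_tilde + w2 * eta_hat
theta_bar = theta_tilde - (v_te / v_ee) * (eta_tilde - eta_bar)

r2 = v_te**2 / (v_tt * v_ee)
print(theta_bar.var() / theta_tilde.var())   # ≈ 1 - w2 * r2
```

With these values the predicted ratio is \(1-0.8\times 0.36=0.712\), and the empirical ratio matches it up to Monte Carlo error.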
We emphasize that the structure of the estimators \(\bar{\varvec{\eta }}_{C}\) and \(\bar{\varvec{\theta }}_{C}\) as well as the form of \(\varvec{C}\) are related to, but much more general than, results obtained in the literature on both double sampling and monotone missing normal data (Anderson 1957; Morrison 1971; Kanda and Fujikoshi 1998). Double sampling is a widely used technique in survey sampling, where the estimator is also known as the generalized regression estimator (Thompson 1997), as well as in other applications, cf. Davidov and Haitovsky (2000), Chen and Chen (2000) and the references therein. We also note that Eq. (15) is a generalization of the formulas obtained for the usual double sampling estimator (e.g., Tamhane 1978) where \(w_2 = m/(n+m)\). The following theorem substantially generalizes results obtained in the literature on both double sampling and monotone missing data.
Theorem 3.5
We have

\( \varvec{C} \preceq \varvec{\varUpsilon }. \)
Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{\varUpsilon }}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{\varUpsilon }}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varPhi (\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) and \(\varPhi (\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) respectively.
In words, the estimator \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\), incorporating the historical estimates and derived by combining \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\), is more precise than \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\), the estimator based only on the current study.
Remark 6
It is also important to emphasize that in finite, typically small, samples the estimator \(\varvec{C}_{\varvec{\theta \theta }}\) may in fact be inferior to \(\varvec{\varUpsilon }_{\varvec{\theta \theta }}\). This typically occurs when the “regression matrix” \(\varvec{R}\), see the statement of Theorem 3.4, is poorly estimated. This feature has also been recognized in the double sampling literature (Tamhane 1978).
A little algebra shows that
so we can remove the dependence of \(\varvec{C}\) on the matrices \(\varvec{W}_1\) and \(\varvec{W}_2\).
Clearly, whenever the data \(\mathcal {D}\) is available both \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) can be calculated where \(\bar{\varvec{\eta }}_{B}=\bar{\varvec{\eta }}_{C}\) are given in (7). Recall that \(\bar{\varvec{\theta }}_{B}\) solves \(\varvec{\varPsi }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}\) where \(\varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=n^{-1}\sum _{i=1}^{n} \varvec{\psi }(\varvec{\theta },\varvec{\eta },\varvec{Y}_i)\). Similarly, we can view \(\bar{\varvec{\theta }}_{C}\) as a solution to an estimating equation \(\varvec{\varLambda }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}\) where \(\varvec{\varLambda }(\varvec{\theta },\varvec{\eta })=n^{-1}\sum _{i=1}^{n} \varvec{\lambda }(\varvec{\theta },\varvec{\eta },\varvec{Y}_i)\). The form of \(\varvec{\varLambda }\) can be easily deduced from Lemma 2 and that of \(\varvec{\lambda }\) by plugging in the influence functions for \(\tilde{\varvec{\theta }}\) and \(\tilde{\varvec{\eta }}\) into \(\varvec{\varLambda }\). In fact, the precise form of the influence function of \(\bar{\varvec{\theta }}_{C}\) is readily derived; for more details see Remark 9 in the Appendix. It is worth noting that \(\varvec{\varPsi }\) operates on the full data \(\mathcal {D}\) whereas \(\varvec{\varLambda }\) operates on functions thereof, namely the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\). Thus \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) can be viewed as functions of a coarsening of the data \(\mathcal {D}\) and are therefore expected to be less efficient than \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\). This is indeed the case under mild regularity conditions. A formal statement requires the introduction of some additional notation.
Let \(\varvec{h}=\varvec{h}(\varvec{\theta },\varvec{\eta },\varvec{Y})\) denote any estimating function and denote \(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{h}) = \mathbb {E}_{0}(\partial \varvec{h}/\partial \varvec{\theta })\) and \(\varvec{D}_{\varvec{\eta }_{0}}(\varvec{h}) = \mathbb {E}_{0}(\partial \varvec{h}/\partial \varvec{\eta })\). Note that earlier we referred to \(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\psi })\) and \(\varvec{D}_{\varvec{\eta }_{0}}(\varvec{\psi })\) simply as \(\varvec{D}_{\varvec{\theta }_{0}}\) and \(\varvec{D}_{\varvec{\eta }_{0}}\). Now:
Theorem 3.6
Suppose that both \(\bar{\varvec{\omega }}_{B}\) and \(\bar{\varvec{\omega }}_{C}\) can be obtained. If
component-wise and
in the Loewner order, then
Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) respectively.
Condition (18) holds when the estimating equation \(\varvec{\varPsi }(\varvec{\theta },\varvec{\eta }_0)=\varvec{0}\) results in more efficient estimators for \(\varvec{\theta }\) than those resulting from \(\varvec{\varLambda }(\varvec{\theta },\varvec{\eta }_0)=\varvec{0}\) when \(\varvec{\eta }=\varvec{\eta }_{0}\) is set to its true value. This condition holds for any sensible choice of \(\varvec{\varPsi }\). In particular, it holds for the score equations associated with maximum likelihood estimation. Condition (17) roughly means that \(\varvec{\varPsi }\) is less sensitive to small perturbations in both \(\varvec{\theta }\) and \(\varvec{\eta }\) compared with \(\varvec{\varLambda }\). Conditions (17) and (18) are not necessary. For example, the conclusion of Theorem 3.6 may hold if \(\varvec{\psi }\) is more sensitive to small perturbations but at the same time much more efficient. We believe that the aforementioned conditions hold broadly and the estimators \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\), described in Theorem 3.4, are generally less efficient than \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\), described in Theorem 3.2. For an additional discussion see Remark 9 in the Appendix. There are, however, situations in which \(\varvec{B}=\varvec{C}\) and situations where \(\bar{\varvec{\omega }}_{B} =\bar{\varvec{\omega }}_{C}\) for any data \(\mathcal {D}\). As we shall see in the next section this is the case in normal linear models in which the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\) are actually sufficient statistics. Finally, it is worth noting that if \(\varvec{\varUpsilon }_{\varvec{\theta \eta }} = \varvec{0}\) then the estimator \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) does not improve on \(\tilde{\varvec{\theta }}\), whereas there is always an improvement when the full data \(\mathcal {D}\) is available.
Remark 7
To summarize, the proposed estimators are designed to extract as much information as possible from the data. Recall that in Type I Problems the parameter \(\varvec{\eta }\) is not estimable given the current data \(\mathcal {D}\). The pair \((\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) solves the system of equations \(\varvec{\varPsi }(\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})=\varvec{0}\) with \(\bar{\varvec{\eta }}_{A}=\widehat{\varvec{\eta }}\) where \(\widehat{\varvec{\eta }}\) is the historical estimator. In Type II Problems the pair \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) solves the system of equations \(\varvec{\varPsi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})=0\) where \(\bar{\varvec{\eta }}_{B}\), given by (7), is a weighted combination of \(\widehat{\varvec{\eta }}\) and \(\tilde{\varvec{\eta }}\). Moreover, as noted earlier, it can be shown that the resulting estimators are asymptotically efficient. Our approach to Type I and the first variant of Type II Problems is of similar structure: plug into \(\varvec{\varPsi }\) the best available estimator of \(\varvec{\eta }\). In the second variant of Type II Problems the pair \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) solves a different system of estimating equations which we denote by \(\varvec{\varLambda }(\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})=\varvec{0}\). These equations are the likelihood equations given in Lemma 2 assuming that the variance matrices \(\varvec{\varSigma }\) and \(\varvec{\varUpsilon }\) are fully known. Since \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) are based on further coarsening of the data they are generally less efficient.
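The plug-in recipe summarized above can be illustrated with a toy computation. The estimating function \(\psi (\theta ,\eta ,y)=y-\eta -\theta \) and the function names below are invented for illustration, not taken from the paper:

```python
def solve_psi(psi, ys, eta_bar, lo=-1e6, hi=1e6, tol=1e-10):
    """Solve the plug-in estimating equation
    Psi(theta, eta_bar) = (1/n) * sum_i psi(theta, eta_bar, y_i) = 0
    by bisection, assuming Psi is decreasing in theta."""
    def Psi(theta):
        return sum(psi(theta, eta_bar, y) for y in ys) / len(ys)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if Psi(mid) > 0:   # theta still too small
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy estimating function: E[Y] = eta + theta  =>  psi = y - eta - theta
psi = lambda theta, eta, y: y - eta - theta
ys = [4.9, 5.2, 5.1, 4.8]        # current study, mean 5.0
eta_bar = 2.0                    # combined (historical + current) estimate
theta_bar = solve_psi(psi, ys, eta_bar)   # mean(ys) - eta_bar = 3.0
```

Holding \(\eta \) fixed at the combined estimate and solving for \(\theta \) mirrors how \(\bar{\varvec{\theta }}_{B}\) is obtained from \(\varvec{\varPsi }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}\).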
4 Illustrations, applications and numerical results
In this section two applications are discussed in detail. In Sect. 4.1 the two-way ANOVA problem introduced in Sect. 2 is investigated. In particular, various design options for the current study \(\mathcal {S}\) are evaluated. It is worth noting that although the above-mentioned ANOVA problem is among the simplest possible, its analysis is far from trivial. Next, in Sect. 4.2 we discuss the use of historical estimates in the design of drug interaction studies in the context of Bliss independence. A simple algorithm for the design of such studies is proposed.
4.1 Two way ANOVA
Recall the ANOVA model of Sect. 2 where the studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) were designed to estimate \(\varvec{\eta }_{1}=(\eta _0,\eta _1)^T\) and \(\varvec{\eta }_{2}=(\eta _0,\eta _2)^T\), respectively. Note that the parameter \(\eta _0\) is estimated in both studies so \(\varvec{\eta }_{1}\) and \(\varvec{\eta }_{2}\) are not distinct. Therefore, employing any of the aforementioned findings requires the aggregation of the historical estimates as if they came from a single experiment. The historical studies result in the estimates \((\widehat{\eta }_{0}(\mathcal {S}_{1}),\widehat{\eta }_{1}(\mathcal {S}_{1}))\) and \((\widehat{\eta }_{0}(\mathcal {S}_{2}),\widehat{\eta }_{2}(\mathcal {S}_{2}))\) as well as their standard errors. Given these we can easily back-calculate the unobserved means and sample sizes in the studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) and estimate \((\eta _0,\eta _1,\eta _2)\) by:
where the quantity \(\bar{Y}_j(\mathcal {S}_i)\) is the average response on treatment \(j\in \{0,1,2\}\) in study \(i\in \{1,2\}\) and \(m_{i,j}\) is the size of treatment group j in study i. Under the usual conditions
for some matrix \(\varvec{\varSigma }\). Furthermore, if (1) is a homoscedastic model with variance \(\sigma ^2\) and \(m_{1,0}=m_{1,1}=m_{2,0}=m_{2,2}\), i.e., the studies \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\) are balanced and of the same size, then it is easy to see that
We will now investigate various designs for a new study \(\mathcal {S}\). When the primary focus of \(\mathcal {S}\) is inference on \(\theta \) then it may be advantageous in some circumstances to allocate all n observations to the treatment arm receiving both treatments one and two, i.e., \(T_{1}=T_{2}=1\) for all observations. This is clearly a Type I problem since \(\varvec{\omega }\) is not identifiable from \(\mathcal {D}\) but given \(\varvec{\eta }\) the parameter \(\theta \) is estimable. Note that an unbiased estimator for \(\theta \) is
and it is not hard to see that (21) solves (2) when \(\psi (\theta ,\eta _0,\eta _1,\eta _2,Y_i)=Y_i-\eta _0-\eta _1-\eta _2-\theta \). Thus, \(\varvec{\varSigma }_{\psi } = \sigma ^2\), \(\varvec{D}_{\theta _{0}} = 1\) and \(\varvec{D}_{\eta _{0}}= -(1,1,1)\) and it follows that \(\varvec{A}_{\theta \theta }\), the asymptotic variance of (21) as described in Theorem 3.1, reduces to
The second term appearing in the parentheses in the above display is an inflation factor, i.e., the price to pay for substituting estimates for the unknown value of \((\eta _0,\eta _1,\eta _2)\). Note that when \(n/m \rightarrow 0\) as both \(m\rightarrow \infty \) and \(n\rightarrow \infty \) the asymptotic variance of \(\bar{\theta }_{A}\) approaches \(\sigma ^2\). In practice this requires a large current study and even larger historical data. Incidentally, since \(\bar{\theta }_{A}\) is a linear function of \(\bar{Y}_{12}(\mathcal {S})\) and \((\widehat{\eta }_{0},\widehat{\eta }_{1},\widehat{\eta }_{2})\) it is not hard to see that its exact variance is \(\sigma ^2(1/n+10/m)\) which coincides with the asymptotic form.
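The exact variance \(\sigma ^2(1/n+10/m)\) can be checked by simulation. The sketch below assumes each historical study is balanced with two arms of size m/4 and that the shared control mean is pooled by simple averaging; the true parameter values and function name are ours:

```python
import random

def simulate_var_theta_A(n, m, sigma=1.0, reps=40_000, seed=7):
    """Monte Carlo estimate of Var(theta_bar_A) where
    theta_bar_A = Ybar_12(S) - (eta0_hat + eta1_hat + eta2_hat).
    Historical arms are assumed to have size m/4 each; the shared
    control mean eta_0 is pooled by averaging the two control arms."""
    k = m // 4                       # historical arm size
    g = random.Random(seed).gauss
    def arm_mean(mu, size):
        return sum(g(mu, sigma) for _ in range(size)) / size
    e0, e1, e2, th = 1.0, 0.5, -0.3, 0.7       # arbitrary true values
    draws = []
    for _ in range(reps):
        eta0 = (arm_mean(e0, k) + arm_mean(e0, k)) / 2      # pooled control
        eta1 = arm_mean(e0 + e1, k) - eta0
        eta2 = arm_mean(e0 + e2, k) - eta0
        y12 = arm_mean(e0 + e1 + e2 + th, n)                # current study
        draws.append(y12 - eta0 - eta1 - eta2)
    mu = sum(draws) / reps
    return sum((d - mu) ** 2 for d in draws) / (reps - 1)

# For n = 50, m = 40 the exact variance is 1/50 + 10/40 = 0.27; the
# simulated value agrees with it to Monte Carlo accuracy.
```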
Alternatively, suppose that the study \(\mathcal {S}\) allocates n/4 observations to all treatment combinations. In this case the data \(\mathcal {D}\) identifies \(\varvec{\omega } = (\theta ,\eta _0,\eta _1,\eta _2)^T\), so this is a Type II problem. The usual estimators for this design are \(\tilde{\eta _{0}} =\bar{Y}_{0}(\mathcal {S}), \tilde{\eta _{1}} =\bar{Y}_{1}(\mathcal {S})-\bar{Y}_{0}(\mathcal {S}) , \tilde{\eta _{2}}=\bar{Y}_{2}(\mathcal {S})-\bar{Y}_{0}(\mathcal {S})\) and
and thus the limiting variance of \((\tilde{\theta },\tilde{\varvec{\eta }})^{T}\) is
Next we aggregate the historical and current estimates for \(\varvec{\eta }\). As in Sect. 3 we estimate \(\varvec{\eta }\) by \(\bar{\varvec{\eta }} = \varvec{W_1}\tilde{\varvec{\eta }}+\varvec{W_2} \widehat{\varvec{\eta }}\) where
Note that the weight matrices are functions of the variances \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\) and \(\varvec{\varSigma }\) as well as the ratio \(n/(n+m)\). Since \(\mathcal {D}\) is fully available to us, we can estimate \(\theta \) by
Note that the estimators (21) and (22) are of the same functional form. Further note that the statistic \(\bar{Y}_{12}(\mathcal {S})\) in (22) is a function of the \(n_{12}\) observations \(Y_1,\ldots ,Y_{n_{12}}\) receiving the treatment combination \(T_1=T_2=1\). A straightforward calculation shows that \(\varvec{B}_{\theta \theta }\) is given by
where \(\xi _{11}\) is the fraction of the observations which are assigned to receive both treatments. In situations where the full data is not available to us but \((\tilde{\theta },\tilde{\eta }_0,\tilde{\eta }_1,\tilde{\eta }_2)\) are known we may estimate \(\theta \) by \( \bar{\theta }_{C} =\tilde{\theta }-\varvec{\varUpsilon }_{\theta \varvec{\eta }}\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}^{-1}(\tilde{\varvec{\eta }}-\bar{\varvec{\eta }})\). It can be verified that in this application, in which a normal linear model is involved and all estimators are functions of sufficient statistics, the estimators \(\bar{\theta }_{B}\) and \(\bar{\theta }_{C}\) coincide. Therefore \(\bar{\theta }_{C}\) is not discussed any further.
Table 1 provides a comparison of the asymptotic variances of (21) and (22) for a range of values of m and n; the variances themselves are found by dividing any entry in the table by the size of the current study in the relevant row. Observe that both \(\varvec{A}_{\theta \theta }\) and \(\varvec{B}_{\theta \theta }\) decrease as a function of m for any fixed value of n and increase in n for any fixed m. For example, when \(n=m=100\) then \(\varvec{A}_{\theta \theta }=11\) and \(\varvec{B}_{\theta \theta }=9.3\), whereas when \(m=100\) and \(n=5000\) then \(\varvec{A}_{\theta \theta }=501\) and \(\varvec{B}_{\theta \theta }=15.69\), and when \(m=5000\) and \(n=100\) then \(\varvec{A}_{\theta \theta }=1.2\) and \(\varvec{B}_{\theta \theta }=4.2\). Thus going down the first column of Table 1 the asymptotic variance \(\varvec{A}_{\theta \theta }\) increases by a factor of approximately 45 whereas that of \(\varvec{B}_{\theta \theta }\) increases by the much more modest 1.4. Similarly, going across the first row the asymptotic variances \(\varvec{A}_{\theta \theta }\) and \(\varvec{B}_{\theta \theta }\) are reduced by factors of 9.2 and 2.2 respectively. Each pair (n, m) provides a direct comparison between the two proposed designs (design A, say, in which all experimental units in the current study receive both treatments, and design B, say, which is a balanced design). Clearly, design A seems preferable in situations where m is much larger than n; otherwise design B is to be preferred.
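The \(\varvec{A}_{\theta \theta }\) entries quoted above follow directly from the exact variance \(\sigma ^{2}(1/n+10/m)\) derived earlier: multiplying by n gives \(\varvec{A}_{\theta \theta }=\sigma ^{2}(1+10n/m)\). A quick check of the three quoted values:

```python
def A_theta_theta(n, m, sigma2=1.0):
    # asymptotic (n-scaled) variance of theta_bar_A: sigma^2 * (1 + 10 n/m)
    return sigma2 * (1 + 10 * n / m)

print(A_theta_theta(100, 100))    # 11.0
print(A_theta_theta(5000, 100))   # 501.0
print(A_theta_theta(100, 5000))   # 1.2
```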
We now look a bit deeper into the question of optimal design. Suppose for simplicity that the historical sample satisfies \(m_{1,0}=m_{1,1}=m_{2,0}=m_{2,2}\). Note that Table 1 considers only designs with \(\varvec{\xi }=(0,0,0,1)\) and \(\varvec{\xi }=(1/4,1/4,1/4,1/4)\). Therefore we next consider designs for the study \(\mathcal {S}\) where \(\varvec{\xi }=(\xi _{00},\xi _{10},\xi _{01},\xi _{11})\) is any value in the unit simplex. Clearly here \(\xi _{ij}\) denotes the proportion of observations that received treatment combination \(i \times j\) where i and j are in \(\{0,1\}\). It is not hard to see, by Theorem 3.2, that the optimal design in the interior of the simplex, i.e., for a Type II Problem, is attained when \(\xi _{11}^{-1}+\varvec{1}^{T}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{1}\) is minimized, where
Symmetry considerations imply that under optimality \(\xi _{01}=\xi _{10}\), and since \(\xi _{00}=1-2\xi _{10}-\xi _{11}\) the minimization involves only a two dimensional search. Table 2 provides the optimal design, i.e., the vector \(\varvec{\xi }\), for estimating \(\theta \) for various values of the ratio \(\rho = n/m\), found by a grid search with step size 0.001 and the restriction that \(\xi _{00} \ge 0.02\). This restriction is necessary; otherwise the matrix \(\varvec{\varUpsilon }\) cannot be inverted.
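The grid search itself is easy to sketch. The criterion below is the one stated above; however, the matrices \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\) and \(\varvec{\varSigma }(\varvec{\xi })\) are plausible reconstructions for the homoscedastic, balanced setting with \(\sigma ^2=1\) rather than expressions taken verbatim from the paper, and a coarser step of 0.01 is used for speed:

```python
import numpy as np

def optimal_design(rho, step=0.01, min_xi00=0.02, sigma2=1.0):
    """Grid search for the design minimizing
        1/xi11 + 1' (Ups^{-1} + (rho * Sigma(xi))^{-1})^{-1} 1,
    exploiting the symmetry xi01 = xi10.  The matrices below are
    assumed reconstructions, not the paper's exact expressions."""
    # historical variance of (eta0_hat, eta1_hat, eta2_hat), scaled by m
    ups = sigma2 * np.array([[2., -2., -2.],
                             [-2., 6., 2.],
                             [-2., 2., 6.]])
    ups_inv = np.linalg.inv(ups)
    one = np.ones(3)
    best = (np.inf, None)
    for xi10 in np.arange(step, 1.0, step):
        for xi11 in np.arange(step, 1.0 - 2 * xi10, step):
            xi00 = 1.0 - 2 * xi10 - xi11
            if xi00 < min_xi00:
                continue
            # current-study variance of eta-tilde, scaled by n (xi01 = xi10)
            a, b = 1 / xi00, 1 / xi10
            sig = sigma2 * np.array([[a, -a, -a],
                                     [-a, a + b, a],
                                     [-a, a, a + b]])
            crit = 1 / xi11 + one @ np.linalg.inv(
                ups_inv + np.linalg.inv(rho * sig)) @ one
            if crit < best[0]:
                best = (crit, (xi00, xi10, xi10, xi11))
    return best
```

The analogous search, with the paper's exact matrices and step size 0.001, underlies Table 2.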
Table 2 provides the optimal design for estimating \(\theta \) as a function of the sampling ratio. Note that for large \(\rho \), i.e., when n is larger than m, we find that \(\xi _{01}=\xi _{10}=1/4\) and that the difference between \(\xi _{11}\) and \(\xi _{00}\) decreases in \(\rho \). We believe that the balanced design is optimal when \(\rho \rightarrow \infty \). It is also clear that for large \(\rho \) the designs appearing in Table 2 are generally superior to those in Table 1. For example, when \(\rho =1\) we find that the asymptotic variances in Table 1 are 11.0 and 9.3 whereas the corresponding optimal asymptotic variance given in Table 2 is 8.0. However, for values of \(\rho \) smaller than 1/4, i.e., when n is relatively small compared with m, the optimal design sets \(\xi _{00}=0.02\) and \(\xi _{01}=\xi _{10}=0.001\), which are the smallest possible values allowed by our search algorithm. This suggests that further, at most minor, improvements are possible by setting \(\xi _{00}=0\) and/or \(\xi _{01}=\xi _{10}=0\). Clearly when \(\xi _{01}=\xi _{10}=\xi _{00}=0\) we have a Type I Problem.
Therefore we next consider the situation that \(\xi _{00}=0\) and \(\xi _{01}=\xi _{10}>0\), in which case the current study comprises three arms and thus three group means: \(\bar{Y}_{1}(\mathcal {S})\), \(\bar{Y}_{2}(\mathcal {S})\) and \(\bar{Y}_{12}(\mathcal {S})\). We emphasize that this estimation problem is neither a Type I nor a Type II problem. Further observe that with these data alone we cannot estimate \(\varvec{\omega }\). Nevertheless, the pair \((\bar{Y}_{1}(\mathcal {S}),\bar{Y}_{2}(\mathcal {S}))^{T}\) whose mean is \((\eta _{0}+\eta _{1},\eta _{0}+\eta _{2})\) can be aggregated with \(\widehat{\varvec{\eta }}\) the historical estimate of \(\varvec{\eta }\). By an appropriate modification of Lemma 2 it can be shown that \(\varvec{\eta }\) can be estimated by
where \(\varvec{S}=(\bar{Y}_{1}(\mathcal {S}),\bar{Y}_{2}(\mathcal {S}))^{T}\), \(\varvec{V} = \sigma ^{2}\textrm{diag}(\xi _{01}^{-1},\xi _{10}^{-1})\) is its asymptotic variance and
is the matrix which satisfies \(\mathbb {E}(\varvec{S})=\varvec{A\eta }\). Note that (23) is of the same form as (7) but with \(\varvec{A^{T}V^{-1}A}\) instead of \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\). Now, let \(\bar{\theta }_{D}\) denote the solution to \(\varvec{\varPsi }(\theta ,\varvec{\eta }^{\dagger })=0\) which is nothing but
A straightforward calculation shows that the asymptotic variance of \(\bar{\theta }_{D}\) is given by
The formula above is useful in finding the optimal design for small values of \(\rho \) when \(\xi _{00}=0\). For example, when \(\rho =1/8\) the design \(\varvec{\xi } = (0,0.0005,0.0005, 0.9990)\) results in a variance of 2.25 (actually 2.250751), which is slightly smaller than 2.27, the variance reported in the first row of Table 2. Finally we note that when \(\rho =1/8\) then \(A_{\theta \theta }\) equals (precisely) 2.25, which means that in this application a design for a Type I Problem would be the most effective.
As noted by a referee, in addition to the above-mentioned three-arm trial one could choose various two-arm designs for \(\mathcal {S}\). For example, one can choose a design for which \(\xi _{00}>0\), \(\xi _{11}>0\) and \(\xi _{01}=\xi _{10}=0\), or alternatively a design for which \(\xi _{01}>0\), \(\xi _{11}>0\) and \(\xi _{00}=\xi _{10}=0\). It can be shown, however, that for small values of \(\rho \), where such designs are of interest, these two-arm designs are not superior to a Type I design.
4.2 Using historical estimates in drug interaction studies
This subsection deals with the optimal design of drug interaction studies. Consider two drugs \(\varvec{D}_{1}\) and \(\varvec{D}_2\) with no-effect probabilities \(\eta _1\) and \(\eta _2\), respectively, and let \(\theta \) denote the no-effect probability when both drugs are administered together. The drugs are called Bliss independent, see Bliss (1939) and Liu et al. (2018), if
If (25) does not hold and \(\theta <\eta _{1}\eta _{2}\) there is synergy among the drugs, otherwise there is antagonism. The concept of Bliss independence has seen a recent resurgence of interest as the need to assess the benefit of combination therapies and drug–drug interactions has increased. Some current references are Pallmann and Schaarschmidt (2016), Palmer and Sorger (2017), Russ and Kishony (2018) and Niu et al. (2019). Drug interaction studies are often carried out as single-dose experiments, e.g., Ansari et al. (2008), where the interaction is assessed by considering a single dose of each of the two drugs. A more elaborate design, which we will not consider here, assesses multiple drugs and doses using response surface methodology as in Lee (2010).
Naturally, the quantity of interest in drug interaction studies is
The formulation in (26) links the problem discussed here to the ANOVA setup considered earlier. In many applications of single-dose interaction tests, whether using historical data or not, an explicit or implicit asymptotic argument is used, and the theoretical results for the asymptotic case presented above are relevant. For example, Demidenko and Miller (2019) describe a Daphnia acute test with two stressors, single doses of CuSO4 and of NiCl, where the numbers of surviving organisms in water were counted after 48 hours. The observations reported were the surviving fractions of organisms only, without their original numbers; thus, essentially, it was assumed that the original numbers were very high, i.e., an asymptotic argument was applied. But as pointed out by Pallmann and Schaarschmidt (2016), in single-dose experiments correct statistical analysis should rely on the observed frequencies, and not on the observed rates of success or failure. Therefore the sample sizes used in each arm of the experiment are of crucial importance, and in this subsection we provide finite sample results.
For simplicity suppose that there exist historical estimates of \(\eta _{1}\) and \(\eta _{2}\) based on independent binomial experiments with sizes \(m_1\) and \(m_2\). Suppose further that the current study allows for the recruitment of n experimental units, \(n_1\) of which will receive \(\varvec{D}_1\), \(n_2\) will receive \(\varvec{D}_2\) and \(n_{12}\) will receive both drugs. Obviously
and \(\theta \) cannot be estimated unless \(n_{12}>0\). However, it is possible that \(n_{1}=n_{2}=0\). The goal is to allocate the experimental units optimally, which is equivalent to the problem of optimally allocating \(n+m_1+m_2\) observations in an experiment in which the single dose arms are no smaller than \(m_1\) and \(m_2\), respectively. The optimal design problem can be approximated as the minimization of the large sample variance of (26)
subject to the constraint (27).
In contrast with the design problem encountered in Sect. 4.1, the design criterion depends on the unknown parameters, i.e., the probabilities \(\theta \) and \(\varvec{\eta }\). We propose allocating observations as if \(\eta _1=\widehat{\eta _1}\), \(\eta _2=\widehat{\eta _2}\) and \(\theta = \widehat{\eta _1}\widehat{\eta _2}\), i.e., as if \(\theta \) equals its estimated value under the hypothesis of Bliss independence.
One simple approach to the minimization of (28) is the following greedy iterative procedure, which sequentially allocates each observation to the arm in which the variance is reduced most:
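A sketch of such a greedy scheme follows. The criterion coded below is the delta-method variance of the estimated Bliss contrast \(\widehat{\theta }-\widehat{\eta }_1\widehat{\eta }_2\), evaluated at \(\theta =\eta _1\eta _2\); this is an assumed stand-in for the paper's criterion (28), so the resulting allocations are illustrative rather than a reproduction of Table 3:

```python
def greedy_allocation(n, m1, m2, eta1, eta2):
    """Greedily allocate n new units among the three arms, assuming the
    criterion is the delta-method variance of theta_hat - eta1_hat*eta2_hat:
        V = th(1-th)/n12 + eta2^2 eta1(1-eta1)/(m1+n1)
                         + eta1^2 eta2(1-eta2)/(m2+n2),
    with th = eta1 * eta2 imposed under Bliss independence."""
    th = eta1 * eta2
    a12 = th * (1 - th)
    a1 = eta2 ** 2 * eta1 * (1 - eta1)
    a2 = eta1 ** 2 * eta2 * (1 - eta2)
    n12, n1, n2 = 1, 0, 0        # theta is estimable only if n12 > 0
    for _ in range(n - 1):
        # marginal variance reduction of adding one unit to each arm
        gains = (a12 / (n12 * (n12 + 1)),
                 a1 / ((m1 + n1) * (m1 + n1 + 1)),
                 a2 / ((m2 + n2) * (m2 + n2 + 1)))
        arm = gains.index(max(gains))
        if arm == 0:
            n12 += 1
        elif arm == 1:
            n1 += 1
        else:
            n2 += 1
    return n12, n1, n2

# e.g. greedy_allocation(56, 30, 50, 0.7, 0.8) sends most units to n12
```

Because the marginal gains are decreasing in each arm's size, this greedy scheme minimizes the (assumed) separable convex criterion over all feasible allocations of n units.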
For example, if \(m_1=30\), \(m_2=50\), \(\widehat{\eta }_{1}=0.7\) and \(\widehat{\eta }_{2}=0.8\), and one had 56 observations, then 55 observations would be put in the arm where both treatments are administered, and only one would be used to improve the estimate of \(\hat{\eta }_{1}\). Table 3 contains a tabulation of the optimal allocation of \((n_{12}, n_{1}, n_{2})\). For selected combinations of the values of \(m_1\), \(m_2\), \(\eta _1\), \(\eta _2\) the table gives the minimal value of n, denoted \(n_{\textrm{min}}\), for which replications of the historical observations are needed, and then the optimal allocation for \(n_{\textrm{min}}\). As one would expect, when \(\theta \) is closer to 0.5 than \(\eta _1\) or \(\eta _2\), a larger sample size \(n_{12}\) is allocated in the optimal design to estimating \(\theta \) than \(m_1\) or \(m_2\). In the opposite case, \(n_{12}\) is smaller than \(m_1\) or \(m_2\).
As pointed out by a reviewer, the optimal design may not be unique. For example, in the first entry of the table, when \(\eta _1=\eta _2 =0.3\), \(m_1=m_2=10\), and one has 23 observations to allocate, 22 observations should be used for estimating the joint effect \(\theta \) and one observation should be used to improve the estimate of either of the individual effects \(\eta _1\) or \(\eta _2\).
5 Summary and discussion
This paper focuses on the situation where historically available information, in the form of parameter estimates, is either incorporated in the analysis of a current study or used to plan a future experiment. We did not explicitly discuss the large literature on data combination schemes or other two-stage plug-in methods. As mentioned, when historical estimates are incorporated in an analysis their variability is rarely accounted for. A partial list of examples, drawn from the scientific literature, was furnished earlier and many more exist. However, it seems very difficult to find published research in which the details are given to an extent that would make replication of the analysis possible. This limits one’s ability to apply the results of this paper to published research. Nevertheless, the results presented here will inform future researchers of the scope and use of historical estimates and provide a toolkit for doing so. We hope that our investigation may have an effect on the quality of future analyses and publication standards.
We also agree with a referee who has noted that an estimate is always a coarsening of the full data, and it is clear that having only access to a historical estimate instead of the entire data leads to less efficient inferences.
Different disciplines exhibit different modes of using historical estimates. Social scientists often incorporate estimates from surveys in the process of model fitting, whereas biologists and engineers may use parameters estimated in experiments which are very different from their own. One way, of course, of incorporating historical estimates is using prior distributions within the Bayesian framework. For recent examples see Hoff (2019) and Bryan and Hoff (2020). Our approach, however, is frequentist, as are most of the applications in the literature. In particular, we show how to incorporate historical estimates in a principled way in scenarios which we classify as Type I Problems, where the historical parameters are not re-estimated, and Type II Problems, where they are. Two variants of Type II Problems are described. See Theorems 3.1, 3.2 and 3.4. We also show that if, given the current data \(\mathcal {D}\), it is possible to re-estimate the historical parameters then it is beneficial to do so at least for large sample sizes (Theorem 3.3). Other preference relations, in fact a hierarchy, among the estimators and any function thereof, were also established, cf. Theorems 3.5, 3.6. The loss of precision in the above mentioned settings is quantified in terms of a decomposition of the variance matrices. It was also demonstrated that the availability of historical estimates should be taken into account when an optimal experiment is designed. In particular, relevant methods for a two-way ANOVA and for testing drug interaction were discussed. Thus the results of this paper go beyond the existing knowledge on the use of historical estimates.
In our analysis we have assumed that the current data \(\mathcal {D}\) is a random sample and that the estimating equation (2) is of an additive form. These assumptions were made merely to simplify the exposition; the methods are easily modified to accommodate dependent data and various other estimating functions. It is clear that Type I and II Problems describe a broad range of possibilities; nevertheless, they are insufficient for describing the rich collection of problems in which historical estimates may play a role. For example, our formulation assumes that the historical parameters \(\varvec{\eta }_1,\ldots ,\varvec{\eta }_K\) are distinct. However, in many situations this is not so. In fact, some of the historical studies may be full or partial replicates of each other. In cases where the current study is a partial replicate of a historical study, simple plug-in methods or re-estimation methods may be used. One has to be careful, though, about the choice of the estimates. We are aware of situations where a simple plug-in estimator performs better than a less than optimal re-estimation method. Throughout, we have assumed interchangeability. Clearly there are many experimental settings, especially in the sciences, where this assumption is realistic. In other situations, say clinical trials, heterogeneity rather than interchangeability is the rule. In such cases some modification of the proposed methods, using random effect models, may be possible. See Rukhin (2007) and the references therein.
Finally, it is also worth mentioning that the problem of accounting for historical estimates is naturally related to sequential analysis, where data are collected over time; to meta-analysis, where the effort is to combine information from different sources; and to double sampling, especially non-nested double sampling (Hidiroglou 2001), which attempts to provide better inferences by augmenting and predicting unobserved quantities from existing data sets. The literature on combining surveys (Kim and Rao 2012) is also relevant. Further understanding can possibly be attained by incorporating ideas from these fields.
References
Anderson TW (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J Am Stat Assoc 52:200–203
Ansari MA, Shah FA, Butt TM (2008) Combined use of entomopathogenic nematodes and Metarhizium anisopliae as a new approach for black vine weevil, Otiorhynchus sulcatus, control. Entomol Exp Appl 129:340–347
Becker NG (2017) Analysis of infectious disease data. Routledge, London
Benichou J, Gail MH (1989) A delta method for implicitly defined random variables. Am Stat 43:41–44
Bliss CI (1939) The toxicity of poisons applied jointly. Ann Appl Biol 26:585–615
Bryan JG, Hoff PD (2020) Smaller \(p\)-values in genomics studies using distilled historical information. arXiv preprint arXiv:2004.07887
Chen YH, Chen H (2000) A unified approach to regression analysis under double-sampling designs. J R Stat Soc Ser B 62:449–460
Cochran WG (1954) The combination of estimates from different experiments. Biometrics 10:101–129
Cooper I, Mondal A, Antonopoulos CG (2020) A SIR model assumption for the spread of COVID-19 in different communities. Chaos Solitons Fractals 139:110057
Davidov O, Haitovsky Y (2000) Optimal design for double sampling with continuous outcomes. J Stat Plan Inference 86:253–263
Davidov O, Zelen M (2004) Overdiagnosis in early detection programs. Biostatistics 5:603–613
Douidich M, Ezzrari A, Van der Weide R, Verme P (2016) Estimating quarterly poverty rates using labor force surveys: a primer. World Bank Econ Rev 30:475–500
Demidenko E, Miller TW (2019) Statistical determination of synergy based on the definition of the Bliss drugs independence. PLoS ONE 14(11):e0224137
Edmunds WJ, O’Callaghan CJ, Nokes DJ (1997) Who mixes with whom? A method to determine the contact patterns of adults that may lead to the spread of airborne infections. Proc R Soc Lond Ser B 264:949–957
Food and Drug Administration (2006) Fixed dose combinations, co-packaged drug products, and single-entityversions of previously approved antiretrovirals for the treatment of HIV. Guidance to Industry. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Available at https://www.fda.gov/media/72248/download
Genest C, Ghoudi K, Rivest LP (1995) A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82:543–552
Goeyvaerts N, Hens N, Ogunjimi B, Aerts M, Shkedy Z, Damme PV, Beutels P (2010) Estimating infectious disease parameters from data on social contacts and serological status. J R Stat Soc Ser C 59:255–277
GOV.UK Department of Work and Pensions (2019) Family Resources Survey. UK Department of Work and Pensions. Available at https://www.gov.uk/government/collections/family-resources-survey-2
Heyde CC (2008) Quasi-likelihood and its application: a general approach to optimal parameter estimation. Springer, New York
Hidiroglou MA (2001) Double sampling. Surv Methodol 27:143–154
Hoff PD (2019) Smaller \(p\)-values via indirect information. arXiv preprint arXiv:1907.12589
Kanda S, Goto K, Shiraishi H, Kubo E, Tanaka A, Utsumi H, Sunami K, Kitazono S, Mizugaki H, Horinouchi H (2016) Safety and efficacy of nivolumab and standard chemotherapy drug combination in patients with advanced non-small-cell lung cancer: a four arms phase Ib study. Ann Oncol 27(12):2242–2250
Kanda T, Fujikoshi Y (1998) Some basic properties of the MLE’s for a multivariate normal distribution with monotone missing data. Am J Math Manag Sci 18:161–192
Kim JK, Rao JNK (2012) Combining data from two independent surveys: a model assisted approach. Biometrika 99:85–100
Kogan Y, Halevi-Tobias K, Elishmereni M, Vuk-Pavlović S, Agur Z (2012) Reconsidering the paradigm of cancer immunotherapy by computationally aided real-time personalization. Cancer Res 72:2218–2227
Kozłowska E, Färkkilä A, Vallius T, Carpén O, Kemppainen J, Grénman S, Hautaniemi S (2018) Mathematical modeling predicts response to chemotherapy and drug combinations in ovarian cancer. Cancer Res 78:4036–4044
Kronik N, Kogan Y, Elishmereni M, Halevi-Tobias K, Vuk-Pavlović S, Agur Z (2010) Predicting outcomes of prostate cancer immunotherapy by personalized mathematical models. PLoS ONE 5:e15482
Lee S (2010) Drug interaction: focusing on response surface models. Korean J Anesthesiol 58:421–434
Lee KM, Wason J (2020) Including non-concurrent control patients in the analysis of platform trials: is it worth it? BMC Med Res Methodol 20:1–12
Lee SJ, Zelen M (1998) Scheduling periodic examinations for the early detection of disease: applications to breast cancer. J Am Stat Assoc 93:1271–1281
Liu Q, Yin X, Languino LR, Altieri DC (2018) Evaluation of drug combination effect using a Bliss independence dose-response surface model. Stat Biopharm Res 10(2):112–122
Moore JC, Stinson LL, Welniak LJ (2000) Income measurement errors in surveys: a review. J Off Stat 16:331–361
Morrison DF (1971) Expectations and variances of maximum likelihood estimates of the multivariate normal distribution parameters with missing data. J Am Stat Assoc 66:602–604
Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, Heijne J (2008) Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Med 5(3):e74
Neuenschwander B, Capkun-Niggli G, Branson M, Spiegelhalter DJ (2010) Summarizing historical information on controls in clinical trials. Clin Trials 7:5–18
Newey WK, Ramalho JJ, Smith RJ (2005) Asymptotic bias for GMM and GEL estimators with estimated nuisance parameters. Identif Inference Econom Models 245–281
Niu J, Straubinger RM, Mager DE (2019) Pharmacodynamic drug–drug interactions. Clin Pharmacol Ther 105:1395–1406. https://doi.org/10.1002/cpt.1434
Pallmann P, Schaarschmidt F (2016) Common pitfalls when testing additivity of treatment mixtures with \(\chi ^2\). J Appl Entomol 140:135–141
Palmer AC, Sorger PK (2017) Combination cancer therapy can confer benefit via patient-to-patient variability without drug additivity or synergy. Cell 171:1678–1691
Peddada SD, Dinse GE, Kissling GE (2007) Incorporating historical control data when comparing tumor incidence rates. J Am Stat Assoc 102:1212–1220
Peto R, Pike M, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG (1976) Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Br J Cancer 34:585–612
Piantadosi S (2017) Clinical trials: a methodological perspective. Wiley, New York
Pierce DA (1982) The asymptotic effect of substituting estimators for parameters in certain types of statistics. Ann Stat 10:475–478
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic Dis 29:175–188
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics, Philadelphia
Raftery AE, Alkema L, Gerland P (2014) Bayesian population projections for the United Nations. Stat Sci 29:58–68
Randles RH (1982) On the asymptotic normality of statistics with estimated parameters. Ann Stat 10:462–474
Ridder G, Moffitt R (2007) The econometrics of data combination. Handbook Econ 6:5469–5547
Roig MB, Krotka P, Burman CF, Glimm E, Gold SM, Hees K, Jacko P, Koenig F, Magirr D, Mesenbrink P, Viele K (2022) On model-based time trend adjustments in platform trials with non-concurrent controls. BMC Med Res Methodol 22:1–16
Rukhin AL (2007) Estimating a common vector parameter in interlaboratory studies. J Multivar Anal 98:435–454
Russ D, Kishony R (2018) Additivity of inhibitory effects in multidrug combinations. Nat Microbiol 3. https://doi.org/10.1038/s41564-018-0252-1
Tamhane AC (1978) Inference based on the regression estimator in double sampling. Biometrika 65:419–427
Tamma PD, Cosgrove SE, Maragakis LL (2012) Combination therapy for treatment of infections with gram-negative bacteria. Clin Microbiol Rev 25:450–470
Tasseva IV (2019) The changing education distribution and income inequality in great Britain. Euromod Working Paper Series EM 16/19, University of Essex, available at https://www.euromod.ac.uk/sites/default/files/working-papers/em16-19.pdf
Thompson ME (1997) Theory of sample surveys. Chapman and Hall, London
United Nations, Department of Economic and Social Affairs, Population Division (2014) World Population Prospects: The 2012 Revision, Methodology of the United Nations Population Estimates and Projections. ESA/P/WP.235
Van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, Cambridge
Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, Micallef S (2014) Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat 13:41–54
Wallinga J, Teunis P, Kretzschmar M (2006) Using data on social contacts to estimate age-specific transmission parameters for respiratory-spread infectious agents. Am J Epidemiol 164:936–944
Yaari R, Dattner I, Huppert A (2018) A two-stage approach for estimating the parameters of an age-group epidemic model from incidence data. Stat Methods Med Res 27:1999–2014
Acknowledgements
Both authors thank COST Action IC1408, Computationally Intensive Methods for the Robust Analysis of Non-standard Data, which supported this research with grants for short visits. In addition, the work of Ori Davidov was partially supported by Israel Science Foundation Grants No. 456/17 and 2200/22, which is gratefully acknowledged. The authors are indebted to Anna Klimova for drawing their attention to the importance of Bliss independence and to the anonymous reviewers for their insightful and detailed comments.
Funding
Open access funding provided by Eötvös Loránd University.
Appendices
Appendix: Proofs
Proof of Theorem 3.1:
Proof
Since \(\bar{\varvec{\theta }}_{A}\) solves (2) we have
By assumption \(\varvec{\psi }\) is continuous and differentiable with respect to \(\varvec{\theta }\) and \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_K\). Thus, so is \(\varvec{\varPsi }\). Hence, by the mean value theorem
Applying the mean value theorem to \(\varvec{\varPsi }(\varvec{\theta }_{0},\widehat{\varvec{\eta }})\) in the display above yields
so (29) can be rewritten as
where, assuming consistency, \(\varvec{R}=o(||\bar{\varvec{\theta }}_{A}-\varvec{\theta }_{0}||)+o(||\widehat{\varvec{\eta }}-\varvec{\eta }_{0}||)=o_{p}(1)\). Now, by the continuous mapping theorem and the law of large numbers we have:
which is a \(p\times p\) matrix. Similarly,
which is a \(p\times q\) matrix. For convenience we set \(\varvec{D}_{\varvec{\theta }_0}=\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\theta })\) and \(\varvec{D}_{\varvec{\eta }_0}=\mathbb {E}_{0}(\partial \varvec{\psi } /\partial \varvec{\eta })\). Hence we can reexpress (29) more concisely as
from which it follows, by the invertibility of \(\varvec{D}_{\varvec{\theta }_0}\), that
Since the first term in the curly brackets above is a function of the data \(\mathcal {D}\) collected in \(\mathcal {S}\) and the second term depends on the historical data, i.e., the studies \(\mathcal {S}_1,\ldots ,\mathcal {S}_K\), the two terms are independent. Now, by the central limit theorem
where \(\varvec{\varSigma }_{\varvec{\psi }}=\mathbb {E}_{0}(\varvec{\psi \psi }^{T}).\) By assumption \(\sqrt{m_{j}}( \widehat{\varvec{\eta }}_{j}-\varvec{\eta }_{j,0})\Rightarrow \mathcal {N}_{q_{j}}(\varvec{0},\varvec{\varSigma }_{j})\) for each j. Thus \(\sqrt{m}(\widehat{\varvec{\eta }}-\varvec{\eta }_{0}) \Rightarrow \mathcal {N}_{q}(\varvec{0},\varvec{\varSigma })\) where \(\varvec{\varSigma } = \text {BlockDiag}(\kappa _{1}\varvec{\varSigma }_{1},\ldots ,\kappa _{K}\varvec{\varSigma }_{K})\) with \(\kappa _{j} = \lim (m/m_j)\) for \(j=1,\ldots ,K\). Thus,
Collecting terms shows that \(\sqrt{n}(\bar{\varvec{\theta }}_{A}-\varvec{\theta }_{0})\Rightarrow \mathcal {N}_{p}(\varvec{0},\varvec{A}_{\varvec{\theta \theta }})\) where \(\varvec{A}_{\varvec{\theta \theta }}\) is as stated. Now, recall that \(\bar{\varvec{\eta }}_{A}=\widehat{\varvec{\eta }}\). Thus, marginally \(\sqrt{n}(\bar{\varvec{\eta }}_{A}-\varvec{\eta }_{0})\Rightarrow \mathcal {N}_{q}(\varvec{0},\rho \varvec{\varSigma })\), so \(\varvec{A}_{\varvec{\eta \eta }} = \rho \varvec{\varSigma }\). Clearly the joint asymptotic distribution of \(\sqrt{n}(\bar{\varvec{\theta }}_{A}-\varvec{\theta }_{0},\bar{\varvec{\eta }}_{A}-\varvec{\eta }_{0})^{T}\), is also multivariate normal. Thus,
as required, completing the proof. \(\square \)
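The limiting variance in Theorem 3.1 can be illustrated by simulation. The toy scalar model below is our own construction, not from the paper: it takes \(\psi (\theta ,\eta ,y)=y-\theta -\eta \), so that \(D_{\theta }=D_{\eta }=-1\), \(\varSigma _{\psi }=1\), and the theorem predicts an asymptotic variance of \(1+\rho \varSigma \) for \(\sqrt{n}(\bar{\theta }_{A}-\theta _{0})\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar model (our construction, for illustration only):
# Y_i ~ N(theta0 + eta0, 1), estimating function psi(theta, eta, y) = y - theta - eta,
# historical estimate eta_hat ~ N(eta0, Sigma/m), independent of the current data.
theta0, eta0 = 1.0, 2.0
n, m = 200, 400
rho, Sigma = n / m, 1.0            # rho = lim n/m

R = 20_000                         # Monte Carlo replications
y_bar = rng.normal(theta0 + eta0, 1.0, size=(R, n)).mean(axis=1)
eta_hat = rng.normal(eta0, np.sqrt(Sigma / m), size=R)

# theta_bar_A solves Psi(theta, eta_hat) = 0, i.e. theta_bar_A = Ybar - eta_hat
draws = np.sqrt(n) * (y_bar - eta_hat - theta0)

# Theorem 3.1 with D_theta = D_eta = -1 and Sigma_psi = 1 predicts
# A_{theta theta} = Sigma_psi + rho * Sigma = 1.5
print(draws.var())                 # empirically close to 1.5
```

The simulated variance visibly exceeds the naive value \(\varSigma _{\psi }=1\), which is the cost of treating the historical estimate as known.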
Proof of Theorem 3.2:
Proof
Since \(\bar{\varvec{\theta }}_{B}\) solves (9) where \(\bar{\varvec{\eta }}\) is given in (7) we have
Following the derivations in the proof of Theorem 3.1, but with \(\bar{\varvec{\eta }}\) instead of \(\widehat{\varvec{\eta }}\), we find that (32) can be rewritten as
from which it follows, by the invertibility of \(\varvec{D}_{\varvec{\theta }_0}\), that
Now using (7) and (8) we find that
which we may substitute into (33) yielding
The three terms in the curly brackets in (34) satisfy:
Now, since \(\tilde{\varvec{\eta }}\) is a solution to (3) the quantity \(\sqrt{n}(\tilde{\varvec{\eta }}-\varvec{\eta }_{0})\) may be expressed as
for some function \(\varvec{\varphi }\) which is known as the influence function (cf. Van der Vaart 2000). For more details see Remark 9. It follows by the central limit theorem, that
are asymptotically jointly multivariate normal, thus so are the first two terms in the curly brackets of (34). Moreover the third term, which depends on the historical data, is independent of the first two terms and normally distributed. Now the covariance among the first two terms is
However, by assumption \(\tilde{\varvec{\eta }}\) is asymptotically unbiased and \(\sqrt{n}\)-consistent, i.e., \(\mathbb {E}_{0}(\tilde{\varvec{\eta }}) = \varvec{\eta }_0 + \varvec{b}/n + o(1/n)\), so \(\mathbb {E}_{0}(n(\tilde{\varvec{\eta }}-\varvec{\eta }_{0})\,|\,\varvec{Y}_1)=n\mathbb {E}_{0}(\tilde{\varvec{\eta }}-\varvec{\eta }_{0})+o(1) = O(1)\). Plugging the latter into the above display shows that the covariance above converges to 0 as \(n\rightarrow \infty \). It now follows that all three terms appearing in (34) are asymptotically independent.
Set \(\bar{\varvec{\eta }}_{B}=\bar{\varvec{\eta }}\) and observe that using (34) we have \(\sqrt{n}(\bar{\varvec{\eta }}_{B}-\varvec{\eta }_{0})\Rightarrow \mathcal {N}_{q}(\varvec{0},\varvec{B}_{\varvec{\eta \eta }})\) where
Since \(\gamma /(1-\gamma ) =\rho \) we may reexpress the weight matrices as \(\varvec{W}_1=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}\) and \(\varvec{W}_2=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}(\rho \varvec{\varSigma })^{-1}\). Now, using the fact that products of symmetric matrices commute and a bit of algebra it can be shown that
Collecting terms shows that \(\sqrt{n}(\bar{\varvec{\theta }}_{B}-\varvec{\theta }_{0})\Rightarrow \mathcal {N}_{p}(\varvec{0},\varvec{B}_{\varvec{\theta \theta }})\) where \(\varvec{B}_{\varvec{\theta \theta }}\) is as stated. The stochastic representation (30) shows that the joint asymptotic distribution of \(\sqrt{n}(\bar{\varvec{\theta }}-\varvec{\theta }_{0},\bar{\varvec{\eta }}-\varvec{\eta }_{0})\) is also multivariate normal with
as required, completing the proof. \(\square \)
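The algebra for the weight matrices in the proof above is easy to check numerically. The sketch below uses illustrative diagonal matrices of our own choosing, so that all products commute trivially (matching the commuting-products step of the proof), and verifies that \(\varvec{W}_1+\varvec{W}_2=\varvec{I}\) and that \(\varvec{W}_1\varvec{\varUpsilon }_{\varvec{\eta \eta }}\varvec{W}_1^{T}+\rho \varvec{W}_2\varvec{\varSigma }\varvec{W}_2^{T}\) equals \((\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\):

```python
import numpy as np

# Illustrative diagonal matrices (our values, not from the paper);
# diagonality makes all products commute.
rho = 0.5
Ups = np.diag([2.0, 1.0])          # Upsilon_{eta eta}
Sig = np.diag([1.0, 3.0])          # Sigma

P = np.linalg.inv(np.linalg.inv(Ups) + np.linalg.inv(rho * Sig))
W1 = P @ np.linalg.inv(Ups)        # weight on the current-study estimate
W2 = P @ np.linalg.inv(rho * Sig)  # weight on the historical estimate

print(np.allclose(W1 + W2, np.eye(2)))                          # True
print(np.allclose(W1 @ Ups @ W1.T + rho * W2 @ Sig @ W2.T, P))  # True
```

The second identity is exactly the precision-weighted form of \(\varvec{B}_{\varvec{\eta \eta }}\) used in the proof.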
Proof of Theorem 3.3:
The following preliminary Lemma will be used.
Lemma 1
Let \(\varvec{X}_{1}\) and \(\varvec{X}_{2}\) be random vectors with variances \(\varvec{V}_{1}=\mathbb {V}(\varvec{X}_{1})\) and \(\varvec{V}_{2}=\mathbb {V}(\varvec{X}_{2})\) such that \(\varvec{V}_{1} \preceq \varvec{V}_{2}\). Then for any matrix \(\varvec{A}\) we have \(\mathbb {V}(\varvec{A}\varvec{X}_{1}) \preceq \mathbb {V}(\varvec{A}\varvec{X}_{2})\). As a consequence we also have \(\varvec{V}_{1}^{-1} \succeq \varvec{V}_{2}^{-1} \).
Proof of Lemma 1:
Proof
Observe that
for any vector \(\varvec{u}\). The inequality \(\varvec{u}^{T}\varvec{V}_{1}\varvec{u} \le \varvec{u}^{T}\varvec{V}_{2}\varvec{u}\) holds since \(\varvec{V}_2-\varvec{V}_1\) is non-negative definite by assumption. Thus \(\mathbb {V}(\varvec{A}\varvec{X}_{1}) \preceq \mathbb {V}(\varvec{A}\varvec{X}_{2})\) as claimed.
Now choose \(\varvec{A}=\varvec{V}_{1}^{-1/2}\varvec{V}_{2}^{-1/2}\) and note that
The equalities above hold since products of symmetric matrices commute. The inequality \(\varvec{V}_{1}^{-1} \succeq \varvec{V}_{2}^{-1} \) follows immediately. \(\square \)
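Lemma 1 is easy to probe numerically. The sketch below (our own illustrative construction) draws a random pair \(\varvec{V}_1\preceq \varvec{V}_2\) and checks both claims of the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)

def is_psd(M, tol=1e-10):
    # Loewner-order check: M >= 0 iff the symmetric part of M has
    # no eigenvalue below -tol (tol absorbs floating-point noise)
    return np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol

G1 = rng.normal(size=(3, 3))
G2 = rng.normal(size=(3, 3))
V1 = G1 @ G1.T + np.eye(3)        # positive definite
V2 = V1 + G2 @ G2.T               # V2 - V1 is non-negative definite, so V1 <= V2
A = rng.normal(size=(2, 3))       # an arbitrary matrix

print(is_psd(A @ V2 @ A.T - A @ V1 @ A.T))            # True: first claim
print(is_psd(np.linalg.inv(V1) - np.linalg.inv(V2)))  # True: V1^{-1} >= V2^{-1}
```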
We now continue with the proof of Theorem 3.3:
Proof
Observe that \(\varvec{A}\) is the variance matrix of the random vector
where \(\varvec{S}_1 \sim \mathcal {N}_{p}(\varvec{0},\varvec{\varSigma }_{\varvec{\psi }})\) and \(\varvec{S}_2 \sim \mathcal {N}_{q}(\varvec{0},\rho \varvec{\varSigma })\) are independent. Similarly, \(\varvec{B}\) is the variance matrix of the random vector
where \(\varvec{S}_1 \sim \mathcal {N}_{p}(\varvec{0},\varvec{\varSigma }_{\varvec{\psi }})\) and \(\varvec{S}_3 \sim \mathcal {N}_{q}(\varvec{0},(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1})\) are independent. Now,
It is easy to verify that \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1} \succeq (\rho \varvec{\varSigma })^{-1}\) so by the second part of Lemma 1 we have \(\rho \varvec{\varSigma } \succeq (\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\) and therefore
Applying Lemma 1 we find that
as required.
Next, an application of the \(\delta \)-method and Theorems 3.1 and 3.2 shows that \(\sqrt{n}(\varvec{\varPhi }(\bar{\varvec{\theta }}_A,\bar{\varvec{\eta }}_A)-\varvec{\varPhi }(\varvec{\theta }_0,\varvec{\eta }_0)) \Rightarrow \mathcal {N}_{r}(\varvec{0},\varvec{P}\varvec{A}\varvec{P}^{T})\) and \(\sqrt{n}(\varvec{\varPhi }(\bar{\varvec{\theta }}_B,\bar{\varvec{\eta }}_B)-\varvec{\varPhi }(\varvec{\theta }_0,\varvec{\eta }_0)) \Rightarrow \mathcal {N}_{r}(\varvec{0},\varvec{P}\varvec{B}\varvec{P}^{T})\) where r is the dimension of \(\varvec{\varPhi }\) and \(\varvec{P}=\mathbb {E}_0(\partial \varvec{\varPhi } /\partial {\varvec{\omega }})\). Observe that \(\varvec{P}\varvec{A}\varvec{P}^{T}\) is the variance of the random vector \(\varvec{P}\varvec{T}_1\) whereas \(\varvec{P}\varvec{B}\varvec{P}^{T}\) is the variance of \(\varvec{P}\varvec{T}_2\). Since \(\varvec{B} \preceq \varvec{A}\) it follows from Lemma 1 that \(\varvec{P}\varvec{B}\varvec{P}^{T} \preceq \varvec{P}\varvec{A}\varvec{P}^{T}\) concluding the proof. \(\square \)
The following lemma motivates the use of the estimators (7) and (13).
Lemma 2
Let \(\varvec{W}\sim \mathcal {N}_{q}(\varvec{\eta },m^{-1}\varvec{\varSigma })\) and \((\varvec{U},\varvec{V})^{T}\sim \mathcal {N}_{p+q}((\varvec{\theta },\varvec{\eta })^{T},n^{-1}\varvec{\varUpsilon })\) be independent random vectors where
Then the MLEs of \(\varvec{\theta }\) and \(\varvec{\eta }\) are
Proof of Lemma 2:
Proof
The likelihood is given by
Now \(\varvec{U}|\varvec{V} \sim \mathcal {N}_{p}(\varvec{\lambda },\varvec{\varLambda })\) with
so \(\varvec{\lambda }\) is linear in both \(\varvec{\theta }\) and \(\varvec{\eta }\). Thus we may reparameterize \(f(\varvec{U}|\varvec{V};\varvec{\theta },\varvec{\eta })\) as \(f(\varvec{U}|\varvec{V};\varvec{\lambda })\) where
Also marginally \(\varvec{V}\) follows a \(\mathcal {N}_{q}(\varvec{\eta },n^{-1}\varvec{\varUpsilon }_{\varvec{\eta \eta }})\) distribution so
It now follows that the MLEs for \((\varvec{\lambda },\varvec{\eta })\) are
Thus by the invariance property of MLEs we find that the MLE of \(\varvec{\theta }\) is
which completes the proof. \(\square \)
Remark 8
To obtain the estimators (7) and (13) apply Lemma 2 and substitute \(\tilde{\varvec{\theta }}\) for \(\varvec{U}\), \(\tilde{\varvec{\eta }}\) for \(\varvec{V}\) and \(\widehat{\varvec{\eta }}\) for \(\varvec{W}\). Further substitute \(\tilde{\varvec{\varUpsilon }}\) and \(\widehat{\varvec{\varSigma }}\) for \(\varvec{\varUpsilon }\) and \(\varvec{\varSigma }\), respectively.
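The substitution in Remark 8 can be sketched numerically. The code below reflects our reading of the displayed formulas of Lemma 2 (the displays themselves appear only in the published version): \(\bar{\varvec{\eta }}\) pools \(\varvec{V}\) and \(\varvec{W}\) by their precisions, and \(\bar{\varvec{\theta }}=\varvec{U}-\varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}(\varvec{V}-\bar{\varvec{\eta }})\). All numerical values are illustrative:

```python
import numpy as np

# Illustrative values with p = q = 1; Var(U, V) = Upsilon/n, Var(W) = Sigma/m.
n, m = 100, 300
Ups = np.array([[1.0, 0.3],
                [0.3, 2.0]])                  # Upsilon, partitioned below
Sig = np.array([[1.5]])                       # Sigma
Ups_te, Ups_ee = Ups[:1, 1:], Ups[1:, 1:]     # Upsilon_{theta eta}, Upsilon_{eta eta}

U, V, W = np.array([0.8]), np.array([2.1]), np.array([1.9])

# eta_bar pools V and W by their precisions n*Ups_ee^{-1} and m*Sig^{-1}
P_V, P_W = n * np.linalg.inv(Ups_ee), m * np.linalg.inv(Sig)
eta_bar = np.linalg.solve(P_V + P_W, P_V @ V + P_W @ W)

# theta_bar corrects U through the regression matrix R = Ups_te Ups_ee^{-1}
R = Ups_te @ np.linalg.inv(Ups_ee)
theta_bar = U - R @ (V - eta_bar)

print(eta_bar, theta_bar)          # approx [1.94] and [0.776]
```

Note that \(\bar{\varvec{\eta }}\) lies between the two estimates, pulled toward the historical one because its precision \(m\varvec{\varSigma }^{-1}\) is larger here.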
Proof of Theorem 3.4:
Proof
First note that the difference \(\tilde{\varvec{\eta }}-\bar{\varvec{\eta }}_{C}\) in (13) is a linear combination of \(\tilde{\varvec{\eta }}\) and \(\widehat{\varvec{\eta }}\) given by
Therefore,
Since \(\varvec{\varUpsilon }\) and \(\varvec{\varSigma }\) can be consistently estimated it follows that
Clearly, the fact that \(n/(n+m)\rightarrow \gamma \) implies that \((n\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+m\varvec{\varSigma }^{-1})^{-1}n\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1} \rightarrow \varvec{W}_1 \) and \((n\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+m\varvec{\varSigma }^{-1})^{-1}m\varvec{\varSigma }^{-1} \rightarrow \varvec{W}_2\) so we may rewrite the display above as
where \(\varvec{M}\) is given in (14). Further observe that
and that
where \(\varvec{V}\) is given by (14). Now (36), (37) and (38) together imply that
as stated. In particular \(\varvec{C}_{\varvec{\theta \theta }}\) is the appropriate submatrix of \(\varvec{MVM}^T\). Multiplying out we find that
The matrices \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\), \(\varvec{\varSigma }\) and \(\varvec{W}_2\) are symmetric and thus their products commute. It follows that \(\varvec{R}\varvec{W}_{2}\varvec{\varUpsilon }_{\varvec{\eta \eta }}\varvec{W}_{2}\varvec{R}^{T}\) equals \(\varvec{R}\varvec{\varUpsilon }_{\varvec{\eta \eta }}\varvec{W}_{2}^{2}\varvec{R}^{T}\) and \(\rho \varvec{R}\varvec{W}_{2}\varvec{\varSigma }\varvec{W}_{2}\varvec{R}^{T}\) equals \(\rho \varvec{R}\varvec{\varSigma }\varvec{W}_{2}^{2}\varvec{R}^{T}\). It is also easy to verify that \(\varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{W}_{2}\varvec{R}^{T}=\varvec{R}\varvec{W}_{2}\varvec{\varUpsilon }_{\varvec{\eta \theta }}\) so
Combining and simplifying we obtain
where
Now, using symmetry, standard algebraic manipulation and the fact that \(\rho =\gamma /(1-\gamma )\) we have
Thus \(\varvec{C}_{\varvec{\theta \theta }} = \varvec{\varUpsilon }_{\varvec{\theta \theta }} - \varvec{R}\varvec{\varUpsilon }_{\varvec{\eta \eta }}\varvec{W}_{2}\varvec{R}^{T} = \varvec{\varUpsilon }_{\varvec{\theta \theta }} - \varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}\varvec{W}_{2}\varvec{\varUpsilon }_{\varvec{\theta \eta }}^{T}\) as required. It is also clear that \(\varvec{C}_{\varvec{\eta \eta }}=\varvec{B}_{\varvec{\eta \eta }}\) and that
where \(\tilde{\varvec{R}}=\tilde{\varvec{\varUpsilon }}_{\varvec{\theta \eta }}\tilde{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}\). Now \(n\textrm{Cov}(\tilde{\varvec{\theta }},\bar{\varvec{\eta }}) = n\textrm{Cov}(\tilde{\varvec{\theta }},\varvec{W}_{1}\tilde{\varvec{\eta }}+\varvec{W}_2\widehat{\varvec{\eta }}+o_{p}(1)) \rightarrow \varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{W}_1\). Further note that \(n\textrm{Cov}(\tilde{\varvec{\eta }}-\bar{\varvec{\eta }},\bar{\varvec{\eta }}) = n\textrm{Cov}(\tilde{\varvec{\eta }},\varvec{W}_1\tilde{\varvec{\eta }}+\varvec{W}_2\widehat{\varvec{\eta }}+o_{p}(1))-n\textrm{Cov}(\bar{\varvec{\eta }},\bar{\varvec{\eta }}) \rightarrow \varvec{\varUpsilon }_{\varvec{\eta \eta }}\varvec{W}_1 - (\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}=\varvec{0}\) since
where we have used the fact that \(\rho = \gamma /(1-\gamma )\). Thus \(\varvec{C}_{\varvec{\theta \eta }}=\varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{W}_1\) concluding the proof. \(\square \)
Proof of Theorem 3.5:
Proof
Suppose that \((\varvec{U},\varvec{V})^{T}\sim \mathcal {N}((\varvec{\theta },\varvec{\eta })^{T},\varvec{\varUpsilon })\) and \(\varvec{W}\sim \mathcal {N}(\varvec{\eta },\rho \varvec{\varSigma })\) are independent. Let \(I_{\varvec{\omega }}(\varvec{U},\varvec{V})\) and \(I_{\varvec{\omega }}(\varvec{U},\varvec{V},\varvec{W})\) denote the Fisher information about \(\varvec{\omega }=(\varvec{\theta },\varvec{\eta })^{T}\) in \((\varvec{U},\varvec{V})\) and in \((\varvec{U},\varvec{V},\varvec{W})\), respectively. It is clear that \(I_{\varvec{\omega }}(\varvec{U},\varvec{V}) = \varvec{\varUpsilon }^{-1}\). Moreover, repeating the calculations in the proofs of Lemma 2 and Theorem 3.4 we deduce that \(I_{\varvec{\omega }}(\varvec{U},\varvec{V},\varvec{W})=\varvec{C}^{-1}\). The additivity of the Fisher information implies that
Equation (42) and Lemma 1 imply that
as stated. The fact that \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{\varUpsilon }}^{\varvec{\varPhi }}\) now follows as in Theorem 3.3. \(\square \)
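The information-additivity step can also be checked directly: \(\varvec{W}\) carries information about \(\varvec{\eta }\) only, so additivity gives \(\varvec{C}^{-1}=\varvec{\varUpsilon }^{-1}+\text {BlockDiag}(\varvec{0},(\rho \varvec{\varSigma })^{-1})\), and Lemma 1 then yields \(\varvec{C}\preceq \varvec{\varUpsilon }\). A sketch with illustrative values of our own choosing:

```python
import numpy as np

# Illustrative values with p = q = 1 (our choice, not from the paper)
rho = 0.5
Ups = np.array([[1.0, 0.4],
                [0.4, 2.0]])       # Upsilon, variance of (U, V)
Sig = np.array([[1.5]])            # Sigma, so Var(W) = rho * Sigma

I_W = np.zeros((2, 2))
I_W[1:, 1:] = np.linalg.inv(rho * Sig)        # W is informative about eta only
C = np.linalg.inv(np.linalg.inv(Ups) + I_W)   # additivity: C^{-1} = Ups^{-1} + I_W

# Lemma 1 then gives C <= Ups in the Loewner order (here Ups - C has rank one,
# so the smallest eigenvalue of the difference is zero up to rounding)
print(np.linalg.eigvalsh(Ups - C).min() >= -1e-9)   # True
```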
Proof of Theorem 3.6:
Proof
By Equation (33) in the proof of Theorem 3.2 we have
and similarly,
The analysis in the proof of Theorem 3.2 shows that in both equations above the terms in the curly brackets are asymptotically independent. Conditions (18) and (17) immediately imply the conclusion of the Theorem. \(\square \)
Remark 9
Recall that \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) simultaneously solve \(\varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=\varvec{0}\) and \(\varvec{\varGamma }(\varvec{\theta },\varvec{\eta })=\varvec{0}\) where \(\varvec{\varGamma }(\varvec{\theta },\varvec{\eta })=n^{-1} \sum _{i=1}^{n}\varvec{\gamma }(\varvec{\theta },\varvec{\eta },\varvec{Y}_{i})\). Standard calculations show that
where
Using the above notations and rewriting Eq. (33) we have
As demonstrated in the proof of Theorem 3.2 the two terms above are asymptotically independent so we can re-express \(\bar{\varvec{\theta }}_{B}\) as
where the \(\varvec{Q}_i\) are IID \(\mathcal {N}(\varvec{0},\varvec{W}_1\varvec{\varUpsilon }_{\varvec{\eta \eta }} \varvec{W}_1^{T} + \rho \varvec{W}_2\varvec{\varSigma }\varvec{W}_2^{T})\) random variables independent of \(\mathcal {D}\). Further note that by (43)
where \(\varvec{D}^{ij}\) is the appropriate submatrix of \(\varvec{D}^{-1}\). Substituting the formulas above into Eq. (13) for \(\bar{\varvec{\theta }}_{C}\) and simplifying we find that
Therefore comparing the estimators \(\bar{\varvec{\theta }}_{B}\) and \(\bar{\varvec{\theta }}_{C}\) amounts to comparing their influence functions implicit in (45) and (46), i.e.,
and
respectively. Also note that
so although, in principle, it is always possible to compare the above influence functions, in practice this comparison is very difficult unless some further simplifying assumptions are imposed.
Davidov, O., Rudas, T. On the use of historical estimates. Stat Papers 65, 203–236 (2024). https://doi.org/10.1007/s00362-022-01375-z