1 Introduction

There are many circumstances in which a statistical analysis either requires, or can greatly benefit from, the use of historical, that is existing, information. In this paper we focus on the situation where the historical information consists of parameter estimates. These may be essential for model fitting but impossible, or prohibitively expensive, to collect in the context of the current study. Although related, we will not explicitly discuss the large literature on data combination schemes or other two-stage plug-in methods. An example of the former is Ridder and Moffitt's (2007) comprehensive review of methodologies for data combination common in econometrics, whereas an example of the latter is Genest et al.'s (1995) seminal paper on two-step semi-parametric inference in copula models. These and similar problems have received considerable attention and have a long history in Statistics, see, e.g., Cochran (1954).

Historical estimates are used in a variety of applications in the social, physical, and biomedical sciences. For example, some models for the spread of infectious diseases, such as the SIR model (Becker 2017) popularized in connection with COVID-19, e.g., Cooper et al. (2020), require the input of age-specific transmission parameters which can be estimated from social contact networks (Edmunds et al. 1997; Wallinga et al. 2006) and then used to fit epidemic models (Mossong et al. 2008; Goeyvaerts et al. 2010; Yaari et al. 2018). Another interesting application is the optimization of cancer treatment, where Kronik et al. (2010) develop a framework for predicting the outcome of prostate cancer immunotherapy by fitting personalized mathematical models. Their model consists of a set of differential equations whose behavior is governed by a collection of parameters, some of which are global while others are subject specific. The values of the global parameters were obtained from at least ten different published studies, see their Table 2, whereas the subject-level parameters were estimated by fitting a model to each participant assuming that the global parameters were estimated without error. See Kogan et al. (2012) and Kozłowska et al. (2018) for similar applications. It is worth noting that the applications above may be viewed as a model for situations in which knowledge collected in one setting, experimental or observational, is used to estimate quantities arising in a different experiment; such situations are quite common in the biomedical sciences. See Lee and Zelen (1998) and Davidov and Zelen (2004) for a similar structure arising in the planning of early detection programs.

Another very important application in which historical estimates are used is clinical trials. Consider, for example, the situation in which the effect of a combination of treatments is assessed (e.g., Tamma et al. 2012; Kanda et al. 2016). In such cases there exists a collection of therapies which have been independently proven to be somewhat successful at treating a medical condition. The objective of a new study may then be to assess whether a combination of these therapies provides an even better outcome. In the simplest case, one may view this problem as a three-armed clinical trial comparing treatments \(\varvec{A}\), \(\varvec{B}\) and \(\varvec{A+B}\) in which historical estimates of the efficacy of treatments \(\varvec{A}\) and \(\varvec{B}\) already exist. An important example of such situations is the Food and Drug Administration (2006) guidelines for submitting applications for approval of fixed dose combinations, i.e., co-packaged drug products, of previously approved antiretrovirals for the treatment of HIV. In particular, Attachment A of the aforementioned document considers the scenario in which a non-innovator, i.e., a generic drug company, wants to obtain approval for a combination of already approved ingredients. In this case, only efficacy data for the combination needs to be submitted. We will revisit and thoroughly analyze two forms of this example later on. More broadly, the use of historical data in the context of clinical trials has been investigated by numerous researchers from multiple perspectives, cf. Pocock (1976), Peto et al. (1976), Neuenschwander et al. (2010), Viele et al. (2014), and Piantadosi (2017) among many others. As noted by a referee, a particularly relevant class of designs are platform trials, which allow adding new treatments to the experiment so that controls may become non-concurrent, cf. Lee and Wason (2020) and Roig et al. (2022).

The use of historical estimates is also widespread in the social sciences. For example, in the fitting of some econometric models researchers may use values estimated from previously collected survey data. We point to the paper of Newey et al. (2005), which focuses on the asymptotic bias of the estimated parameters. The complexity of using historical estimates in the social sciences is further illustrated by the work of Tasseva (2019). In a microsimulation study investigating the effect of the recent expansion in higher education in Great Britain on household inequalities, previously obtained estimates from the Family Resources Survey for Great Britain (GOV.UK 2019) were used. While sampling variability could be taken into account using bootstrap methods, as noted by the author, measurement error, inevitably present in income information collected in surveys, see, e.g., Moore et al. (2000), could not be accounted for using this method. Similarly, Douidich et al. (2016) describe an imputation-related method for incorporating estimates obtained in labor force surveys (which are easily and cheaply conducted) into household expenditure surveys (which are much more time-consuming and expensive) in order to estimate poverty rates in Morocco. Likewise, demographic model fitting and projections rely on historical data. The standard method of population projections (see United Nations 2014) is based on the combination of cohort survival rates, i.e., historical data, with current data on cohort sizes. Raftery et al. (2014) proposed a Bayesian approach to take the uncertainty associated with historical data into account. It is worth noting that in this case the uncertainty accounted for by Bayesian modeling did not come from observational errors but rather from the fact that the true population figures may have changed over time.

Researchers often do not adequately account for the variability of the historical estimates when incorporating them into a current analysis. In fact, the practice of plugging in the estimated values for certain parameters is widespread. However, this practice is often not disclosed, as many practitioners view this strategy as a natural way of “doing things”. Consequently, the objectives and contributions of this communication are twofold: first, we draw attention to current practice, and second, and more importantly, we provide a principled methodology for incorporating historical estimates into a current analysis. Surprisingly, despite the ubiquity of historical data and estimates and the many papers that touch on various aspects thereof, a general methodology for the use of historical estimates, as given here, has been thus far lacking. In particular, we consider two broad settings in which historical estimates are employed. Three such estimators are presented in Sect. 3 and one in Sect. 4.1. Their limiting distributions are found and a theoretical analysis comparing their precision is conducted. When comparable, a preference order among the different approaches is established. Our paper goes beyond the existing knowledge by providing an inventory of ways in which historical estimates can be used, and by quantifying the properties of the resulting estimators. We also demonstrate how these results may be used in the design of experiments.

The paper is organized as follows. Our notation and formulation are outlined in Sect. 2. Section 3 provides our main theoretical findings, which include the limiting distributions of the estimates in the presence of historical estimates and a comparison thereof. In Sect. 4 two applications are described in conjunction with accompanying numerical experiments. The first application addresses the two-way analysis of variance (ANOVA) problem introduced in Sect. 2. The second, related application deals with a drug interaction study within the framework of Bliss independence (Bliss 1939), an old concept which has garnered much recent attention. We conclude with a discussion in Sect. 5. All proofs are collected in an Appendix.

2 Notation and formulation

Consider a designed experiment or observational study, denoted by \(\mathcal {S}\), in which data \(\mathcal {D}\) consisting of n observations are collected. Usually, the observations are independent and identically distributed. Suppose further that the model describing the distribution of \(\mathcal {D}\) is indexed by \(\varvec{\omega }^T=(\varvec{\theta }^T,\varvec{\eta }^T)\) where \(\varvec{\theta }\in \mathbb {R}^{p}\) and \(\varvec{\eta }\in \mathbb {R}^{q}\) is the concatenation of \(\varvec{\eta }_{1}\in \mathbb {R}^{q_1},\ldots ,\varvec{\eta }_{K}\in \mathbb {R}^{q_K}\) with \(q=q_1+\cdots +q_K\). Let \(\varvec{\varPhi }(\varvec{\omega })\) be some function of the model parameters which is of interest to the researchers. Clearly, \(\varvec{\varPhi }(\varvec{\omega })\) may be a function of \(\varvec{\theta }\) alone, of \(\varvec{\eta }\) alone, or of both \(\varvec{\theta }\) and \(\varvec{\eta }\). The primary goal of the study \(\mathcal {S}\), which we refer to as the current study, is inference on \(\varvec{\varPhi }(\varvec{\omega })\) in the presence of historical data, which we view as a collection of K independent estimates \(\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K}\) obtained from historical studies \(\mathcal {S}_1,\ldots ,\mathcal {S}_K\) of sizes \(m_1,\ldots ,m_K\); throughout, \(m= m_1+\cdots +m_K\) denotes the total sample size of the historical studies.

In some circumstances, it may not be possible to estimate \(\varvec{\omega }\) using the data \(\mathcal {D}\). However, if \(\varvec{\eta }\) were known in advance then it would be possible to estimate \(\varvec{\theta }\). As an example, such a situation arises if the model \(f(\cdot ;\varvec{\theta },\varvec{\eta })\) is not identifiable whereas the model \(f(\cdot ;\varvec{\theta },\varvec{\eta }_0)\) is identifiable for every fixed value of \(\varvec{\eta }_0\). In other circumstances, given the data \(\mathcal {D}\), both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable (e.g., Peddada et al. 2007). Thus, in this communication we consider two distinct settings, the second of which has two variants. In the first setting, referred to as a Type I Problem, only the parameter \(\varvec{\theta }\) is estimable using the data \(\mathcal {D}\), while \((\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K})\) are fixed at their historical estimated values \((\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K})\). In the second setting, referred to as a Type II Problem, both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable using \(\mathcal {D}\) and a two-step procedure is utilized to estimate \(\varvec{\theta }\) while updating the estimates for \((\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K})\). It may also happen that the available data correspond to a Type II Problem, so a Type II analysis would be possible, yet the researcher decides to conduct a Type I analysis, i.e., to estimate \(\varvec{\theta }\) as if the data came from a Type I Problem. One of our results shows that this is an inferior strategy: if the data \(\mathcal {D}\) identifies \(\varvec{\omega }\), it is always advisable to re-estimate \(\varvec{\eta }\), and the loss of precision is quantified in terms of a simple decomposition of the variance matrices of the resulting estimators.
It is also important to emphasize that there are situations in which the investigator, by means of the design of the study \(\mathcal {S}\), may control whether the problem is of Type I or a Type II.

To fix ideas consider the two-way ANOVA model in which the expected value of an outcome Y is given by

$$\begin{aligned} \mathbb {E}(Y|T_{1},T_{2}) = \eta _{0} + \eta _{1}T_{1} + \eta _{2}T_{2} + \theta T_{1}T_{2} \end{aligned}$$
(1)

where for \(i=1,2\), \(T_{i}\in \{0,1\}\) indicates whether treatment i is administered. Here \(\eta _{0}\) denotes the mean of Y when neither treatment is administered, \(\eta _{i}\) models the marginal increase in the expectation of Y when treatment i is administered, and \(\theta \) models the interaction \(T_{1} \times T_{2}\). Suppose now that the historical data consists of two studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) of sizes \(m_1\) and \(m_2\), respectively, where in the study \(\mathcal {S}_i\) treatment i was compared with a control. Clearly, the historical data provides no information on \(\theta \). Thus, inference on \(\theta \) would require a new study \(\mathcal {S}\) in which \(T_{1} = T_{2} = 1\) for some subset of the observations. For simplicity, interchangeability is assumed, i.e., all experimental units, in \(\mathcal {S}_1\) and \(\mathcal {S}_2\) as well as \(\mathcal {S}\), are assumed to be drawn from the same population, e.g., Peddada et al. (2007), and therefore any change in the mean response may be attributed solely to the treatment combination received. The assumption of interchangeability may be relaxed as discussed in Sect. 5.

One objective of this communication is to provide a methodology for the effective design and analysis of a new study \(\mathcal {S}\) of size n which allows the estimation of \(\theta \) and utilizes the historical estimates of \((\eta _{0},\eta _{1},\eta _{2})\) obtained from \(\mathcal {S}_1\) and \(\mathcal {S}_2\). Depending on its objectives, the study \(\mathcal {S}\) may take various forms. For example, one may choose to allocate all n observations to receive both treatments, i.e., \(T_1=T_2=1\). In this case, the data \(\mathcal {D}\) is an IID sample of observations with mean \(\eta _0+\eta _1+\eta _2+\theta \) and variance \(\sigma ^2\). Although the parameter \(\theta \) is not identifiable from \(\mathcal {D}\) alone, it is estimable given the historical data, so this is clearly a Type I Problem. Alternatively, if \(\mathcal {S}\) allocates observations to all treatment combinations, then \(\theta \) as well as \((\eta _0,\eta _1,\eta _2)\) are estimable from \(\mathcal {S}\) and this falls within the framework of a Type II Problem. This example will be further analyzed in Sect. 4.1.
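The Type I design just described can be illustrated with a small simulation; the sample sizes, effect sizes, and normal errors below are invented for the sketch and are not taken from any of the studies discussed:

```python
import numpy as np

rng = np.random.default_rng(0)
eta0, eta1, eta2, theta, sigma = 1.0, 0.5, 0.7, 0.3, 1.0
m1 = m2 = 2000   # historical study sizes (illustrative)
n = 2000         # current study size (illustrative)

# Historical study S_i: m_i controls and m_i subjects on treatment i alone.
def historical(eta_i, m):
    control = rng.normal(eta0, sigma, m)
    treated = rng.normal(eta0 + eta_i, sigma, m)
    return control, treated

c1, t1 = historical(eta1, m1)
c2, t2 = historical(eta2, m2)
eta0_hat = np.concatenate([c1, c2]).mean()   # pooled control mean
eta1_hat = t1.mean() - c1.mean()
eta2_hat = t2.mean() - c2.mean()

# Current study S: all n observations receive both treatments (Type I Problem),
# so E(Y) = eta0 + eta1 + eta2 + theta and theta is recovered by plug-in.
y = rng.normal(eta0 + eta1 + eta2 + theta, sigma, n)
theta_bar_A = y.mean() - (eta0_hat + eta1_hat + eta2_hat)
```

Repeating the simulation many times would show that the sampling variability of `theta_bar_A` reflects both the current study and the historical studies, which is precisely the point quantified in Sect. 3.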

3 Results

Our main theoretical findings, namely Theorems 3.1, 3.2, and 3.4, describe the limiting distributions of estimators for \(\varvec{\omega }\), which are then compared in Theorems 3.3, 3.5 and 3.6. Remark 7 provides a brief summary of the results of this section.

3.1 Type I problems

Suppose first that we are in the setting of a Type I Problem. Recall that in such circumstances only \(\varvec{\theta }\) is estimated while \((\varvec{\eta }_1,\ldots ,\varvec{\eta }_K)\) are fixed at their historical values. Thus, let \(\bar{\varvec{\theta }}_{A}\) solve

$$\begin{aligned} \varvec{\varPsi }(\varvec{\theta },\widehat{\varvec{\eta }})=\varvec{0} \end{aligned}$$
(2)

where \(\widehat{\varvec{\eta }}=(\widehat{\varvec{\eta }}_{1}^{T},\ldots ,\widehat{\varvec{\eta }}_{K}^{T})^{T}\). The estimating equation (2) may be a score equation motivated by likelihood theory, a generalized estimating equation derived by quasi-likelihood, or one arising from any other statistical estimation framework. Observe that the solution \(\bar{\varvec{\theta }}_{A}\) of (2) is obtained by plugging in the sample values of the K independent estimators \(\widehat{\varvec{\eta }}_{1},\ldots ,\widehat{\varvec{\eta }}_{K}\). For simplicity we may further assume that the data \(\mathcal {D}\) is a random sample \(\varvec{Y}_{1},\ldots ,\varvec{Y}_{n}\) and (2) is of the form

$$\begin{aligned} \varvec{\varPsi }(\varvec{\theta },\widehat{\varvec{\eta }})=n^{-1}\sum _{i=1}^{n}\varvec{\psi }(\varvec{\theta },\widehat{\varvec{\eta }},\varvec{Y}_{i}). \end{aligned}$$

The function \(\varvec{\psi }\) is assumed to be: (i) continuously differentiable with respect to both \(\varvec{\theta }\) and \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\); it is further assumed to satisfy (ii) \(\mathbb {E}_{0}(\varvec{\psi }) = \varvec{0}\); (iii) \(\mathbb {E}_{0}(\varvec{\psi }\varvec{\psi }^T) < \infty \); (iv) the matrix \(\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\eta })\) exists; and (v) the matrix \(\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\theta })\) exists and is invertible. Here \(\mathbb {E}_{0}(\cdot )\) denotes the expectation taken at \(\varvec{\omega }_{0}=(\varvec{\theta }_0,\varvec{\eta }_{0})=(\varvec{\theta }_0,\varvec{\eta }_{1,0},\ldots ,\varvec{\eta }_{K,0})\), the true value of all parameters. Conditions (i)–(v) are all standard regularity conditions often imposed in the literature (cf. Heyde 2008; Van der Vaart 2000). We now have the following:

Theorem 3.1

Let \(\bar{\varvec{\theta }}_{A}\) be a solution to (2) and set \(\bar{\varvec{\eta }}_{A}=\widehat{\varvec{\eta }}\). Assume that: (i) \(\bar{\varvec{\theta }}_{A}\) is consistent at \(\varvec{\omega }_{0}\); (ii) the estimating function \(\varvec{\psi }\) satisfies the regularity conditions listed above; and (iii) the historical estimates satisfy \(\sqrt{m_j}(\widehat{\varvec{\eta }}_{j}-\varvec{\eta }_{j,0}) \Rightarrow \mathcal {N}_{q_{j}}(\varvec{0},\varvec{\varSigma }_{j})\) and are independent of each other and of the current study. Then if \((m/m_j)\rightarrow \kappa _j < \infty \) for all \(j=1,\ldots ,K\) as \(m_j \rightarrow \infty \) and \(n/m \rightarrow \rho \in (0,\infty )\) as \(n \rightarrow \infty \) we have

$$\begin{aligned} \sqrt{n}(\bar{\varvec{\theta }}_{A}-\varvec{\theta }_{0},\bar{\varvec{\eta }}_{A}-\varvec{\eta }_{0})^T \Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{A}) \end{aligned}$$

where

$$\begin{aligned} \varvec{A} = \begin{pmatrix} \varvec{A}_{\varvec{\theta \theta }} &{} \varvec{A}_{\varvec{\theta \eta }} \\ \varvec{A}_{\varvec{\eta \theta }} &{} \varvec{A}_{\varvec{\eta \eta }} \end{pmatrix} \end{aligned}$$

with

$$\begin{aligned} \varvec{A}_{\varvec{\theta \theta }}&=(\varvec{D}^{-1}_{\varvec{\theta }_0})[\varvec{\varSigma }_{\varvec{\psi }} + \rho \varvec{D}_{\varvec{\eta }_0} \varvec{\varSigma } \varvec{D}^{T}_{\varvec{\eta }_0}] (\varvec{D}^{-1}_{\varvec{\theta }_0})^{T},\\ \varvec{A}_{\varvec{\theta \eta }}&=-\rho (\varvec{D}_{\varvec{\theta }_0})^{-1}\varvec{D}_{\varvec{\eta }_0}\varvec{\varSigma },\\ \varvec{A}_{\varvec{\eta \eta }}&=\rho \varvec{\varSigma }. \end{aligned}$$

where \(\varvec{D}_{\varvec{\theta }_0}=\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\theta })\), \(\varvec{D}_{\varvec{\eta }_0}=\mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\eta })\), \(\varvec{\varSigma }_{\varvec{\psi }}=\mathbb {E}_{0}(\varvec{\psi }\varvec{\psi }^T)\) and \(\varvec{\varSigma } = \text {BlockDiag}(\kappa _{1}\varvec{\varSigma }_{1},\ldots ,\kappa _{K}\varvec{\varSigma }_{K})\).

Remark 1

Clearly, \(\varvec{A}_{\varvec{\theta \theta }}\) is the \(p\times p\) asymptotic variance matrix of \(\bar{\varvec{\theta }}_{A}\), \(\varvec{A}_{\varvec{\eta \eta }}\) is the \(q\times q\) asymptotic variance matrix of \(\bar{\varvec{\eta }}_{A}\) and \(\varvec{A}_{\varvec{\theta \eta }} = \varvec{A}_{\varvec{\eta \theta }}^T\) is their \(p\times q\) asymptotic covariance matrix.

Remark 2

As pointed out by a referee the application of Theorem 3.1 is predicated on the fact that the historical studies report the estimates of the variance matrices \(\varvec{\varSigma }_{1},\ldots ,\varvec{\varSigma }_{K}\).

The proof of Theorem 3.1 is a straightforward, but somewhat involved, application of the delta method. In contrast with Randles (1982) and Pierce (1982), which describe the limiting distribution of statistics that are explicit functions of estimated parameters, the estimator \(\bar{\varvec{\theta }}_{A}\) is an implicit function of \(\bar{\varvec{\eta }}_{A}\). For a related but less general result see Benichou and Gail (1989). Further note that \((\varvec{D}^{-1}_{{\varvec{\theta }_0}})\varvec{\varSigma }_{\varvec{\psi }}(\varvec{D}^{-1}_{{\varvec{\theta }_0}})^{T}\) is the asymptotic variance of \(\bar{\varvec{\theta }}_{A}\) when the true values of \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) are known in advance. Thus the term

$$\begin{aligned} \rho \varvec{D}_{\varvec{\eta }_0} \varvec{\varSigma }\varvec{D}^{T}_{\varvec{\eta }_0} \end{aligned}$$

may be viewed as the penalty for substituting estimates for the true values of the parameters. The penalty may also be rewritten as \(\rho \sum _{j=1}^{K} \kappa _{j} \varvec{D}_{j} \varvec{\varSigma }_{j} \varvec{D}^{T}_{j}\), where \(\varvec{D}_{j} = \mathbb {E}_{0}(\partial \varvec{\psi }/\partial \varvec{\eta }_{j})\). This expression makes explicit the dependence of the penalty on the relative sample sizes, the asymptotic variances of the historical estimators, and the sensitivity of the estimation procedure with respect to the historical estimates, embodied in the matrices \(\varvec{D}_1,\ldots ,\varvec{D}_K\).
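To make the decomposition concrete, the following sketch evaluates \(\varvec{A}_{\varvec{\theta \theta }}\) and the penalty term for a one-parameter example; all matrices below are invented for illustration, not drawn from any model in the paper:

```python
import numpy as np

# Invented example with p = 1, q = 2, illustrating
# A_thetatheta = D^{-1} [Sigma_psi + rho * D_eta Sigma D_eta^T] (D^{-1})^T.
D_theta   = np.array([[-1.0]])        # E_0(d psi / d theta)
D_eta     = np.array([[-0.6, -0.4]])  # E_0(d psi / d eta)
Sigma_psi = np.array([[1.0]])         # E_0(psi psi^T)
Sigma     = np.diag([2.0, 3.0])       # BlockDiag(kappa_1 Sigma_1, kappa_2 Sigma_2)
rho       = 0.5                       # limit of n / m

D_inv   = np.linalg.inv(D_theta)
base    = D_inv @ Sigma_psi @ D_inv.T                    # variance with eta known
penalty = rho * (D_inv @ D_eta @ Sigma @ D_eta.T @ D_inv.T)
A_thetatheta = base + penalty
```

In this toy configuration the penalty inflates the asymptotic variance from 1.0 to 1.6, which illustrates how imprecise historical estimates propagate into \(\bar{\varvec{\theta }}_{A}\).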

Remark 3

Note that if \(\rho \) is very small, which occurs when \(m \gg n\), then the penalty is inconsequential, i.e., the asymptotic variance of \(\bar{\varvec{\theta }}_{A}\) is close to its variance when \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) are fully known.
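The plug-in mechanics of a Type I analysis can also be sketched end-to-end. The Poisson regression below, the frozen historical intercept, and all sample sizes are assumptions made purely for illustration:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
eta_true, theta_true = 0.2, 0.8
n = 500
x = rng.integers(0, 2, n).astype(float)            # a binary covariate
y = rng.poisson(np.exp(eta_true + theta_true * x))  # Poisson log-linear model

# Historical estimate of the nuisance intercept, assumed given
# (set at the true value here for the sketch).
eta_hat = 0.2

# Plug-in estimating equation Psi(theta, eta_hat) = n^{-1} sum_i psi(theta, eta_hat, Y_i):
# here psi is the Poisson score for theta with the intercept frozen at eta_hat.
def Psi(theta):
    return np.mean(x * (y - np.exp(eta_hat + theta * x)))

theta_bar_A = brentq(Psi, -5.0, 5.0)   # root of the plug-in equation (2)
```

The root `theta_bar_A` plays the role of \(\bar{\varvec{\theta }}_{A}\); its limiting variance is the one given in Theorem 3.1, with the penalty term reflecting the uncertainty in `eta_hat`.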

3.2 Type II problems

Next, consider the case where both \(\varvec{\theta }\) and \(\varvec{\eta }\) are estimable using the data \(\mathcal {D}\) observed in the current study \(\mathcal {S}\). Further assume that two estimating functions \(\varvec{\varPsi }\) and \(\varvec{\varGamma }\) are available to us; \(\varvec{\varPsi }\) is an estimating function for \(\varvec{\theta }\) given a known fixed value of \(\varvec{\eta }\), as in Type I Problems, whereas \(\varvec{\varGamma }\) is an estimating function for \(\varvec{\eta }\) given a known fixed value of \(\varvec{\theta }\). For example, within the likelihood framework \(\varvec{\varPsi }\) is the score with respect to \(\varvec{\theta }\) while \(\varvec{\varGamma }\) is the score with respect to \(\varvec{\eta }\).

We propose estimating \(\varvec{\omega }\) using a two step procedure. In the first step the data \(\mathcal {D}\) is used to obtain the pair \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})^T\) which simultaneously solve

$$\begin{aligned} \varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=\varvec{0} \quad \text{ and } \quad \varvec{\varGamma }(\varvec{\theta },\varvec{\eta })=\varvec{0}. \end{aligned}$$
(3)

Under standard regularity conditions, cf., the conditions listed just before the statement of Theorem 3.1, the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})^T\) satisfy

$$\begin{aligned} \sqrt{n}(\tilde{\varvec{\theta }}-\varvec{\theta }_{0},\tilde{\varvec{\eta }}-\varvec{\eta }_{0})^T \Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{\varUpsilon }) \end{aligned}$$
(4)

where \(\varvec{\varUpsilon }\) is assumed to be a non-singular variance matrix which can be consistently estimated from the data by, say, \(\tilde{\varvec{\varUpsilon }}\), the standard sandwich estimator (Van der Vaart 2000). For convenience, we may partition \(\varvec{\varUpsilon }\) as

$$\begin{aligned} \varvec{\varUpsilon } = \begin{pmatrix} \varvec{\varUpsilon }_{\varvec{\theta }\varvec{\theta }} &{} \varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }}\\ \varvec{\varUpsilon }_{\varvec{\eta }\varvec{\theta }} &{} \varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }} \end{pmatrix} \end{aligned}$$
(5)

where \(\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\theta }}\) and \(\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}\) denote the marginal asymptotic variances of \(\tilde{\varvec{\theta }}\) and \(\tilde{\varvec{\eta }}\), respectively, and \(\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }}\) is their asymptotic covariance. Naturally, a similar partition holds for \(\tilde{\varvec{\varUpsilon }}\). Furthermore, as in Sect. 3.1, at our disposal are K independent historical estimates of \(\varvec{\eta }_{1},\ldots ,\varvec{\eta }_{K}\) obtained using studies of sizes \(m_1,\ldots ,m_K\) which satisfy \(\sqrt{m_{j}}(\widehat{\varvec{\eta }}_{j}-\varvec{\eta }_{j,0})\Rightarrow \mathcal {N}_{q_{j}}(\varvec{0},\varvec{\varSigma }_{j})\), where, again, it is assumed that \(\varvec{\varSigma }_{j}\) are non-singular and can be consistently estimated for all \(j=1,\ldots ,K\). Thus

$$\begin{aligned} \sqrt{m}(\widehat{\varvec{\eta }}-\varvec{\eta }_{0})\Rightarrow \mathcal {N}_{q}(\varvec{0},\varvec{\varSigma }) \end{aligned}$$
(6)

where \(\varvec{\varSigma }\) is given in the statement of Theorem 3.1. Let \(\widehat{\varvec{\varSigma }}\) be a consistent estimator of \(\varvec{\varSigma }\).

The historical and current estimates of \(\varvec{\eta }\) can be aggregated, or combined, in many ways. Lemma 2, appearing in the Appendix, suggests using the estimator

$$\begin{aligned} \bar{\varvec{\eta }} = (n\tilde{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}+m\widehat{\varvec{\varSigma }}^{-1})^{-1} (n\tilde{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}\tilde{\varvec{\eta }}+m\widehat{\varvec{\varSigma }}^{-1}\widehat{\varvec{\eta }}) \end{aligned}$$
(7)

which is the MLE under normality assuming that the matrices \({\varvec{\varUpsilon }}_{\varvec{\eta \eta }}\) and \(\varvec{\varSigma }\) are known. Note that

$$\begin{aligned} \bar{\varvec{\eta }}= \varvec{W}_1\tilde{\varvec{\eta }}+\varvec{W}_2\widehat{\varvec{\eta }}+o_{p}(1) \end{aligned}$$

where the weights \(\varvec{W}_1\) and \(\varvec{W}_2\) are the symmetric matrices

$$\begin{aligned} \varvec{W}_1= & {} (\gamma \varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(1-\gamma )\varvec{\varSigma }^{-1})^{-1}\gamma \varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1} \quad \text {and} \nonumber \\ \varvec{W}_2= & {} (\gamma \varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(1-\gamma )\varvec{\varSigma }^{-1})^{-1}(1-\gamma )\varvec{\varSigma }^{-1} \end{aligned}$$
(8)

which satisfy \(\varvec{I} =\varvec{W}_1+\varvec{W}_2\) with \(\gamma = \lim (n/(n+m))\). Thus (7) differs from the best linear unbiased estimator by at most an \(o_{p}(1)\) term.
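The aggregation step (7)-(8) amounts to a precision-weighted average, which the following sketch computes from invented summary statistics (the estimates, sample sizes, and variance matrices below are not from any real study):

```python
import numpy as np

n, m = 200, 800
eta_tilde = np.array([1.10, 0.45])               # current-study estimate of eta
eta_hat   = np.array([0.95, 0.55])               # historical estimate of eta
Ups_ee = np.array([[1.0, 0.2], [0.2, 0.8]])      # tilde Upsilon_{eta eta}
Sigma  = np.array([[1.5, 0.0], [0.0, 1.2]])      # hat Sigma

P_cur = n * np.linalg.inv(Ups_ee)   # precision contributed by the current study
P_his = m * np.linalg.inv(Sigma)    # precision contributed by the historical studies
eta_bar = np.linalg.solve(P_cur + P_his, P_cur @ eta_tilde + P_his @ eta_hat)

# The limiting weights (8), with gamma = n / (n + m):
gamma = n / (n + m)
M  = gamma * np.linalg.inv(Ups_ee) + (1 - gamma) * np.linalg.inv(Sigma)
W1 = np.linalg.solve(M, gamma * np.linalg.inv(Ups_ee))
W2 = np.linalg.solve(M, (1 - gamma) * np.linalg.inv(Sigma))
```

With these fixed matrices the identity \(\bar{\varvec{\eta }}= \varvec{W}_1\tilde{\varvec{\eta }}+\varvec{W}_2\widehat{\varvec{\eta }}\) holds exactly, and \(\varvec{W}_1+\varvec{W}_2=\varvec{I}\).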

In the second step we find \(\bar{\varvec{\theta }}_{B}\) by solving

$$\begin{aligned} \varvec{\varPsi }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0} \end{aligned}$$
(9)

where \(\bar{\varvec{\eta }}\) is given by (7). We now have:

Theorem 3.2

Let \(\bar{\varvec{\theta }}_{B}\) be a solution to (9) where \(\bar{\varvec{\eta }}_{B}=\bar{\varvec{\eta }}\) is given in (7). Assume that the regularity conditions of Theorem 3.1 hold. Then

$$\begin{aligned} \sqrt{n}(\bar{\varvec{\theta }}_{B}-\varvec{\theta }_{0},\bar{\varvec{\eta }}_{B}-\varvec{\eta }_{0}) \Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{B}) \end{aligned}$$
(10)

where

$$\begin{aligned} \varvec{B} = \begin{pmatrix} \varvec{B}_{\varvec{\theta \theta }} &{} \varvec{B}_{\varvec{\theta \eta }} \\ \varvec{B}_{\varvec{\eta \theta }} &{} \varvec{B}_{\varvec{\eta \eta }} \end{pmatrix} \end{aligned}$$

with

$$\begin{aligned} \varvec{B}_{\varvec{\theta \theta }}&= (\varvec{D}^{-1}_{\varvec{\theta }_0})[\varvec{\varSigma }_{\varvec{\psi }} + \varvec{D}_{\varvec{\eta }_0}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{D}^{T}_{\varvec{\eta }_0}] (\varvec{D}^{-1}_{\varvec{\theta }_0})^{T},\\ \varvec{B}_{\varvec{\theta \eta }}&=-\varvec{D}_{\varvec{\theta }_{0}}^{-1}\varvec{D}_{\varvec{\eta }_{0}}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1},\\ \varvec{B}_{\varvec{\eta \eta }}&=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}. \end{aligned}$$

Although the mechanics are slightly more involved, the proof of Theorem 3.2 builds on the proof of Theorem 3.1. Moreover, the structures of the asymptotic variance matrices \(\varvec{A}\) and \(\varvec{B}\) are analogous with the exception that the variance matrix \(\rho \varvec{\varSigma }\) appearing in \(\varvec{A}\) is replaced with \((\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\) in \(\varvec{B}\).

Remark 4

Observe that \((\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1} \rightarrow \varvec{0}\) as \(\rho \rightarrow 0\) so the conclusions of Remark 3 hold here as well.

It is clear that whenever the model for \(\mathcal {D}\) identifies \(\varvec{\omega }\), both \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \((\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) can be computed. Next, using the concept of the Loewner order, we show that the former is superior to the latter. Recall that the matrix \(\varvec{V}_1\) is said to be smaller in the Loewner order than the matrix \(\varvec{V}_2\) if \(\varvec{V}_2-\varvec{V}_1\) is non-negative definite (Pukelsheim 2006). This relationship is denoted by \(\varvec{V}_1\preceq \varvec{V}_2\). Suppose now that \(\varvec{V}_1\) and \(\varvec{V}_2\) are the variances of two (asymptotically) unbiased estimators. Then \(\varvec{V}_1\preceq \varvec{V}_2\) implies that the estimator associated with \(\varvec{V}_1\) is more efficient than the estimator associated with \(\varvec{V}_2\). This means, for example, that the confidence ellipsoid associated with \(\varvec{V}_1\) lies within the confidence ellipsoid associated with \(\varvec{V}_2\).

Theorem 3.3

Whenever the data \(\mathcal {D}\) identifies \(\varvec{\omega }\) we have

$$\begin{aligned} \varvec{B} \preceq \varvec{A}. \end{aligned}$$
(11)

Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{A}}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{A}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) and \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) respectively.

Theorem 3.3 indicates that, if possible, it is always asymptotically beneficial to estimate both \(\varvec{\theta }\) and \(\varvec{\eta }\) using the data \(\mathcal {D}\) collected in the study \(\mathcal {S}\). Moreover, Theorem 3.3 also holds when only a sub-vector of \(\varvec{\eta }\) is identified by the data \(\mathcal {D}\).
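In practice, a claimed Loewner-order comparison between two estimated variance matrices can be checked numerically by testing whether their difference is non-negative definite via its eigenvalues; the matrices below are invented stand-ins for \(\varvec{B}\) and \(\varvec{A}\) in (11):

```python
import numpy as np

def loewner_leq(Va, Vb, tol=1e-10):
    """True if Va precedes Vb in the Loewner order, i.e., Vb - Va is non-negative definite."""
    return bool(np.all(np.linalg.eigvalsh(Vb - Va) >= -tol))

# Invented variance matrices playing the roles of B and A.
V_B = np.array([[1.0, 0.3], [0.3, 0.9]])
V_A = np.array([[1.4, 0.3], [0.3, 1.1]])
ordered = loewner_leq(V_B, V_A)   # V_B precedes V_A here
```

The small tolerance guards against spurious negative eigenvalues caused by floating-point error when the difference matrix is nearly singular.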

Remark 5

As noted by a referee, an alternative approach to Type II Problems would be to combine the first and second estimation steps. This can be done by simultaneously solving the estimating equations

$$\begin{aligned} \varvec{\varPsi }(\varvec{\theta },{\varvec{\eta }}) = \varvec{0} \quad \text{ and } \quad \varvec{\varGamma }(\varvec{\theta },{\varvec{\eta }})+m\widehat{\varvec{\varSigma }}^{-1}(\widehat{\varvec{\eta }}-\varvec{\eta })=\varvec{0}. \end{aligned}$$
(12)

The system (12) is obtained from (3) by augmenting the estimating function \(\varvec{\varGamma }\) with the term \(m\widehat{\varvec{\varSigma }}^{-1}(\widehat{\varvec{\eta }}-\varvec{\eta })\). The latter is a pseudo-score equation which follows directly from (6). By appropriately modifying the proof of Theorem 3.2 it can be shown that \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) have the same limiting distribution as the solution of (12). It thus follows that the estimator (7) is asymptotically efficient up to the first order.
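A minimal sketch of solving an augmented system of the form (12) directly, assuming a simple linear mean model with a scalar nuisance intercept; the model, the historical values, and the sample sizes are all invented for illustration:

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(2)
eta_true, theta_true, sigma = 1.0, 0.5, 1.0
n, m = 300, 1200
x = rng.integers(0, 2, n).astype(float)
y = rng.normal(eta_true + theta_true * x, sigma)

eta_hat = 1.02     # historical estimate of the intercept (invented)
Sigma_hat = 1.0    # its estimated asymptotic variance (invented)

# Augmented system in the spirit of (12): the score for eta is combined with the
# pseudo-score m * Sigma_hat^{-1} * (eta_hat - eta) contributed by the historical estimate.
def equations(par):
    theta, eta = par
    resid = y - eta - theta * x
    Psi = np.sum(x * resid)                               # score for theta
    Gamma_aug = np.sum(resid) + m * (eta_hat - eta) / Sigma_hat
    return [Psi, Gamma_aug]

theta_bar, eta_bar = fsolve(equations, [0.0, 0.0])
```

Because the historical pseudo-score enters with weight m, the solved intercept is pulled toward `eta_hat`, exactly the behavior that makes the one-step solution first-order equivalent to the two-step estimator.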

Another variant of Type II Problems occurs when the data \(\mathcal {D}\) is not available, but the estimates \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) from the current study, as well as their estimated variance matrix \(\tilde{\varvec{\varUpsilon }}\), are given. The objective is then to combine the current estimators (4) with the historical estimators (6). To this end we propose estimating \(\varvec{\theta }\) by

$$\begin{aligned} \bar{\varvec{\theta }}_{C} = \tilde{\varvec{\theta }} - \tilde{\varvec{\varUpsilon }}_{\varvec{\theta }\varvec{\eta }}\tilde{\varvec{\varUpsilon }}_{\varvec{\eta }\varvec{\eta }}^{-1}(\tilde{\varvec{\eta }}- \bar{\varvec{\eta }}_{C}) \end{aligned}$$
(13)

where \(\bar{\varvec{\eta }}_{C}=\bar{\varvec{\eta }}\) is given by (7). The estimators (7) as well as (13) are motivated by Lemma 2 and Remark 8 appearing in the Appendix.

Theorem 3.4

Let \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})^T\) be defined by (7) and (13). Suppose further that (4) and (6) hold and both \(\varvec{\varUpsilon }\) and \(\varvec{\varSigma }\) can be consistently estimated. Then as \(n \rightarrow \infty \) we have

$$\begin{aligned} \sqrt{n}(\bar{\varvec{\theta }}_{C}-\varvec{\theta }_{0},\bar{\varvec{\eta }}_{C}-\varvec{\eta }_{0})^T \Rightarrow \mathcal {N}_{p+q}(\varvec{0},\varvec{C}) \end{aligned}$$

where \(\varvec{C}=\varvec{M}\varvec{V}\varvec{M}^T\) with

$$\begin{aligned} \varvec{V} = \begin{pmatrix} \varvec{\varUpsilon } &{} \varvec{0} \\ \varvec{0} &{} \rho \varvec{\varSigma } \end{pmatrix} \quad \text {and} \quad \varvec{M}= \begin{pmatrix} \varvec{I} &{} -\varvec{R}\varvec{W}_{2} &{} \varvec{R}\varvec{W}_{2} \\ \varvec{0} &{} \varvec{W}_{1} &{} \varvec{W}_{2} \end{pmatrix}. \end{aligned}$$
(14)

The matrices \(\varvec{W}_1\) and \(\varvec{W}_2\) are defined in (8) and \(\varvec{R}=\varvec{\varUpsilon }_{\varvec{\theta }\varvec{\eta }}\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}^{-1}\). Moreover, we have:

$$\begin{aligned} \varvec{C}_{\varvec{\theta \theta }}&=\varvec{\varUpsilon }_{\varvec{\theta \theta }} - \varvec{\varUpsilon }_{\varvec{\theta \eta }} \varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1} \varvec{W}_2 \varvec{\varUpsilon }_{\varvec{\theta \eta }}^{T},\\ \varvec{C}_{\varvec{\theta \eta }}&=\varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{W}_1\\ \varvec{C}_{\varvec{\eta \eta }}&=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}. \end{aligned}$$

Theorem 3.4 describes the large sample behavior of the estimators (7) and (13). Further insight is facilitated by considering the simplest possible situation, i.e., when \((\theta ,\eta )\in \mathbb {R}^2\), in which case \(\sqrt{m}(\widehat{\eta }-\eta _{0})\Rightarrow \mathcal {N}(0,\sigma ^2)\) for the historical data, whereas for the current study \(\sqrt{n}(\tilde{\theta }-\theta _{0},\tilde{\eta }-\eta _{0})^{T} \Rightarrow \mathcal {N}(0,\varvec{\varUpsilon })\) where

$$\begin{aligned} \varvec{\varUpsilon }= \begin{pmatrix} \upsilon _{\theta \theta }^{2} &{} \upsilon _{\theta \eta } \\ \upsilon _{\theta \eta } &{} \upsilon _{\eta \eta }^{2} \end{pmatrix}. \end{aligned}$$

It is not hard to see that (13) reduces to \(\bar{\theta }=\tilde{\theta }-(\tilde{\upsilon }_{\theta \eta }/\tilde{\upsilon }_{\eta \eta }^2)(\tilde{\eta }-\bar{\eta })\) where \(\bar{\eta } = w_{1}^{*}\tilde{\eta }+w_{2}^{*}\widehat{\eta }\) with

$$\begin{aligned} w_{1}^{*} = \frac{n/\tilde{\upsilon }_{\eta \eta }^2}{n/\tilde{\upsilon }_{\eta \eta }^2+m/\widehat{\sigma }^2} \quad \text {and} \quad w_{2}^{*} = \frac{m/\widehat{\sigma }^2}{n/\tilde{\upsilon }_{\eta \eta }^2+m/\widehat{\sigma }^2}. \end{aligned}$$

Furthermore \(\varvec{C}_{\theta \theta }\) simplifies to

$$\begin{aligned} \upsilon _{\theta \theta }^2-\frac{\upsilon _{\theta \eta }^2}{\upsilon _{\eta \eta }^2}w_2 = \upsilon _{\theta \theta }^2(1-w_2r^2), \end{aligned}$$
(15)

where \(r=\upsilon _{\theta \eta }/(\upsilon _{\theta \theta }\upsilon _{\eta \eta })\) is the asymptotic correlation between \(\tilde{\theta }\) and \(\tilde{\eta }\) and

$$\begin{aligned} w_2 = \frac{ (1-\gamma )/\sigma ^{2} }{ \gamma /\upsilon _{\eta \eta }^2 + (1-\gamma )/\sigma ^{2} } \end{aligned}$$

is the limiting value of \(w_{2}^{*}\) as \(n/(n+m) \rightarrow \gamma \). It follows that the asymptotic relative efficiency of \(\bar{\theta }\) to \(\tilde{\theta }\) is \(1-w_2r^2\), which is at most unity (attained when \(\upsilon _{\theta \eta }=0\)) and no less than \(1-r^2\) (approached as \(\gamma \) tends to 0). Clearly, the historical estimates are useful only if the covariance \(\upsilon _{\theta \eta }\) is non-zero, and are most useful whenever \(w_2\) is close to unity. A similar but more involved analysis applies when the parameters are multidimensional.
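A back-of-the-envelope numerical version of this scalar analysis may be helpful. The sketch below uses made-up values for \(\varvec{\varUpsilon }\), \(\sigma ^2\) and the three estimates; only the formulas displayed above are used:

```python
import numpy as np

# made-up inputs for the scalar case (theta, eta real-valued)
n, m = 100, 400                            # current / historical sample sizes
ups_tt, ups_te, ups_ee = 4.0, 1.2, 2.0     # upsilon^2_{theta,theta}, upsilon_{theta,eta}, upsilon^2_{eta,eta}
sigma2 = 2.5                               # historical asymptotic variance of eta-hat
theta_t, eta_t, eta_h = 0.52, 1.10, 0.95   # tilde-theta, tilde-eta, hat-eta

# precision weights w1*, w2* and the combined estimators
w1 = (n / ups_ee) / (n / ups_ee + m / sigma2)
w2 = (m / sigma2) / (n / ups_ee + m / sigma2)
eta_bar = w1 * eta_t + w2 * eta_h
theta_bar = theta_t - (ups_te / ups_ee) * (eta_t - eta_bar)

# asymptotic relative efficiency 1 - w2 * r^2 from (15)
gamma = n / (n + m)
w2_lim = ((1 - gamma) / sigma2) / (gamma / ups_ee + (1 - gamma) / sigma2)
r = ups_te / np.sqrt(ups_tt * ups_ee)
are = 1 - w2_lim * r ** 2
print(theta_bar, eta_bar, are)
```

As the text indicates, the computed efficiency always falls between \(1-r^2\) and 1, and the weights \(w_{1}^{*}\) and \(w_{2}^{*}\) sum to one.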

We emphasize that the structure of the estimators \(\bar{\varvec{\eta }}_{C}\) and \(\bar{\varvec{\theta }}_{C}\), as well as the form of \(\varvec{C}\), are related to, but much more general than, results obtained in the literature on both double sampling and monotone missing normal data (Anderson 1957; Morrison 1971; Kanda and Fujikoshi 1998). Double sampling is a widely used technique in survey sampling, where the estimator is also known as the generalized regression estimator (Thompson 1997), as well as in other applications, cf. Davidov and Haitovsky (2000), Chen and Chen (2000) and the references therein. We also note that Eq. (15) generalizes the formulas obtained for the usual double sampling estimator (e.g., Tamhane 1978) where \(w_2 = m/(n+m)\). The following theorem substantially generalizes results obtained in the literature on both double sampling and monotone missing data.

Theorem 3.5

We have

$$\begin{aligned} \varvec{C} \preceq \varvec{\varUpsilon } \end{aligned}$$
(16)

Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{\varUpsilon }}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{\varUpsilon }}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) and \(\varvec{\varPhi }(\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) respectively.

In words, the estimator \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\), incorporating the historical estimates and derived by combining \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\), is more precise than \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\), the estimator based only on the current study.

Remark 6

It is also important to emphasize that in finite, typically small, samples the estimator \(\bar{\varvec{\theta }}_{C}\) may in fact be inferior to \(\tilde{\varvec{\theta }}\), i.e., its finite sample variance may exceed \(\varvec{\varUpsilon }_{\varvec{\theta \theta }}\). This typically occurs when the “regression matrix” \(\varvec{R}\), see the statement of Theorem 3.4, is poorly estimated. This feature has also been recognized in the double sampling literature (Tamhane 1978).

A little algebra shows that

$$\begin{aligned} \varvec{C}_{\varvec{\theta \theta }}&=\varvec{\varUpsilon }_{\varvec{\theta \theta }} - \varvec{\varUpsilon }_{\varvec{\theta \eta }} \varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1} (\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1} (\rho \varvec{\varSigma })^{-1} \varvec{\varUpsilon }_{\varvec{\theta \eta }}^{T},\\ \varvec{C}_{\varvec{\theta \eta }}&=\varvec{\varUpsilon }_{\varvec{\theta \eta }}\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\\ \varvec{C}_{\varvec{\eta \eta }}&=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}. \end{aligned}$$

Thus the dependence of \(\varvec{C}\) on the matrices \(\varvec{W}_1\) and \(\varvec{W}_2\) can be removed.
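Both representations of the blocks of \(\varvec{C}\), as well as the ordering \(\varvec{C} \preceq \varvec{\varUpsilon }\) asserted in Theorem 3.5, can be checked numerically. The sketch below assumes the limiting forms \(\varvec{W}_1 = \varvec{K}\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}\) and \(\varvec{W}_2 = \varvec{K}(\rho \varvec{\varSigma })^{-1}\) with \(\varvec{K}=(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\), consistent with (7):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, rho = 2, 3, 0.5

# random positive definite Upsilon ((p+q) x (p+q)) and Sigma (q x q)
G = rng.normal(size=(p + q, p + q)); Ups = G @ G.T + (p + q) * np.eye(p + q)
H = rng.normal(size=(q, q));         Sig = H @ H.T + q * np.eye(q)
Utt, Ute, Uee = Ups[:p, :p], Ups[:p, p:], Ups[p:, p:]

inv = np.linalg.inv
K = inv(inv(Uee) + inv(rho * Sig))
W1, W2 = K @ inv(Uee), K @ inv(rho * Sig)
R = Ute @ inv(Uee)

# C = M V M^T per (14), with V read as blockdiag(Upsilon, rho * Sigma)
M = np.block([[np.eye(p), -R @ W2, R @ W2],
              [np.zeros((q, p)), W1, W2]])
V = np.block([[Ups, np.zeros((p + q, q))],
              [np.zeros((q, p + q)), rho * Sig]])
C = M @ V @ M.T

# W-free block formulas for C
Ctt = Utt - Ute @ inv(Uee) @ K @ inv(rho * Sig) @ Ute.T
Cte = Ute @ inv(Uee) @ K
print(np.allclose(C[:p, :p], Ctt), np.allclose(C[:p, p:], Cte), np.allclose(C[p:, p:], K))
# Theorem 3.5: Upsilon - C is positive semidefinite
print(np.linalg.eigvalsh(Ups - C).min() > -1e-8)
```

The positive semidefiniteness check also follows analytically, since a little algebra gives \(\varvec{\varUpsilon }-\varvec{C} = \varvec{X}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}+\rho \varvec{\varSigma })^{-1}\varvec{X}^T\) with \(\varvec{X}^T = (\varvec{\varUpsilon }_{\varvec{\theta \eta }}^T, \varvec{\varUpsilon }_{\varvec{\eta \eta }})\).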

Clearly, whenever the data \(\mathcal {D}\) is available both \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) can be calculated, where \(\bar{\varvec{\eta }}_{B}=\bar{\varvec{\eta }}_{C}\) are given in (7). Recall that \(\bar{\varvec{\theta }}_{B}\) solves \(\varvec{\varPsi }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}\) where \(\varvec{\varPsi }(\varvec{\theta },\varvec{\eta })=n^{-1}\sum _{i=1}^{n} \varvec{\psi }(\varvec{\theta },\varvec{\eta },\varvec{Y}_i)\). Similarly, we can view \(\bar{\varvec{\theta }}_{C}\) as a solution to an estimating equation \(\varvec{\varLambda }(\varvec{\theta },\bar{\varvec{\eta }})=\varvec{0}\) where \(\varvec{\varLambda }(\varvec{\theta },\varvec{\eta })=n^{-1}\sum _{i=1}^{n} \varvec{\lambda }(\varvec{\theta },\varvec{\eta },\varvec{Y}_i)\). The form of \(\varvec{\varLambda }\) can be easily deduced from Lemma 2, and that of \(\varvec{\lambda }\) by plugging the influence functions for \(\tilde{\varvec{\theta }}\) and \(\tilde{\varvec{\eta }}\) into \(\varvec{\varLambda }\). In fact, the precise form of the influence function of \(\bar{\varvec{\theta }}_{C}\) is readily derived; for more details see Remark 9 appearing in the Appendix. It is worth noting that \(\varvec{\varPsi }\) operates on the full data \(\mathcal {D}\) whereas \(\varvec{\varLambda }\) operates on functions thereof, namely the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\). Thus \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) can be viewed as a function of a coarsening of the data \(\mathcal {D}\) and is therefore expected to be less efficient than \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\). This is indeed the case under mild regularity conditions. A formal statement requires the introduction of some additional notation. 
Let \(\varvec{h}=\varvec{h}(\varvec{\theta },\varvec{\eta },\varvec{Y})\) denote any estimating function and denote \(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{h}) = \mathbb {E}_{0}(\partial \varvec{h}/\partial \varvec{\theta })\) and \(\varvec{D}_{\varvec{\eta }_{0}}(\varvec{h}) = \mathbb {E}_{0}(\partial \varvec{h}/\partial \varvec{\eta })\). Note that earlier we referred to \(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\psi })\) and \(\varvec{D}_{\varvec{\eta }_{0}}(\varvec{\psi })\) simply as \(\varvec{D}_{\varvec{\theta }_{0}}\) and \(\varvec{D}_{\varvec{\eta }_{0}}\). Now:

Theorem 3.6

Suppose that both \(\bar{\varvec{\omega }}_{B}\) and \(\bar{\varvec{\omega }}_{C}\) can be obtained. If

$$\begin{aligned} \varvec{D}_{\varvec{\theta }_{0}}(\varvec{\psi })^{-1}\varvec{D}_{\varvec{\eta }_{0}}(\varvec{\psi }) \le \varvec{D}_{\varvec{\theta }_{0}}(\varvec{\lambda })^{-1} \varvec{D}_{\varvec{\eta }_{0}}(\varvec{\lambda }) \end{aligned}$$
(17)

component-wise and

$$\begin{aligned} (\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\psi })^{-1})\mathbb {E}_{0}(\varvec{\psi \psi }^T)(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\psi })^{-1})^T \preceq (\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\lambda })^{-1})\mathbb {E}_{0}(\varvec{\lambda \lambda }^T)(\varvec{D}_{\varvec{\theta }_{0}}(\varvec{\lambda })^{-1})^T \end{aligned}$$
(18)

in the Loewner order, then

$$\begin{aligned} \varvec{B} \preceq \varvec{C}. \end{aligned}$$
(19)

Moreover, for any function \(\varvec{\varPhi }\) we have \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }} \preceq \varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) where \(\varvec{V}_{\varvec{B}}^{\varvec{\varPhi }}\) and \(\varvec{V}_{\varvec{C}}^{\varvec{\varPhi }}\) are the asymptotic variances of \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) and \(\varvec{\varPhi }(\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) respectively.

Condition (18) holds when the estimating equation \(\varvec{\varPsi }(\varvec{\theta },\varvec{\eta }_0)=\varvec{0}\) results in more efficient estimators for \(\varvec{\theta }\) than those resulting from \(\varvec{\varLambda }(\varvec{\theta },\varvec{\eta }_0)=\varvec{0}\) when \(\varvec{\eta }=\varvec{\eta }_{0}\) is set to its true value. This condition holds for any sensible choice of \(\varvec{\varPsi }\); in particular it holds for the score equations associated with maximum likelihood estimation. Condition (17) roughly means that \(\varvec{\varPsi }\) is less sensitive than \(\varvec{\varLambda }\) to small perturbations in both \(\varvec{\theta }\) and \(\varvec{\eta }\). Conditions (17) and (18) are not necessary. For example, the conclusion of Theorem 3.6 may hold if \(\varvec{\psi }\) is more sensitive to small perturbations but at the same time much more efficient. We believe that the aforementioned conditions hold broadly and that the estimators \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\), described in Theorem 3.4, are generally less efficient than \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\), described in Theorem 3.2. For an additional discussion see Remark 9 in the Appendix. There are, however, situations in which \(\varvec{B}=\varvec{C}\) and situations where \(\bar{\varvec{\omega }}_{B} =\bar{\varvec{\omega }}_{C}\) for any data \(\mathcal {D}\). As we shall see in the next section this is the case in normal linear models in which the estimators \((\tilde{\varvec{\theta }},\tilde{\varvec{\eta }})\) and \(\widehat{\varvec{\eta }}\) are actually sufficient statistics. Finally, it is worth noting that if \(\varvec{\varUpsilon }_{\varvec{\theta \eta }} = \varvec{0}\) then the estimator \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) does not improve on \(\tilde{\varvec{\theta }}\), whereas there is always an improvement when the full data \(\mathcal {D}\) is available.

Remark 7

To summarize, the proposed estimators are designed to extract as much information as possible from the data. Recall that in Type I Problems the parameter \(\varvec{\eta }\) is not estimable given the current data \(\mathcal {D}\). The pair \((\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})\) solves the system of equations \(\varvec{\varPsi }(\bar{\varvec{\theta }}_{A},\bar{\varvec{\eta }}_{A})=\varvec{0}\) with \(\bar{\varvec{\eta }}_{A}=\widehat{\varvec{\eta }}\) where \(\widehat{\varvec{\eta }}\) is the historical estimator. In Type II Problems the pair \((\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})\) solves the system of equations \(\varvec{\varPsi }(\bar{\varvec{\theta }}_{B},\bar{\varvec{\eta }}_{B})=\varvec{0}\) where \(\bar{\varvec{\eta }}_{B}\), given by (7), is a weighted combination of \(\widehat{\varvec{\eta }}\) and \(\tilde{\varvec{\eta }}\). Moreover, as noted earlier, it can be shown that the resulting estimators are asymptotically efficient. Our approach to Type I Problems and to the first variant of Type II Problems is of similar structure: plug into \(\varvec{\varPsi }\) the best available estimator of \(\varvec{\eta }\). In the second variant of Type II Problems the pair \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) solves a different system of estimating equations which we denote by \(\varvec{\varLambda }(\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})=\varvec{0}\). These equations are the likelihood equations given in Lemma 2 assuming that the variance matrices \(\varvec{\varSigma }\) and \(\varvec{\varUpsilon }\) are fully known. Since \((\bar{\varvec{\theta }}_{C},\bar{\varvec{\eta }}_{C})\) is based on a further coarsening of the data it is generally less efficient.

4 Illustrations, applications and numerical results

In this section two applications are discussed in detail. In Sect. 4.1 the two-way ANOVA problem introduced in Sect. 2 is investigated. In particular, various design options for the current study \(\mathcal {S}\) are evaluated. It is worth noting that although the aforementioned ANOVA problem is among the simplest possible, its analysis is far from trivial. Next, in Sect. 4.2 we discuss the use of historical estimates in the design of drug interaction studies in the context of Bliss independence. A simple algorithm for the design of such studies is proposed.

4.1 Two-way ANOVA

Recall the ANOVA model of Sect. 2 where the studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) were designed to estimate \(\varvec{\eta }_{1}=(\eta _0,\eta _1)^T\) and \(\varvec{\eta }_{2}=(\eta _0,\eta _2)^T\), respectively. Note that the parameter \(\eta _0\) is estimated in both studies so \(\varvec{\eta }_{1}\) and \(\varvec{\eta }_{2}\) are not distinct. Therefore employing any of the aforementioned findings requires the aggregation of the historical estimates as if they came from a single experiment. The historical studies result in the estimates \((\widehat{\eta }_{0}(\mathcal {S}_{1}),\widehat{\eta }_{1}(\mathcal {S}_{1}))\) and \((\widehat{\eta }_{0}(\mathcal {S}_{2}),\widehat{\eta }_{2}(\mathcal {S}_{2}))\) as well as their standard errors. Given these we can easily back-calculate the unobserved means and sample sizes in the studies \(\mathcal {S}_1\) and \(\mathcal {S}_2\) and estimate \((\eta _0,\eta _1,\eta _2)\) by:

$$\begin{aligned} \widehat{\eta }_{1} = \bar{Y}_1(\mathcal {S}_1) - \widehat{\eta }_{0}, \quad \widehat{\eta }_{2} = \bar{Y}_2(\mathcal {S}_2) - \widehat{\eta }_{0}, \quad \text{ and } \quad \widehat{\eta }_{0} = \frac{ m_{1,0}\bar{Y}_0(\mathcal {S}_1) + m_{2,0}\bar{Y}_0(\mathcal {S}_2) }{m_{1,0}+m_{2,0}}, \end{aligned}$$
(20)

where the quantity \(\bar{Y}_j(\mathcal {S}_i)\) is the average response on treatment \(j\in \{0,1,2\}\) in study \(i\in \{1,2\}\) and \(m_{i,j}\) is the size of treatment group j in study i. Under the usual conditions

$$\begin{aligned} \sqrt{m}(\widehat{\eta }_{0}-{\eta }_{0},\widehat{\eta }_{1}-{\eta }_{1}, \widehat{\eta }_{2}-{\eta }_{2})^T \Rightarrow \mathcal {N}(\varvec{0},\varvec{\varSigma }), \end{aligned}$$

for some matrix \(\varvec{\varSigma }\). Furthermore, if (1) is a homoscedastic model with variance \(\sigma ^2\) and \(m_{1,0}=m_{1,1}=m_{2,0}=m_{2,2}\), i.e., the studies \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\) are balanced and of the same size, then it is easy to see that

$$\begin{aligned} \varvec{\varSigma } =\sigma ^2 \begin{pmatrix} 2 &{} -2 &{} -2 \\ -2 &{} 6 &{} 2 \\ -2 &{} 2 &{} 6 \end{pmatrix}. \end{aligned}$$
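Because (20) is linear in the four historical group means, the matrix \(\varvec{\varSigma }\) displayed above can be verified directly rather than by simulation. A minimal sketch, assuming as above two balanced studies of equal size (each group of size m/4) and taking \(\sigma ^2=1\):

```python
import numpy as np

sigma2 = 1.0

# historical group means g = (Ybar0(S1), Ybar1(S1), Ybar0(S2), Ybar2(S2)),
# each based on m/4 observations, so m * Cov(g) = 4 * sigma^2 * I.
# The estimator (20) is eta-hat = L @ g with:
L = np.array([[ 0.5, 0.0,  0.5, 0.0],    # eta0 = (g0 + g2) / 2
              [-0.5, 1.0, -0.5, 0.0],    # eta1 = g1 - eta0
              [-0.5, 0.0, -0.5, 1.0]])   # eta2 = g3 - eta0

Sigma = L @ (4 * sigma2 * np.eye(4)) @ L.T   # = m * Cov(eta-hat)
print(Sigma)
print(Sigma.sum())   # = 10, the constant in the inflation factor derived below
```

The sum of the entries of \(\varvec{\varSigma }\), namely 10, is exactly the constant appearing in the variance inflation factor obtained below for the Type I design.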

We will now investigate various designs for a new study \(\mathcal {S}\). When the primary focus of \(\mathcal {S}\) is inference on \(\theta \) then it may be advantageous in some circumstances to allocate all n observations to the treatment arm receiving both treatments one and two, i.e., \(T_{1}=T_{2}=1\) for all observations. This is clearly a Type I problem since \(\varvec{\omega }\) is not identifiable from \(\mathcal {D}\) but given \(\varvec{\eta }\) the parameter \(\theta \) is estimable. Note that an unbiased estimator for \(\theta \) is

$$\begin{aligned} \bar{\theta }_{A} =\bar{Y}_{12}(\mathcal {S})-(\widehat{\eta }_{0}+\widehat{\eta }_{1}+\widehat{\eta }_{2}) \end{aligned}$$
(21)

and it is not hard to see that (21) solves (2) when \(\psi (\theta ,\eta _0,\eta _1,\eta _2,Y_i)=Y_i-\eta _0-\eta _1-\eta _2-\theta \). Thus, \(\varvec{\varSigma }_{\psi } = \sigma ^2\), \(\varvec{D}_{\theta _{0}} = 1\) and \(\varvec{D}_{\eta _{0}}= -(1,1,1)\) and it follows that \(\varvec{A}_{\theta \theta }\), the asymptotic variance of (21) as described in Theorem 3.1, reduces to

$$\begin{aligned} \sigma ^2 \times (1+10\rho ) \quad \text{ where } \quad \rho =\lim \frac{n}{m}. \end{aligned}$$

The second term appearing in the parentheses in the above display is an inflation factor, i.e., the price to pay for substituting estimates for the unknown value of \((\eta _0,\eta _1,\eta _2)\). Note that when \(n/m \rightarrow 0\) as both \(m\rightarrow \infty \) and \(n\rightarrow \infty \) the asymptotic variance of \(\bar{\theta }_{A}\) approaches \(\sigma ^2\). In practice this requires a large current study and even larger historical data. Incidentally, since \(\bar{\theta }_{A}\) is a linear function of \(\bar{Y}_{12}(\mathcal {S})\) and \((\widehat{\eta }_{0},\widehat{\eta }_{1},\widehat{\eta }_{2})\) it is not hard to see that its exact variance is \(\sigma ^2(1/n+10/m)\) which coincides with the asymptotic form.

Alternatively, suppose that the study \(\mathcal {S}\) allocates n/4 observations to each of the four treatment combinations. In this case the data \(\mathcal {D}\) identifies \(\varvec{\omega } = (\theta ,\eta _0,\eta _1,\eta _2)^T\), so this is a Type II problem. The usual estimators for this design are \(\tilde{\eta }_{0} =\bar{Y}_{0}(\mathcal {S})\), \(\tilde{\eta }_{1} =\bar{Y}_{1}(\mathcal {S})-\bar{Y}_{0}(\mathcal {S})\), \(\tilde{\eta }_{2}=\bar{Y}_{2}(\mathcal {S})-\bar{Y}_{0}(\mathcal {S})\) and

$$\begin{aligned} \tilde{\theta } =\bar{Y}_{12}(\mathcal {S})-(\bar{Y}_{1}(\mathcal {S})+\bar{Y}_{2}(\mathcal {S}))+\bar{Y}_{0}(\mathcal {S}) \end{aligned}$$

and thus the limiting variance of \((\tilde{\theta },\tilde{\varvec{\eta }})^{T}\) is

$$\begin{aligned} \varvec{\varUpsilon } =\sigma ^2 \begin{pmatrix} 16 &{} 4 &{} -8 &{} -8 \\ 4 &{} 4 &{} -4 &{} -4 \\ -8 &{} -4 &{} 8 &{} 4 \\ -8 &{} -4 &{} 4 &{} 8 \end{pmatrix}. \end{aligned}$$

Next we aggregate the historical and current estimates for \(\varvec{\eta }\). As in Sect. 3 we estimate \(\varvec{\eta }\) by \(\bar{\varvec{\eta }} = \varvec{W_1}\tilde{\varvec{\eta }}+\varvec{W_2} \widehat{\varvec{\eta }}\) where

$$\begin{aligned} \varvec{W_1} = (n{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1}+m{\varvec{\varSigma }}^{-1})^{-1} n{\varvec{\varUpsilon }}_{\varvec{\eta \eta }}^{-1} \quad \text{ and } \quad \varvec{W_2} = (n\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+m\varvec{\varSigma }^{-1})^{-1} m\varvec{\varSigma }^{-1}. \end{aligned}$$

Note that the weight matrices are functions of the variances \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\) and \(\varvec{\varSigma }\) as well as of the ratio \(n/(n+m)\). Since \(\mathcal {D}\) is fully available to us, we can estimate \(\theta \) by

$$\begin{aligned} \bar{\theta }_{B} =\bar{Y}_{12}(\mathcal {S})-(\bar{\eta }_{0}+\bar{\eta }_{1}+\bar{\eta }_{2}). \end{aligned}$$
(22)

Note that the estimators (21) and (22) are of the same functional form. Further note that the statistic \(\bar{Y}_{12}(\mathcal {S})\) in (22) is a function of the \(n_{12}\) observations \(Y_1,\ldots ,Y_{n_{12}}\) receiving the treatment combination \(T_1=T_2=1\). A straightforward calculation shows that \(\varvec{B}_{\theta \theta }\) is given by

$$\begin{aligned} \sigma ^2 \times (\xi _{11}^{-1}+\varvec{1}^{T}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{1}) \quad \text{ where } \quad \rho = \lim \frac{n}{m}, \end{aligned}$$

where \(\xi _{11}\) is the fraction of the observations which are assigned to receive both treatments. In situations where the full data is not available to us but \((\tilde{\theta },\tilde{\eta }_0,\tilde{\eta }_1,\tilde{\eta }_2)\) are known we may estimate \(\theta \) by \( \bar{\theta }_{C} =\tilde{\theta }-\varvec{\varUpsilon }_{\theta \varvec{\eta }}\varvec{\varUpsilon }_{\varvec{\eta }\varvec{\eta }}^{-1}(\tilde{\varvec{\eta }}-\bar{\varvec{\eta }})\). It can be verified that in this application, in which a normal linear model is involved and all estimators are functions of sufficient statistics, the estimators \(\bar{\theta }_{B}\) and \(\bar{\theta }_{C}\) coincide. Therefore \(\bar{\theta }_{C}\) is not discussed any further.

Table 1 provides a comparison of the asymptotic variances of (21) and (22) for a range of values of m and n. Table 1 displays asymptotic variances; the variances themselves are obtained by dividing any entry in the table by the size of the current study in the relevant row. Observe that both \(\varvec{A}_{\theta \theta }\) and \(\varvec{B}_{\theta \theta }\) decrease as a function of m for any fixed value of n and increase in n for any fixed m. For example, when \(n=m=100\) then \(\varvec{A}_{\theta \theta }=11\) and \(\varvec{B}_{\theta \theta }=9.3\); when \(m=100\) and \(n=5000\) then \(\varvec{A}_{\theta \theta }=501\) and \(\varvec{B}_{\theta \theta }=15.69\); and when \(m=5000\) and \(n=100\) then \(\varvec{A}_{\theta \theta }=1.2\) and \(\varvec{B}_{\theta \theta }=4.2\). Thus, going down the first column of Table 1, the asymptotic variance \(\varvec{A}_{\theta \theta }\) increases by a factor of approximately 45 whereas that of \(\varvec{B}_{\theta \theta }\) increases by the much more modest factor of 1.4. Similarly, going across the first row, the asymptotic variances \(\varvec{A}_{\theta \theta }\) and \(\varvec{B}_{\theta \theta }\) are reduced by factors of 9.2 and 2.2, respectively. Each pair (n, m) provides a direct comparison between the two proposed designs (design A, say, in which all experimental units in the current study receive both treatments, and design B, say, which is a balanced design). Clearly, design A seems preferable in situations where m is much larger than n; otherwise design B is to be preferred.
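The Table 1 entries quoted above can be reproduced directly from the variance formulas \(\varvec{A}_{\theta \theta }=\sigma ^2(1+10\rho )\) and \(\varvec{B}_{\theta \theta }=\sigma ^2(4+\varvec{1}^{T}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{1})\) for the balanced design. A short sketch of our own, taking \(\sigma ^2=1\):

```python
import numpy as np

sigma2 = 1.0
Sig = np.array([[2., -2., -2.], [-2., 6., 2.], [-2., 2., 6.]])   # historical variance
Uee = np.array([[4., -4., -4.], [-4., 8., 4.], [-4., 4., 8.]])   # eta-block of Upsilon
one = np.ones(3)

def A_var(rho):
    # Type I design: all current observations on the combination arm
    return sigma2 * (1 + 10 * rho)

def B_var(rho):
    # Type II balanced design: xi_11 = 1/4, so the leading term is 4
    K = np.linalg.inv(np.linalg.inv(Uee) + np.linalg.inv(rho * Sig))
    return sigma2 * (4 + one @ K @ one)

for n, m in [(100, 100), (5000, 100), (100, 5000)]:
    print(n, m, A_var(n / m), B_var(n / m))
```

After rounding, the printed pairs match the entries quoted from Table 1: (11, 9.3), (501, 15.69) and (1.2, 4.2).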

Table 1 Asymptotic variances for \(\theta \) for Type I and Type II Problems (with a balanced design) as function of the sizes of the historical and current studies

We now look deeper into the question of optimal design. Suppose for simplicity that the historical sample satisfies \(m_{1,0}=m_{1,1}=m_{2,0}=m_{2,2}\). Note that Table 1 considers only designs with \(\varvec{\xi }=(0,0,0,1)\) and \(\varvec{\xi }=(1/4,1/4,1/4,1/4)\). Therefore we next consider designs for the study \(\mathcal {S}\) where \(\varvec{\xi }=(\xi _{00},\xi _{10},\xi _{01},\xi _{11})\) is any value in the unit simplex. Here \(\xi _{ij}\) denotes the proportion of observations that received the treatment combination (i, j) where i and j are in \(\{0,1\}\). It is not hard to see that the optimal design in the interior of the simplex, i.e., for a Type II Problem, is obtained, by Theorem 3.2, by minimizing \(\xi _{11}^{-1}+\varvec{1}^{T}(\varvec{\varUpsilon }_{\varvec{\eta \eta }}^{-1}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{1}\) where

$$\begin{aligned} \varvec{\varUpsilon } =\sigma ^2 \begin{pmatrix} \frac{1}{\xi _{00}}+\frac{1}{\xi _{10}}+\frac{1}{\xi _{01}}+\frac{1}{\xi _{11}} &{} \frac{1}{\xi _{00}} &{} -\left( \frac{1}{\xi _{00}}+\frac{1}{\xi _{10}}\right) &{} -\left( \frac{1}{\xi _{00}}+\frac{1}{\xi _{01}}\right) \\ \frac{1}{\xi _{00}} &{} \frac{1}{\xi _{00}} &{} -\frac{1}{\xi _{00}} &{} -\frac{1}{\xi _{00}} \\ -\left( \frac{1}{\xi _{00}}+\frac{1}{\xi _{10}}\right) &{} -\frac{1}{\xi _{00}} &{} \frac{1}{\xi _{00}}+\frac{1}{\xi _{10}} &{} \frac{1}{\xi _{00}} \\ -\left( \frac{1}{\xi _{00}}+\frac{1}{\xi _{01}}\right) &{} -\frac{1}{\xi _{00}} &{} \frac{1}{\xi _{00}} &{} \frac{1}{\xi _{00}}+\frac{1}{\xi _{01}} \end{pmatrix}. \end{aligned}$$

Symmetry considerations imply that under optimality \(\xi _{01}=\xi _{10}\), and since \(\xi _{00}=1-2\xi _{10}-\xi _{11}\) the minimization involves only a two-dimensional search. Table 2 provides the optimal design, i.e., the vector \(\varvec{\xi }\), for estimating \(\theta \) for various values of the ratio \(\rho = n/m\), found by a grid search with step size 0.001 and the restriction that \(\xi _{00} \ge 0.02\). This restriction is necessary; otherwise the matrix \(\varvec{\varUpsilon }\) can not be inverted.
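A coarse version of this grid search (step size 0.01 rather than 0.001, exploiting the symmetry \(\xi _{01}=\xi _{10}\)) can be sketched as follows; it is our own simplification, with the \(\varvec{\eta }\)-block of \(\varvec{\varUpsilon }\) written as a function of the design proportions:

```python
import numpy as np

sigma2 = 1.0
Sig = np.array([[2., -2., -2.], [-2., 6., 2.], [-2., 2., 6.]])
one = np.ones(3)

def Uee(x00, x10, x01):
    # eta-block of Upsilon as a function of the design proportions
    return sigma2 * np.array([
        [ 1 / x00,           -1 / x00,            -1 / x00],
        [-1 / x00,  1 / x00 + 1 / x10,             1 / x00],
        [-1 / x00,            1 / x00,  1 / x00 + 1 / x01]])

def obj(x00, x10, x01, x11, rho):
    # asymptotic variance of theta-bar for a Type II design, per Theorem 3.2
    K = np.linalg.inv(np.linalg.inv(Uee(x00, x10, x01)) + np.linalg.inv(rho * Sig))
    return sigma2 * (1 / x11 + one @ K @ one)

def grid_search(rho, step=0.01):
    # integer grid avoids floating-point drift in the simplex constraint
    best, N = (np.inf, None), int(round(1 / step))
    for i10 in range(1, N // 2):               # xi01 = xi10 by symmetry
        for i11 in range(1, N):
            i00 = N - 2 * i10 - i11
            x00, x10, x11 = i00 * step, i10 * step, i11 * step
            if x00 < 0.02 - 1e-12:             # keep Upsilon invertible
                continue
            v = obj(x00, x10, x10, x11, rho)
            if v < best[0]:
                best = (v, (x00, x10, x10, x11))
    return best

val, xi = grid_search(1.0)
print(val, xi)
```

For \(\rho =1\) even this coarse grid recovers a minimum near the value 8.0 reported in Table 2.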

Table 2 Optimal design for Type II problems as a function of the sampling ratio \(\rho \)

Table 2 provides the optimal design for estimating \(\theta \) as a function of the sampling ratio. Note that for large \(\rho \), i.e., when n is larger than m, we find that \(\xi _{01}=\xi _{10}=1/4\) and that the difference between \(\xi _{11}\) and \(\xi _{00}\) decreases in \(\rho \). We believe that the balanced design is optimal when \(\rho \rightarrow \infty \). It is also clear that for large \(\rho \) the designs appearing in Table 2 are generally superior to those in Table 1. For example, when \(\rho =1\) the asymptotic variances in Table 1 are 11.0 and 9.3 whereas the corresponding optimal asymptotic variance given in Table 2 is 8.0. However, for values of \(\rho \) smaller than 1/4, i.e., when n is relatively small compared to m, the optimal design sets \(\xi _{00}=0.02\) and \(\xi _{01}=\xi _{10}=0.001\), which are the smallest possible values allowed by our search algorithm. This suggests that further, at most minor, improvements are possible by setting \(\xi _{00}=0\) and/or \(\xi _{01}=\xi _{10}=0\). Clearly, when \(\xi _{01}=\xi _{10}=\xi _{00}=0\) we have a Type I Problem.

Therefore we next consider the situation where \(\xi _{00}=0\) and \(\xi _{01}=\xi _{10}>0\), in which case the current study comprises three arms and thus three group means: \(\bar{Y}_{1}(\mathcal {S})\), \(\bar{Y}_{2}(\mathcal {S})\) and \(\bar{Y}_{12}(\mathcal {S})\). We emphasize that this estimation problem is neither a Type I nor a Type II problem. Further observe that with these data alone we can not estimate \(\varvec{\omega }\). Nevertheless, the pair \((\bar{Y}_{1}(\mathcal {S}),\bar{Y}_{2}(\mathcal {S}))^{T}\), whose mean is \((\eta _{0}+\eta _{1},\eta _{0}+\eta _{2})\), can be aggregated with \(\widehat{\varvec{\eta }}\), the historical estimate of \(\varvec{\eta }\). By an appropriate modification of Lemma 2 it can be shown that \(\varvec{\eta }\) can be estimated by

$$\begin{aligned} \varvec{\eta }^{\dagger }=(n\varvec{A}^T\varvec{V}^{-1}\varvec{A}+m\varvec{\varSigma }^{-1})^{-1}(n\varvec{A}^T\varvec{V}^{-1}\varvec{S}+m\varvec{\varSigma }^{-1}\widehat{\varvec{\eta }}) \end{aligned}$$
(23)

where \(\varvec{S}=(\bar{Y}_{1}(\mathcal {S}),\bar{Y}_{2}(\mathcal {S}))^{T}\), \(\varvec{V} = \sigma ^{2}\textrm{diag}(\xi _{01}^{-1},\xi _{10}^{-1})\) is its asymptotic variance and

$$\begin{aligned} \varvec{A} = \begin{pmatrix} 1 &{} 1 &{} 0 \\ 1 &{} 0 &{} 1 \end{pmatrix} \end{aligned}$$

is the matrix which satisfies \(\mathbb {E}(\varvec{S})=\varvec{A\eta }\). Note that (23) is of the same form as (7) but with \(\varvec{A^{T}V^{-1}A}\) instead of \(\varvec{\varUpsilon }_{\varvec{\eta \eta }}\). Now, let \(\bar{\theta }_{D}\) denote the solution to \(\varvec{\varPsi }(\theta ,\varvec{\eta }^{\dagger })=0\) which is nothing but

$$\begin{aligned} \bar{\theta }_{D} =\bar{Y}_{12}(\mathcal {S})-(\eta _{0}^{\dagger }+\eta _{1}^{\dagger }+\eta _{2}^{\dagger }). \end{aligned}$$
(24)

A straightforward calculation shows that the asymptotic variance of \(\bar{\theta }_{D}\) is given by

$$\begin{aligned} \sigma ^2 \times (\xi _{11}^{-1}+\varvec{1}^{T}(\varvec{A^{T}V^{-1}A}+(\rho \varvec{\varSigma })^{-1})^{-1}\varvec{1}) \quad \text{ where } \quad \rho =\lim \frac{n}{m}. \end{aligned}$$

The formula above is useful in finding the optimal design for small values of \(\rho \) when \(\xi _{00}=0\). For example, when \(\rho =1/8\) the design \(\varvec{\xi } = (0,0.0005,0.0005, 0.9990)\) results in a variance of 2.25 (more precisely, 2.250751), which is slightly smaller than 2.27, the variance reported in the first row of Table 2. Finally, we note that when \(\rho =1/8\) then \(A_{\theta \theta }\) equals (precisely) 2.25, which means that in this application a Type I design would be the most effective.
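The computation just described can be replicated numerically; the following sketch (with \(\sigma ^2=1\)) evaluates the variance of \(\bar{\theta }_{D}\) at the design \(\varvec{\xi } = (0,0.0005,0.0005,0.9990)\) together with the corresponding Type I variance:

```python
import numpy as np

sigma2 = 1.0
Sig = np.array([[2., -2., -2.], [-2., 6., 2.], [-2., 2., 6.]])
A = np.array([[1., 1., 0.], [1., 0., 1.]])   # satisfies E(S) = A @ eta
one = np.ones(3)

def D_var(xi01, xi10, xi11, rho):
    # asymptotic variance of theta-bar-D for the three-arm design
    Vinv = np.diag([xi01, xi10]) / sigma2    # inverse of V = sigma^2 diag(1/xi01, 1/xi10)
    K = np.linalg.inv(A.T @ Vinv @ A + np.linalg.inv(rho * Sig))
    return sigma2 * (1 / xi11 + one @ K @ one)

rho = 1 / 8
print(D_var(0.0005, 0.0005, 0.9990, rho))    # approximately 2.250751
print(sigma2 * (1 + 10 * rho))               # Type I variance: 2.25
```

As the two print statements show, the three-arm design at this \(\rho \) offers essentially no improvement over the Type I design.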

As noted by a referee, in addition to the three-arm trial above, one could choose various two-arm designs for \(\mathcal {S}\). For example, one can choose a design for which \(\xi _{00}>0\), \(\xi _{11}>0\) and \(\xi _{01}=\xi _{10}=0\), or alternatively a design for which \(\xi _{01}>0\), \(\xi _{11}>0\) and \(\xi _{00}=\xi _{10}=0\). It can be shown, however, that for small values of \(\rho \), where such designs are of interest, these two-arm designs are not superior to a Type I design.

4.2 Using historical estimates in drug interaction studies

This subsection deals with the optimal design of drug interaction studies. Consider two drugs \(\varvec{D}_{1}\) and \(\varvec{D}_2\) with no-effect probabilities \(\eta _1\) and \(\eta _2\), respectively, and let \(\theta \) denote the no-effect probability when both drugs are administered together. The drugs are called Bliss independent, see Bliss (1939) and Liu et al. (2018), if

$$\begin{aligned} \theta =\eta _{1}\eta _{2}. \end{aligned}$$
(25)

If (25) does not hold and \(\theta <\eta _{1}\eta _{2}\) there is synergy between the drugs; otherwise there is antagonism. The concept of Bliss independence has seen a recent resurgence of interest as the need to assess the benefit of combination therapies and drug–drug interactions has increased. Some recent references are Pallmann and Schaarschmidt (2016), Palmer and Sorger (2017), Russ and Kishony (2018) and Niu et al. (2019). Drug interaction studies are often carried out as single-dose experiments, e.g., Ansari et al. (2008), where the interaction is assessed by considering a single dose of each of the two drugs. A more elaborate design, which we will not consider here, assesses multiple drugs and doses using response surface methodology as in Lee (2010).

Naturally, the quantity of interest in drug interaction studies is

$$\begin{aligned} \varPhi (\theta ,\varvec{\eta })=\log (\theta )-\log (\eta _{1})-\log (\eta _{2}). \end{aligned}$$
(26)

The formulation in (26) links the problem discussed here to the ANOVA setup considered earlier. In many applications of single-dose interaction tests, whether using historical data or not, an explicit or implicit asymptotic argument is used, and the theoretical results for the asymptotic case presented above are relevant. For example, Demidenko and Miller (2019) describe a Daphnia acute test with two stressors, single doses of CuSO4 and of NiCl, where the numbers of surviving organisms in water were counted after 48 hours. Only the surviving fractions of organisms were reported, without their original numbers, thus, essentially, assuming the original numbers were very large, i.e., applying an asymptotic argument. However, as pointed out by Pallmann and Schaarschmidt (2016), in single-dose experiments a correct statistical analysis should rely on the observed frequencies, not on the observed rates of success or failure. Therefore the sample sizes used in each arm of the experiment are of crucial importance, and in this subsection we provide finite-sample results.
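To make the role of the arm sizes concrete, the following Python sketch, with made-up counts, computes the plug-in estimate of the interaction index (26) and its delta-method standard error; the variance expression is the standard large-sample one for log-transformed binomial proportions, which reappears below as the design criterion:

```python
from math import log, sqrt

def interaction_index(x12, n12, x1, n1, x2, n2):
    """Plug-in estimate of Phi = log(theta) - log(eta1) - log(eta2)
    and its delta-method standard error, from observed no-effect
    counts x and arm sizes n (counts assumed strictly between 0 and n)."""
    theta, eta1, eta2 = x12 / n12, x1 / n1, x2 / n2
    phi = log(theta) - log(eta1) - log(eta2)
    var = ((1 - theta) / (n12 * theta)
           + (1 - eta1) / (n1 * eta1)
           + (1 - eta2) / (n2 * eta2))
    return phi, sqrt(var)

# Hypothetical data: 28/50 unaffected under the combination,
# 35/50 under drug 1 alone and 40/50 under drug 2 alone.
phi_hat, se = interaction_index(28, 50, 35, 50, 40, 50)
# Here 0.56 = 0.7 * 0.8, so phi_hat is essentially 0:
# the data are consistent with Bliss independence.
```

Note that the counts and arm sizes enter the standard error directly, which is precisely why the fractions alone do not suffice.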

For simplicity suppose that there exist historical estimates of \(\eta _{1}\) and \(\eta _{2}\) based on independent binomial experiments of sizes \(m_1\) and \(m_2\). Suppose further that the current study allows for the recruitment of n experimental units, \(n_1\) of which will receive \(\varvec{D}_1\), \(n_2\) of which will receive \(\varvec{D}_2\) and \(n_{12}\) of which will receive both drugs. Obviously

$$\begin{aligned} n=n_{1}+n_{2}+n_{12} \end{aligned}$$
(27)

and \(\theta \) cannot be estimated unless \(n_{12}>0\). However, it is possible that \(n_{1}=n_{2}=0\). The goal is to allocate the experimental units optimally, which is equivalent to the problem of optimally allocating \(n+m_1+m_2\) observations in an experiment in which the single-dose arms are no smaller than \(m_1\) and \(m_2\), respectively. The optimal design problem can be approximated as the minimization of the large-sample variance of (26),

$$\begin{aligned} \frac{1}{n_{12}}\frac{1-\theta }{\theta }+\frac{1}{n_1+m_1}\frac{1-\eta _{1}}{\eta _1}+\frac{1}{n_2+m_2}\frac{1-\eta _2}{\eta _2}, \end{aligned}$$
(28)

subject to the constraint (27).
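Although the exact solution is an integer program, the form of the minimizer of (28) can be anticipated heuristically. Writing \(c_{12}=(1-\theta )/\theta \), \(c_{1}=(1-\eta _{1})/\eta _{1}\) and \(c_{2}=(1-\eta _{2})/\eta _{2}\), and ignoring the integer and nonnegativity constraints, a standard Lagrange-multiplier argument applied to \(c_{12}/n_{12}+c_{1}/(n_{1}+m_{1})+c_{2}/(n_{2}+m_{2})\) with the total \(n_{12}+(n_{1}+m_{1})+(n_{2}+m_{2})=n+m_{1}+m_{2}\) held fixed gives

$$\begin{aligned} n_{12}\propto \sqrt{c_{12}},\qquad n_{1}+m_{1}\propto \sqrt{c_{1}},\qquad n_{2}+m_{2}\propto \sqrt{c_{2}}, \end{aligned}$$

that is, a Neyman-type square-root allocation of the effective arm sizes; whenever the implied \(n_{1}\) or \(n_{2}\) is negative, the corresponding constraint binds and that arm receives no new observations.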

In contrast with the design problem encountered in Sect. 4.1, the design criterion here depends on the unknown parameters, i.e., the probabilities \(\theta \) and \(\varvec{\eta }\). We propose allocating observations as if \(\eta _1=\widehat{\eta }_1\) and \(\eta _2=\widehat{\eta }_2\), and as if \(\theta =\widehat{\eta }_1\widehat{\eta }_2\), its estimated value under the hypothesis of Bliss independence.

One simple approach to the minimization of (28) is the following greedy iterative procedure, which sequentially allocates each observation to the arm in which the variance is reduced the most:

[Figure a: the greedy allocation algorithm]
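A minimal Python sketch of one natural version of such a greedy procedure follows; it is our reconstruction from the description above, not necessarily identical to the published algorithm. At each step the next unit goes to the arm whose term in (28) decreases the most:

```python
def greedy_allocate(n, m1, m2, eta1_hat, eta2_hat):
    """Greedily allocate n new units among the three arms so as to
    (approximately) minimize the plug-in version of the variance (28).

    m1, m2             -- historical single-dose sample sizes
    eta1_hat, eta2_hat -- historical estimates of the no-effect
                          probabilities; theta is taken to be their
                          product, as under Bliss independence.
    Returns (n12, n1, n2) with n12 + n1 + n2 == n.
    """
    theta_hat = eta1_hat * eta2_hat
    coef = [(1 - theta_hat) / theta_hat,   # combination arm
            (1 - eta1_hat) / eta1_hat,     # drug 1 alone
            (1 - eta2_hat) / eta2_hat]     # drug 2 alone
    size = [0, m1, m2]                     # current effective arm sizes
    new = [0, 0, 0]                        # new units allocated so far
    for _ in range(n):
        # Adding one unit to an arm of size k lowers its variance term
        # by c/k - c/(k+1) = c/(k*(k+1)); an empty arm has infinite gain,
        # so the first unit always goes to the combination arm.
        gains = [c / (k * (k + 1)) if k > 0 else float("inf")
                 for c, k in zip(coef, size)]
        j = gains.index(max(gains))
        size[j] += 1
        new[j] += 1
    return tuple(new)
```

By construction the returned allocation sums to n and places at least one unit in the combination arm, without which \(\theta \) could not be estimated.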

For example, if \(m_1=30\), \(m_2=50\), \(\widehat{\eta }_{1}=0.7\) and \(\widehat{\eta }_{2}=0.8\), and 56 observations are available, then 55 observations are placed in the arm where both treatments are administered, and only one is used to improve the estimate of \(\eta _{1}\). Table 3 tabulates the optimal allocation \((n_{12}, n_{1}, n_{2})\). For selected combinations of the values of \(m_1\), \(m_2\), \(\eta _1\), \(\eta _2\), the table gives the minimal value of n, denoted \(n_{\textrm{min}}\), for which replications of the historical observations are needed, and then the optimal allocation for \(n_{\textrm{min}}\). As one would expect, when \(\theta \) is closer to 0.5 than \(\eta _1\) or \(\eta _2\), the optimal design allocates a larger sample size \(n_{12}\) to estimating \(\theta \) than \(m_1\) or \(m_2\). In the opposite case, \(n_{12}\) is smaller than \(m_1\) or \(m_2\).

As pointed out by a reviewer, the optimal design may not be unique. For example, in the first entry of the table, where \(\eta _1=\eta _2 =0.3\) and \(m_1=m_2=10\), if one has 23 observations to allocate, then 22 observations should be used for estimating the joint effect \(\theta \), and one observation should be used to improve the estimate of either one of the individual effects \(\eta _1\) or \(\eta _2\).

5 Summary and discussion

This paper focuses on the situation in which historically available information, in the form of parameter estimates, is either incorporated in the analysis of a current study or used to plan a future experiment. We did not explicitly discuss the large literature on data combination schemes or other two-stage plug-in methods. As mentioned, when historical estimates are incorporated in an analysis, their variability is rarely accounted for. A partial list of examples, drawn from the scientific literature, was furnished earlier, and many more exist. However, it seems very difficult to find published research in which the details are given to the extent that would make replication of the analysis possible. This limits one’s ability to apply the results of this paper to published research. Nevertheless, the results presented here will inform future researchers of the scope and use of historical estimates and provide a toolkit for doing so. We hope that our investigation may have an effect on the quality of future analyses and publication standards.

We also agree with a referee who noted that an estimate is always a coarsening of the full data; it is clear that having access only to a historical estimate, rather than to the entire data, leads to less efficient inferences.

Table 3 The minimal sample size \(n_{\textrm{min}}\), at which the optimal allocation requires repeating historical observations, for selected values of the no-effect probabilities \(\eta _1\), \(\eta _2\) and historical sample sizes \(m_1\), \(m_2\) in a test of drug interaction

Different disciplines exhibit different modes of using historical estimates. Social scientists often incorporate estimates from surveys in the process of model fitting, whereas biologists and engineers may use parameters estimated in experiments very different from their own. One way, of course, of incorporating historical estimates is through prior distributions within the Bayesian framework; for recent examples see Hoff (2019) and Bryan and Hoff (2020). Our approach, however, is frequentist, as are most of the applications in the literature. In particular, we show how to incorporate historical estimates in a principled way in scenarios which we classify as Type I Problems, where the historical parameters are not re-estimated, and Type II Problems, where they are. Two variants of Type II Problems are described; see Theorems 3.1, 3.2 and 3.4. We also show that if, given the current data \(\mathcal {D}\), it is possible to re-estimate the historical parameters, then it is beneficial to do so, at least for large sample sizes (Theorem 3.3). Other preference relations, in fact a hierarchy, among the estimators and any functions thereof were also established, cf. Theorems 3.5 and 3.6. The loss of precision in the above-mentioned settings is quantified in terms of a decomposition of the variance matrices. It was also demonstrated that the availability of historical estimates should be taken into account when an optimal experiment is designed. In particular, relevant methods for a two-way ANOVA and for testing drug interaction were discussed. Thus the results of this paper go beyond the existing knowledge on the use of historical estimates.

In our analysis we have assumed that the current data \(\mathcal {D}\) are a random sample and that the estimating Eq. (2) is of an additive form. These assumptions have been used merely to simplify the exposition and are easily relaxed to accommodate dependent data and various other estimating functions. It is clear that Type I and II Problems describe a broad range of possibilities; nevertheless, they are insufficient for describing the rich collection of problems in which historical estimates may play a role. For example, our formulation assumes that the historical parameters \(\varvec{\eta }_1,\ldots ,\varvec{\eta }_K\) are distinct. However, in many situations this is not so; in fact, some of the historical studies may be full or partial replicates of each other. In cases where the current study is a partial replicate of a historical study, simple plug-in methods or re-estimation methods may be used. One has to be careful, though, about the choice of the estimates: we are aware of situations where a simple plug-in estimator performs better than a suboptimal re-estimation method. Throughout, we have assumed interchangeability. Clearly there are many experimental settings, especially in the sciences, where this assumption is realistic. In other situations, say clinical trials, heterogeneity rather than interchangeability is the rule. In such cases some modification of the proposed methods, using random effects models, may be possible; see Rukhin (2007) and the references therein.

Finally, it is also worth mentioning that the problem of accounting for historical estimates is naturally related to sequential analysis, where data are collected over time; to meta-analysis, where the effort is to combine information from different sources; and to double sampling, especially non-nested double sampling (Hidiroglou 2001), which attempts to provide better inferences by augmenting and predicting unobserved quantities from existing data sets. The literature on combining surveys (Kim and Rao 2012) is also relevant. Further understanding may possibly be attained by incorporating ideas from these fields.