1 Introduction

Meta-analysis methodology, developed for synthesizing information across multiple independent (but comparative) [1] sources, has a long history and remains a popular research topic in statistics [2,3,4,5,6]. It is particularly useful in settings where a single study is inadequate for drawing a reliable conclusion and conclusions can be strengthened by aggregating information from all studies of the same or similar kind. Meta-analysis has become a broadly used tool in many fields, such as biomedical research, pathology, library and information science, and education. One research topic in meta-analysis that remains open is how to handle an observed zero-total-event study, defined to be a study that observes zero events in both the treatment and control arms; cf. [7,8,9]. This problem has been debated since the high-profile publication by Nissen and Wolski [1], as there are divergent but inconclusive views on how to handle zero-total-event studies [7, 10]. In this article, we revisit this problem and propose a novel exact meta-analysis procedure to handle zero-total-event studies.

This line of research is motivated by the drug safety evaluation study of Nissen and Wolski [1] on the use of the diabetes drug Avandia. In Nissen and Wolski [1], the authors collected data from 48 clinical studies and conducted a meta-analysis to assess whether Avandia significantly increases the risk of myocardial infarction and death from cardiovascular diseases. Most of these studies reported zero or a very small number of events in one or both of the treatment and control groups. Nissen and Wolski [1] used Peto’s method to combine information across all studies, which effectively discarded more than half of the 48 studies in the analysis of cardiovascular death (25 of the 48 studies are zero-total-event studies). This practice was challenged by Diamond et al. [11], initiating a heated debate, with diverging views in both the pharmaceutical industry and the biostatistics community, on how to handle observed zero-total-event studies. The key difficulties are that 0/0 has no mathematical definition and that most existing meta-analysis methods rely on normality or large-sample justifications and are therefore not suited for the analysis of zero-total-event studies. Indeed, as stated in Xie et al. [10], when the event probabilities in the treatment and control arms \((\pi _{0k}, \pi _{1k})\) are both nonzero (even if very small), the probability of observing a zero-total-event study tends to zero as the numbers of patients in the two arms \(n_k \rightarrow \infty\) and \(m_k \rightarrow \infty\). Thus, when a zero-total-event study is observed, it is an indication that the sample sizes are not large enough for this particular underlying setting. To date, the statistical inference problem at the center of this debate remains open and unanswered [7, 10].

Consider a typical setting of K independent clinical trials (control vs. treatment): \(X_k \sim\) Binomial\((n_k, \pi _{0k})\) and \(Y_k \sim\) Binomial\((m_k, \pi _{1k})\), \(k = 1, \ldots , K\). We can often express the sample data in K \(2 \times 2\) tables (cf. Table 1), with \(X_k\) and \(Y_k\) being the numbers of events in the control and treatment arms of the \(k\mathrm{{th}}\) trial.

Table 1 \(2\times 2\) clinical study with control and treatment

For \(k = 1, 2, \ldots, K\), \((\pi _{0k},\pi _{1k})\) is often reparameterized to \((\theta _k, \eta _{k})\), with the log odds ratio \(\theta _k = \log \bigg ({\pi _{1k} \over 1-\pi _{1k}}\bigg /{\pi _{0k}\over 1-\pi _{0k}}\bigg )\) and \(\eta _{k}= \log \bigg \{\bigg (\frac{\pi _{1k}}{1-\pi _{1k}}\bigg ) \bigg (\frac{\pi _{0k}}{1-\pi _{0k}}\bigg )\bigg \}\). The classical common odds ratio model assumes \(\theta _1 = \cdots = \theta _K \equiv \theta\), but the rates \((\pi _{0k},\pi _{1k})\) are allowed to differ from one study to another; cf. [1, 2, 7, 12, 13], among others. In rare event studies, both \(\pi _{0k} > 0\) and \(\pi _{1k} > 0\), but they are very small. In this case, the observed data, say \(\{x_k^{obs}, y_k^{obs}\}\), are often zero or very small numbers (while \(n_k\) and \(m_k\) can be large, typically in the thousands). The studies with observed data \(x_k^{obs} = y_k^{obs} = 0\) are referred to as zero-total-event studies in the literature; cf. [7, 8]. In this article, we focus on the inference problem for the common log odds ratio \(\theta\). More specifically, we construct a level-\(\alpha\) confidence interval for \(\theta\) with a finite-sample performance guarantee in meta-analysis, while incorporating information from potentially many zero-total-event studies.

The analysis of rare event data, in particular incorporating zero-total-event studies in a meta-analysis, raises specific statistical challenges and has been intensely studied [7,8,9, 13,14,15,16,17]. Most commonly used meta-analysis methods rely on the asymptotic distribution of the combined estimator to make inference. For instance, the widely used inverse variance-weighted method combines point estimators from individual studies, assuming that the distributions of all the estimators can be well approximated by normal distributions. The classical Mantel–Haenszel [18] and Peto [19] methods also rely on a normal approximation to the distribution of the combined estimator. However, normal approximations are ill-suited for rare event data, and in practice they often yield unacceptably low coverage probabilities [13, 15]. In addition, the commonly practiced “continuity correction” (i.e., adding 0.5 or 0.1 to zero cells) has been shown, with compelling evidence, to have undesirable impacts on inference outcomes [14, 15]. Conditional likelihood inference methods have also been proposed for meta-analysis of \(2\times 2\) tables, e.g., Cox and Snell [12]. In particular, one can make inference relying on a conditional likelihood function and the finite-sample Fisher exact test, for which computing algorithms and small-sample approximations have been developed [20, 21]. Under the conditional inference framework, the conditional likelihood function of a zero-total-event study is constant, and thus the study does not contribute to the inference. However, based on the likelihood principle [22], Xie et al. [10] showed that the conditional likelihood, although maintaining test size, loses power (compared to the full-likelihood method) and that the Fisher exact test is not particularly suited for the analysis of zero-total-event clinical trials, a conclusion also reached independently in Finkelstein and Levin [7]. Bayesian methods have also been employed to analyze zero-total-event studies, in which zero-total-event studies typically contribute to the meta-analysis inference. Since the use of priors imposes an additional model assumption and rare event data are very sensitive to the prior choice, it has been argued that Bayesian approaches “may raise more questions than they settle” (cf. Finkelstein and Levin [7]). In recent years, several finite-sample methods have been proposed for rare event data, but for different inference problems. For instance, Tian et al. [13] propose an exact method for meta-analysis of the risk difference \(\pi _{1k}-\pi _{0k}\). Although Tian et al. [13] does not use large-sample approximations, it targets the risk difference and cannot handle the odds ratio parameter. Yang et al. [9] review exact meta-analysis methods with a focus on rare events and show that the method of Tian et al. [13] is a special case of Xie et al. [5]. Cai et al. [16] suggest using a Poisson model to analyze rare event \(2\times 2\) tables. This approach avoids the difficult question of 0/0, but by changing the distributional assumption it also changes the original inference target in the two-binomial \(2\times 2\) tables.

Despite all these efforts, how to handle zero-total-event studies in the analysis of the common odds ratio remains an open and unanswered inference problem in statistics [7, 10]. The debate on zero-total-event studies is centered on two questions: (a) Does a zero-total-event study possess any information concerning the common odds ratio parameter? (b) If it does, how can we effectively incorporate zero-total-event studies in meta-analysis? In Xie et al. [10], the authors showed that zero-total-event studies indeed possess information about the common odds ratio parameter in meta-analysis. In the current article, we provide a solution to the second question of how to effectively include zero-total-event studies to help make a combined inference on the common log odds ratio \(\theta\) in meta-analysis.

Our solution is developed based on a newly developed inferential framework called the repro samples method [23]. The repro samples method uses a simple yet fundamental idea: study the performance of artificial samples that are generated by mimicking the sampling mechanism of the observed data; the artificial samples are then used to help quantify the uncertainty in the estimation of models and parameters. The repro samples development is deeply rooted in and grown from prior developments of artificial-sample-based inference procedures across the Bayesian, frequentist, and fiducial paradigms (i.e., approximate Bayesian computation, the bootstrap, generalized fiducial inference, and inferential models; see further discussions in Xie and Wang [23]). It does not need to rely on likelihood functions or large-sample theories, and it is especially effective for difficult inference problems in which regularity conditions, and thus regular inference approaches, do not apply. Xie and Wang [23] and Wang et al. [24] used the repro samples framework to address two open inference questions in statistics concerning (a) Gaussian mixture models and (b) high-dimensional regression models, where the authors provided finite-sample confidence sets for discrete unknown parameters (i.e., the unknown number of components in the mixture model and the unknown sparse model in the high-dimensional model), along with joint confidence sets for the unknown discrete parameter together with the remaining model parameters. In our current paper, the problem does not involve any discrete parameters. However, we can still use some of the key techniques in the repro samples framework to develop a novel method, with finite-sample supporting theory, to address the highly non-trivial inference problem concerning zero-total-event studies.

The rest of this article is organized as follows. Section 2 introduces the repro samples method and our proposed inference procedure. Section 3 provides extensive simulation studies to examine the performance of the proposed method and compare it with the popular Mantel–Haenszel and Peto methods as well as with an oracle approach. A new analysis of the Avandia data reported in Nissen and Wolski [1] using the proposed repro samples method is provided in Sect. 4. A brief summary and further discussions are given in Sect. 5.

2 Repro Samples Method for Meta-analysis of \(2 \times 2\) Tables

Since the repro samples method is relatively new, we first provide, in Sect. 2.1, a brief description of the method; based on this, we present our new development tailored to zero-total-event studies in Sects. 2.2 and 2.3.

2.1 Notations, Terminologies, and a Brief Review of Repro Samples Method

Suppose the sample data Y \(\in \mathcal{Y}\) are generated from an algorithmic model:

$$\begin{aligned} {Y} = G({\theta }, U) \end{aligned}$$
(1)

where \(G(\cdot , \cdot )\) is a known mapping from \(\Theta \times \mathcal{U} \mapsto \mathcal{Y}\), \(\theta \in \Theta\) is the model parameter, and \({U} = (U_1, \ldots , U_r)^\top \in \mathcal{U} \subset R^r\), \(r>0\), is a random vector whose distribution is known or can be simulated from. Thus, given \({\theta } \in \Theta\), we know how to simulate data Y from (1). In fact, this is the only assumption needed in the repro samples development. The model \(G(\cdot , \cdot )\) can be very complicated, in either explicit or implicit form, including complex examples such as differential equations or generative neural network models. As long as we can generate Y for a given \(\theta\), we can apply the method. Denote the observed data \({y}^{obs} = G({\theta }^{(o)}, {{u}}^{rel})\), where \({\theta }^{(o)} \in \Theta\) is the true value and \({{u}}^{rel}\) is the corresponding (unknown) realization of U.

Let \(T(\cdot , \cdot )\) be a mapping function from \(\mathcal{U} \times \Theta \rightarrow\) \(\mathcal{T} \subseteq R^{q}\), for some \(q \le n\). Also, for each given \({\theta }\), let \(B_{\alpha }({\theta })\) be a Borel set such that

$$\begin{aligned} {\mathbb {P}} \left\{ T({U}, {\theta }) \in B_{ \alpha }({\theta }) \right\} \ge \alpha , \quad 0< \alpha < 1. \end{aligned}$$
(2)

The function T is referred to as a nuclear mapping function. A repro samples method proposes to construct a subset in \(\Theta\):

$$\begin{aligned} \Gamma _{\alpha }({y}^{obs}) = \big \{{\theta }: \exists \, {u}^* \in \mathcal{U} \text{ such } \text{ that } {y}^{obs} = G({{\theta }}, {u}^*), \, T({u}^*, {\theta }) \in B_{\alpha }({\theta }) \big \} \subset \Theta . \end{aligned}$$
(3)

In other words, for a potential value \({\theta }\), if there exists a \({u}^*\) such that the artificial sample \({{y}}^* = G( {{\theta }}, {{u}}^*)\) matches \({y}^{obs}\) (i.e., \({y}^* = {y}^{obs}\)) and \(T({u}^*, {\theta }) \in B_{\alpha }({\theta })\), then we keep this \({\theta }\) in the set. Since \({y}^{obs} = G({{\theta }}^{(o)}, {u}^{rel})\), if \(T({u}^{rel}, {\theta }^{(o)}) \in B_{\alpha }({\theta }^{(o)})\), then \({\theta }^{(o)} \in \Gamma _{\alpha }({y}^{obs})\). Similarly, under the model \({Y} = G({{\theta }}^{(o)}, {U})\), if \(T({U}, {\theta }^{(o)}) \in B_\alpha ({\theta }^{(o)})\), then \({\theta }^{(o)} \in \Gamma _{\alpha }({Y})\). Thus, by construction, \({\mathbb {P}} \big \{{\theta }^{(o)} \in \Gamma _\alpha ({Y}) \big \} \ge {\mathbb {P}} \big \{T({U}, {\theta }^{(o)}) \in B_\alpha (\theta ^{(o)}) \big \} \ge \alpha.\) This proves that \(\Gamma _{\alpha }({y}^{obs})\) is a level-\(\alpha\) confidence set for \({\theta }^{(o)}\). This development is likelihood-free and does not need to rely on any large-sample theories.

The repro samples development utilizes the ideas of inversion and matching of artificial and observed samples. Let us illustrate the development using a very simple toy example with \(Y \sim N(\theta ,1)\). In the form of (1), \(Y = \theta + U\), where \(U \sim N(0,1)\). Suppose the true underlying parameter value is \(\theta ^{(o)} = 1.35\) and the realization is \(u^{rel} = 1.06\), giving us a single observed data point \(y^{obs} = 2.41\). We only know \(y^{obs} = 2.41\) and that \(u^{rel}\) is a realization from N(0, 1); we do not know its value 1.06. We would like to make an inference for \(\theta ^{(o)}\). Let \(T(U, \theta ) = U\); then the level-\(95\%\) Borel set in (2) is the interval \((-1.96, 1.96)\). By the proposed construction (3), we keep, and only keep, those potential \(\theta\) values that can reproduce \(y^{obs} = 2.41\) by setting (matching) \(\theta + u^* = 2.41\) with a (potential) realized error \(u^* \in (-1.96, 1.96)\). This way of getting the set of \(\theta\)’s is essentially an inversion procedure, and it leads to the level-\(95\%\) confidence set (0.45, 4.37), which is exactly the best possible level-\(95\%\) confidence interval obtained by the classical frequentist method when observing a single data point \(y^{obs} = 2.41\).
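
To make the toy example concrete, the following minimal R sketch carries out the inversion step for the same observed value \(y^{obs} = 2.41\); the grid and its resolution are ours and are for illustration only.

```r
# Repro samples inversion for the toy model Y = theta + U, U ~ N(0, 1).
# Keep every theta for which some u* in the level-95% Borel set (-1.96, 1.96)
# reproduces y_obs, i.e., u* = y_obs - theta falls inside the Borel set.
y_obs <- 2.41
theta_grid <- seq(-2, 8, by = 0.01)      # candidate theta values
keep <- abs(y_obs - theta_grid) < 1.96   # matching condition
range(theta_grid[keep])                  # approximately (0.45, 4.37)
```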

The repro samples method does not need to involve the likelihood function and has a finite-sample performance guarantee. Xie and Wang [23] also showed that the repro samples method is more general and flexible and subsumes the Neyman–Pearson framework as a special case. Using the repro samples development in our current paper on meta-analysis of \(2 \times 2\) tables, we ask, for a potential value of the common log odds ratio parameter \(\theta\) and a given confidence level \(\alpha\), whether that \(\theta\) value could have been used to generate an artificial dataset that matches the observed studies. If it could, we keep the \(\theta\) value in our level-\(\alpha\) confidence set. One complication is that there are also nuisance parameters \({\varvec{\eta }} = (\eta _1, \ldots , \eta _K)^T\). We provide our detailed development in Sect. 2.2.

2.2 Repro Samples Method and Finite-Sample Confidence Set for the Common Log Odds Ratio in \(2\times 2\) Tables

For the common odds ratio model in the \(2 \times 2\) tables, we have \(\pi _{0k} = e^{(\eta _{k} + \theta )/2}\) and \(\pi _{1k} = e^{(\eta _{k} - \theta )/2}\), for \(k = 1, \ldots , K\). We denote \({\varvec{X}}=(X_{1},\ldots ,X_{K})^{T}\) and \({\varvec{Y}}=(Y_{1},\ldots ,Y_{K})^{T}\). In the form of (1), the pair of binomial models \(X_k \sim\) Binomial\((n_k, \pi _{0k})\) and \(Y_k \sim\) Binomial\((m_k, \pi _{1k})\), \(k = 1, \ldots , K\), can be re-expressed as

$$\begin{aligned} X_k = \sum _{j = 1}^{n_k} I\{ U_{kj} \le e^{(\eta _{k} + \theta )/2} \}\,\,\hbox { and }\,\, Y_k = \sum _{j = 1}^{m_k} I\{V_{kj} \le e^{(\eta _{k} - \theta )/2} \}, \end{aligned}$$
(4)

where \(U_{kj}\) and \(V_{kj}\) are iid U(0, 1) distributed random variables, for \(j = 1, \ldots , n_k\) or \(m_k\), and \(k = 1, \ldots , K.\) Our observed sample data are \({\varvec{x}}^{obs}=(x^{obs}_{1}, \ldots ,x^{obs}_{K})^{T}\) and \({\varvec{y}}^{obs}=(y^{obs}_{1},\ldots ,y^{obs}_{K})^{T}\), with \(x_k^{obs} = \sum _{j = 1}^{n_k} I\{ u_{kj}^{rel} \le e^{(\eta _{k}^{(o)} + \theta ^{(o)})/2} \}\) and \(y_k^{obs} = \sum _{j = 1}^{m_k} I\{ v_{kj}^{rel} \le e^{( \eta _{k}^{(o)} - \theta ^{(o)})/2} \}\), where \(\theta ^{(o)}\) and \({\varvec{\eta }}^{(o)} =(\eta _{1}^{(o)}, \ldots ,\eta _{K}^{(o)})^{T}\) are the true parameter values, and \({\varvec{u}}_k^{rel} = (u_{k1}^{rel}, \ldots , u_{kn_k}^{rel})^T\) and \({\varvec{v}}_k^{rel} = (v_{k1}^{rel}, \ldots , v_{km_k}^{rel})^T\) are the corresponding realized random vectors that generated \({\varvec{x}}^{obs}\) and \({\varvec{y}}^{obs}\), respectively. The number of tables K and each table’s sample sizes \((n_k, m_k)\) are given (and not necessarily going to infinity). Among the K tables, we allow many zero-total-event studies with \(x_k^{obs} = y_k^{obs} = 0\), but assume that at least one \(x_k^{obs} \not = 0\) and at least one \(y_k^{obs} \not = 0\). Our goal is to use the repro samples method to construct a level-\(\alpha\) confidence interval with guaranteed performance for the common log odds ratio parameter \(\theta ^{(o)}\), while taking care of the remaining K nuisance model parameters \(\eta _{k}\), \(k = 1, \ldots , K\).
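
As an illustration, the representation in (4) can be simulated directly. The R sketch below (with function and argument names of our own choosing) generates one set of K tables for given \(\theta\) and \({\varvec{\eta }}\), assuming a rare-event setting in which \(e^{(\eta _{k} \pm \theta )/2} \le 1\).

```r
# A sketch of the data-generating representation (4): given theta and
# eta = (eta_1, ..., eta_K), draw the K 2x2 tables via iid U(0,1) variates.
generate_tables <- function(theta, eta, n, m) {
  K <- length(eta)
  p0 <- exp((eta + theta) / 2)   # control-arm event probabilities
  p1 <- exp((eta - theta) / 2)   # treatment-arm event probabilities
  x <- integer(K)
  y <- integer(K)
  for (k in 1:K) {
    x[k] <- sum(runif(n[k]) <= p0[k])   # X_k = sum_j I{U_kj <= exp((eta_k + theta)/2)}
    y[k] <- sum(runif(m[k]) <= p1[k])   # Y_k = sum_j I{V_kj <= exp((eta_k - theta)/2)}
  }
  list(x = x, y = y)
}
```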

The Mantel–Haenszel statistic is a commonly used estimator of the common log odds ratio:

$$\begin{aligned} W_{MH}({\varvec{X}},{\varvec{Y}})= \log \Bigg (\sum _{k=1}^{K}R_{k}\bigg /\sum _{k=1}^{K}S_{k}\Bigg ), \end{aligned}$$

where \(R_{k}=X_{k}(m_{k}-Y_{k})/(m_{k}+n_{k})\) and \(S_{k}=Y_{k}(n_{k}-X_{k})/(m_{k}+n_{k})\). To make inference, the Mantel–Haenszel method relies on large-sample theorems, by which

$$\begin{aligned} W({\varvec{X}},{\varvec{Y}};\theta ) = W_{MH}({\varvec{X}},{\varvec{Y}}) - \theta \end{aligned}$$
(5)

is asymptotically normally distributed as both \({n_k} \rightarrow \infty\) and \({m_k} \rightarrow \infty\), for all \(k = 1, \ldots , K\) [2, 25]. In rare event studies, especially those containing zero-total-event studies, the large-sample theorems do not apply, so using the Mantel–Haenszel method is not theoretically justified for zero-total-event studies. However, due to its simplicity and good empirical performance, especially in large-sample situations, we use \(W({\varvec{X}},{\varvec{Y}};\theta )\) in (5) to help develop the nuclear mapping function in our repro samples method and obtain a finite-sample confidence interval for \(\theta\) with guaranteed performance.
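
For reference, the Mantel–Haenszel statistic and the centered version in (5) can be coded in a few lines; the R sketch below uses function names of our own choosing.

```r
# Mantel-Haenszel statistic W_MH and the centered version W(X, Y; theta) in (5).
W_MH <- function(x, y, n, m) {
  R <- x * (m - y) / (m + n)   # R_k = X_k (m_k - Y_k) / (m_k + n_k)
  S <- y * (n - x) / (m + n)   # S_k = Y_k (n_k - X_k) / (m_k + n_k)
  log(sum(R) / sum(S))
}
W_centered <- function(x, y, n, m, theta) W_MH(x, y, n, m) - theta
```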

For sample data generated with parameter values \((\theta , {\varvec{\eta }}^T)\), i.e., \(X_{k} = \sum _{j=1}^{n_k} I\{U_{kj} \le e^{(\eta _{k}+\theta )/2}\}\) and \(Y_{k} = \sum _{j=1}^{m_k} I{\{V_{kj} \le e^{(\eta _{k}-\theta )/2}\}}\), the distribution of \(W({\varvec{X}},{\varvec{Y}};\theta )\) depends on both \(\theta\) and the K nuisance parameters \({\varvec{\eta }} = (\eta _{1}, \ldots , \eta _{K})^T\). We use a profile approach to control the impact of the nuisance parameters \({\varvec{\eta }}\). Specifically, let \({{\widetilde{X}}}_{k} = \sum _{j=1}^{n_k} I{\{U_{kj}' \le e^{({{\widetilde{\eta }}}_{k}+\theta )/2}\}}\) and \({{\widetilde{Y}}}_{k} = \sum _{j=1}^{m_k} I{\{V_{kj}' \le e^{({{\widetilde{\eta }}}_{k}-\theta )/2}\}}\), where \(U_{kj}'\) and \(V_{kj}'\) are iid U(0, 1) distributed random variables. We define, for \(t \ge 0\),

$$\begin{aligned} \gamma _{(\theta , {\widetilde{\eta }}^{T})}\{t\} = {\textbf{P}} \left\{ \big |W(\widetilde{{\varvec{X}}}, \widetilde{{\varvec{Y}}};\theta )\big | < t \right\} . \end{aligned}$$
(6)

In the special case with \(\widetilde{{\varvec{\eta }}} = {\varvec{\eta }}\), we have \(\gamma _{(\theta , \eta ^T)}\{|W({\varvec{X}},{\varvec{Y}};\theta )|\} \sim U(0,1)\). In particular, we can show that \(1 - \gamma _{(\theta , {\widetilde{\eta }}^T)}\left\{ |W({\varvec{x}},{\varvec{y}};\theta )|\right\} = {\textbf{P}} \big \{ \big |W(\widetilde{{\varvec{X}}},\widetilde{{\varvec{Y}}};\theta )\big | \ge \big |W({\varvec{x}},{\varvec{y}};\theta )\big | \big \}\) is the p value for testing the null hypothesis \(H_0\): the sample dataset \(({\varvec{x}},{\varvec{y}})\) is generated from \((\theta , \tilde{{\varvec{\eta }}}^T)\), when in fact the dataset \(({\varvec{x}},{\varvec{y}})\) is generated from \((\theta , {{\varvec{\eta }}}^T)\).

Following the profile method proposed in Xie and Wang [23], we define our nuclear mapping function as

$$\begin{aligned} T({\varvec{X}},{\varvec{Y}};\theta ) = \min _{{{\widetilde{\eta }}} \, \in \, {\textbf{R}}^K} \gamma _{(\theta , {{\widetilde{\eta }}}^T)}\left\{ |W({\varvec{X}},{\varvec{Y}};\theta )|\right\} . \end{aligned}$$
(7)

It is clear that \(T({\varvec{X}},{\varvec{Y}};\theta ) \le \gamma _{(\theta , \eta ^T)}\{|W({\varvec{X}},{\varvec{Y}};\theta )|\}\), i.e., \(T({\varvec{X}},{\varvec{Y}};\theta )\) is dominated by \(\gamma _{(\theta , \eta ^T)}\{|W({\varvec{X}},{\varvec{Y}};\theta )|\}\). Since \(X_{k} = \sum _{j=1}^{n_k} I\{U_{kj} \le\) \(e^{(\eta _{k}+\theta )/2}\}\) and \(Y_{k} = \sum _{j=1}^{m_k} I\{V_{kj} \le\) \(e^{(\eta _{k}-\theta )/2}\}\), the mapping \(T({\varvec{X}},{\varvec{Y}};\theta )\) is a function of \({{\varvec{U}}} = \{U_{kj}, 1 \le j \le n_k, 1 \le k \le K\}\), \({{\varvec{V}}} = \{V_{kj}, 1 \le j \le m_k, 1 \le k \le K\}\), and \((\theta , {\varvec{\eta }}^T)\). Thus, for a given \(\theta\), the distribution of \(T({\varvec{X}},{\varvec{Y}};\theta )\) still depends on the nuisance parameter \({\varvec{\eta }}\). However, we always have

$$\begin{aligned} {\textbf{P}}\left\{ T({\varvec{X}},{\varvec{Y}};\theta ) \le \alpha \right\} \ge {\textbf{P}}\left[ \gamma _{(\theta , \eta ^T)}\{|W({\varvec{X}},{\varvec{Y}};\theta )|\} \le \alpha \right] = \alpha . \end{aligned}$$
(8)

Thus, a Borel set corresponding to (2) is \(B_\alpha = (0, \alpha ]\), which is free of both \(\theta\) and \({\varvec{\eta }}\).

Following (3), the level-\(\alpha\) repro samples confidence set for \(\theta\) is

$$\begin{aligned} \Gamma _\alpha ({\varvec{x}}^{obs}, {\varvec{y}}^{obs})&= \big \{\theta : \exists \, ({{\varvec{u}}}^*, {{\varvec{v}}}^*) \hbox { and } {{\varvec{\eta }}} \hbox { such that } ({{\varvec{x}}}^{obs}, {{\varvec{y}}}^{obs}) = ({{\varvec{x}}}^*, {{\varvec{y}}}^*), \, T({\varvec{x}}^*, {\varvec{y}}^*;\theta ) \le \alpha \big \} \nonumber \\&= \big \{\theta : \exists \, ({{\varvec{u}}}^*, {{\varvec{v}}}^*) \hbox { and } {{\varvec{\eta }}} \hbox { such that } ({{\varvec{x}}}^{obs}, {{\varvec{y}}}^{obs}) = ({{\varvec{x}}}^*, {{\varvec{y}}}^*), \, T({\varvec{x}}^{obs}, {\varvec{y}}^{obs};\theta ) \le \alpha \big \} \nonumber \\&= \left\{ \theta : T({\varvec{x}}^{obs}, {\varvec{y}}^{obs};\theta ) \le \alpha \right\} , \end{aligned}$$
(9)

where \({{\varvec{x}}}^* = (x_1^*, \ldots , x_K^*)^T\) and \({{\varvec{y}}}^* = (y_1^*, \ldots , y_K^*)^T\), with \(x_k^* = \sum _{j = 1}^{n_k} I\{ u^*_{kj} \le e^{(\eta _{k} + \theta )/2} \}\) and \(y_k^* = \sum _{j = 1}^{m_k} I\{v^*_{kj} \le e^{(\eta _{k} - \theta )/2} \},\) for \(1 \le k \le K.\) The first equation of (9) follows from the repro samples construction (3). The last equation holds since, for any given \(\theta\), there always exist \(({{\varvec{u}}}^*, {{\varvec{v}}}^*)\) and \({\varvec{\eta }}\) such that \(({{\varvec{x}}}^{obs}, {{\varvec{y}}}^{obs}) = ({{\varvec{x}}}^*, {{\varvec{y}}}^*)\).

By Eq. (8), we have the following theorem.

Theorem 1

Under the above setup, suppose the random samples are generated using the parameter values \((\theta ^{(o)}, {\varvec{\eta }}^{(o)T})\), i.e., \(X_{k} = \sum _{j=1}^{n_k} I{\{U_{kj} \le e^{(\eta _{k}^{(o)}+\theta ^{(o)})/2}\}}\) and \(Y_{k} = \sum _{j=1}^{m_k} I{\{V_{kj} \le e^{(\eta _{k}^{(o)}-\theta ^{(o)})/2}\}}\). Then we have

$$\begin{aligned} {\textbf{P}}\left\{ \theta ^{(o)} \in \Gamma _\alpha ({\varvec{X}}, {\varvec{Y}}) \right\} \ge \alpha . \end{aligned}$$

The theorem states that the set \(\Gamma _\alpha ({\varvec{x}}^{obs}, {\varvec{y}}^{obs}) = \big \{\theta : T({\varvec{x}}^{obs}, {\varvec{y}}^{obs};\theta ) \le \alpha \big \}\) is a level-\(\alpha\) confidence set for the common log odds ratio \(\theta ^{(o)}\). This theoretical result holds for any given number of studies K and any sample sizes \((n_k, m_k)\), \(k = 1, \ldots , K\). To ensure the finite-sample theoretical guarantee, we have used a profile method to control against all possible cases for the K nuisance parameters \(\eta _k\), some of which may be very extreme. Thus, by design, the approach is expected to be conservative. See Sects. 4 and 5 for a numerical study and additional discussions.

2.3 Monte Carlo Implementation and Computing Algorithm

To construct the level-\(\alpha\) confidence set in (9), we need to calculate \(T({\varvec{x}}^{obs}, {\varvec{y}}^{obs};\theta ) = \min _{{{\widetilde{\eta }}}} \gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{|W({\varvec{x}}^{obs},{\varvec{y}}^{obs};\theta )|\}\) for each potential \(\theta\) value. This can be done using a Monte Carlo method to approximate \(\gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{|W({\varvec{x}}^{obs},{\varvec{y}}^{obs};\theta )|\}\). Specifically, for any fixed \((\theta , \widetilde{{\varvec{\eta }}}^T)\), we can approximate the function \(\gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\}\) by

$$\begin{aligned} \gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\} \approx \frac{1}{M} \sum _{s=1}^M {I}\left\{ \big |W(\widetilde{{\varvec{x}}}^{(s)},\widetilde{{\varvec{y}}}^{(s)};\theta )\big | < t \right\} , \end{aligned}$$
(10)

where \(\widetilde{{\varvec{x}}}^{(s)} = ({{\widetilde{x}}}_{1}^{(s)}, \ldots , {{\widetilde{x}}}_{K}^{(s)})^T\), \(\widetilde{{\varvec{y}}}^{(s)} = ({{\widetilde{y}}}_{1}^{(s)}, \ldots , {{\widetilde{y}}}_{K}^{(s)})^T\), \({{\widetilde{x}}}_{k}^{(s)} = \sum _{j=1}^{n_k} I{\{U_{kj}^{(s)} \le e^{({{\widetilde{\eta }}}_{k}+\theta )/2}\}}\), \({{\widetilde{y}}}_{k}^{(s)} = \sum _{j=1}^{m_k} I{\{V_{kj}^{(s)} \le e^{({{\widetilde{\eta }}}_{k}-\theta )/2}\}}\), and \((U_{kj}^{(s)}, V_{kj}^{(s)})\) are simulated iid U(0, 1) random numbers, for \(s = 1, \ldots , M\). In this way, we can approximate \(\gamma _{(\theta , {\widetilde{\eta }}^T)}\{|W({\varvec{x}}^{obs},{\varvec{y}}^{obs};\theta )|\}\), which is a function of \((\theta , \widetilde{{\varvec{\eta }}}^T)\) only. We then call an optimization routine to find its minimum value over \(\tilde{{\varvec{\eta }}}\), which yields \(T({\varvec{x}}^{obs},{\varvec{y}}^{obs};\theta )\) as a function of \(\theta\) for the given \(({\varvec{x}}^{obs},{\varvec{y}}^{obs})\).
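
The following R sketch illustrates the Monte Carlo approximation (10) and the profile minimization in (7), building on the generate_tables and W_centered sketches above; the optimizer choice (Nelder–Mead via optim) and the rough handling of degenerate simulated tables are our own assumptions, not prescriptions of the method.

```r
# Monte Carlo approximation of gamma_{(theta, eta~)}{t} in (10).
gamma_mc <- function(theta, eta_tilde, t, n, m, M = 1000) {
  w_abs <- replicate(M, {
    tab <- generate_tables(theta, eta_tilde, n, m)   # artificial tables, as in (4)
    abs(W_centered(tab$x, tab$y, n, m, theta))
  })
  mean(w_abs < t, na.rm = TRUE)   # drop replicates where W_MH is undefined (0/0)
}

# Profile nuclear mapping T(x_obs, y_obs; theta) in (7): minimize over eta~.
T_profile <- function(theta, x_obs, y_obs, n, m, eta_init, M = 1000) {
  t_obs <- abs(W_centered(x_obs, y_obs, n, m, theta))
  fit <- optim(eta_init,
               function(eta) gamma_mc(theta, eta, t_obs, n, m, M),
               method = "Nelder-Mead")
  fit$value   # approximate minimum of gamma over eta~
}
```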

The Monte Carlo mean-squared error (MSE) of using (10) to approximate \(\gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\}\) is \(E\left( \frac{1}{M} \sum _{s=1}^M {I}\left\{ \big |W(\widetilde{{\varvec{x}}}^{(s)},\widetilde{{\varvec{y}}}^{(s)};\theta )\big | < t \right\} -\gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\} \right) ^2\) \(\approx \frac{\gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\}\big [1 - \gamma _{(\theta , {{\widetilde{\eta }}}^T)}\{t\} \big ]}{M}\) \(\le \frac{1}{4M};\) cf. Koehler et al. [26, Sect. 4.1]. So, if we take \(M=1000\), the Monte Carlo MSE is controlled to be at most 0.00025. This bound does not depend on \(\theta\) or \({\varvec{\eta }}\), nor on the number of studies in the meta-analysis or the sample sizes of the individual studies.

We provide below a computing algorithm.

Algorithm 1 Calculation of the confidence interval for the common log odds ratio
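
A minimal R sketch of Algorithm 1, under our assumptions and building on the sketches above, is as follows; the candidate grid of \(\theta\) values and its resolution are left to the user, e.g., confined to a wide Mantel–Haenszel interval as described in Sect. 3.

```r
# Sketch of Algorithm 1: scan candidate theta values and keep those with
# T(x_obs, y_obs; theta) <= alpha, per the confidence set (9).
repro_ci <- function(x_obs, y_obs, n, m, theta_grid, eta_init,
                     alpha = 0.95, M = 1000) {
  keep <- vapply(theta_grid, function(theta) {
    T_profile(theta, x_obs, y_obs, n, m, eta_init, M) <= alpha
  }, logical(1))
  range(theta_grid[keep])   # endpoints of the level-alpha confidence interval
}
```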

3 Simulation Studies

In this section, we provide three simulation studies to examine the empirical performance of the proposed repro samples method for making inference on the common log odds ratio \(\theta\) and to compare it with the popular Peto and Mantel–Haenszel methods. In particular, we report the empirical coverage probabilities and average lengths of the confidence intervals based on 500 replications with \(M=1000\).

In the first simulation study, to generate simulated data, we design a setting similar to the structure of the Avandia dataset, following Tian et al. [13] and Liu et al. [8]. Concretely, \(K=48\) independent \(2\times 2\) tables were generated using the same sample sizes as the Avandia dataset. The true incidence rate \(\pi _{1k}^{(o)}\) in the kth trial was generated from a uniform distribution U(0, 0.008). Then, the incidence rate \(\pi _{0k}^{(o)}\) was determined by the relationship \(\textrm{logit}(\pi _{0k}^{(o)})=\theta ^{(o)}+\textrm{logit}(\pi _{1k}^{(o)})\), where several true common log odds ratio values \(\theta ^{(o)}\) were examined under various scenarios. Finally, the kth table was simulated from the binomial distributions with the generated \((\pi _{0k}^{(o)},\pi _{1k}^{(o)})\).

In the implementation of our repro samples algorithm, we confine the potential \(\theta\) values to the 99.95% confidence interval for \(\theta ^{(o)}\) obtained using the Mantel–Haenszel approach. For each \(\theta\), note that the nuclear mapping involves a minimization over \({\varvec{\eta }}=(\eta _{1},\ldots ,\eta _{K})^T\) with \(K=48\). We apply the R function optim in the package ‘stats’ to find the minimum value. In the implementation of the minimization via optim, an initial value of \({\varvec{\eta }}\) needs to be specified. Recall that \(\eta _{k}=\log \big (\pi _{1k}/(1-\pi _{1k})\big )+\log \big (\pi _{0k}/(1-\pi _{0k})\big )\) for \(k=1,\ldots ,K\). Then, if the kth trial has nonzero events in both groups, the initial value of \(\eta _{k}\) is given by \({\hat{\eta }}_{k}=\log \Big (\frac{{\hat{\pi }}_{1k}}{1-{\hat{\pi }}_{1k}}\Big )+\log \Big (\frac{{\hat{\pi }}_{0k}}{1-{\hat{\pi }}_{0k}}\Big )\), where \({\hat{\pi }}_{1k}=x_k/n_k\) and \({\hat{\pi }}_{0k}=y_k/m_k\). However, this does not work for trials with zero events in one or both arms. In view of the similarity among all the trials, we use \(\min \{{\hat{\eta }}_{k}:k\text{ th } \text{ trial } \text{ has } \text{ nonzero } \text{ events } \text{ in } \text{ both } \text{ groups }, 1\le k \le K\}\) as the initial value of \(\eta _{k}\) for trials with zero events in one or both groups (a sketch of this initialization is given below).
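
For completeness, a small R sketch of this initialization (with a function name of our own choosing) is given below; it exploits the fact that \(\eta _{k}\) is symmetric in the two arms, so the labeling of \({\hat{\pi }}_{1k}\) and \({\hat{\pi }}_{0k}\) does not affect \({\hat{\eta }}_{k}\).

```r
# Initial values of eta for optim, following the rule described above:
# eta_hat_k from the empirical odds when both arms have events; otherwise
# the minimum eta_hat_k across trials with events in both arms.
eta_init_values <- function(x_obs, y_obs, n, m) {
  p1 <- x_obs / n
  p0 <- y_obs / m
  eta_hat <- log(p1 / (1 - p1)) + log(p0 / (1 - p0))
  both <- x_obs > 0 & y_obs > 0      # trials with nonzero events in both groups
  eta_hat[!both] <- min(eta_hat[both])
  eta_hat
}
```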

Table 2 summarizes the empirical results obtained from 500 data replications, where the common odds ratio takes different values. If the common odds ratio equals 1, the treatment and control groups have equal odds and the treatment has no effect on the event occurrence rate. When the common odds ratio is larger than 1, the probability of event occurrence is lower for the control group; when the common odds ratio is less than 1, the probability of event occurrence is lower for the treatment group. From Table 2, we can see that the proposed repro samples method produces valid confidence intervals at the prespecified confidence level 95% for all values of \(\theta ^{(o)}\). The empirical coverage probabilities of the Mantel–Haenszel method are mostly on target, although a few show slight undercoverage. The Peto method has the worst numerical performance. It only works for moderate values of \(\theta ^{(o)}\) and breaks down for large and small \(\theta ^{(o)}\) values. As indicated in Table 2, the empirical coverage probabilities of the Peto method are far below 0.95 when the true odds ratio is larger than 3. The coverage rates even reach zero when the true odds ratio is 6 or 7. Although the repro samples approach has the desired coverage rates across all cases, the interval lengths are a little longer than those of the other two approaches, as we anticipated.

We designed the second numerical study to demonstrate that the proposed repro samples method can effectively extract the information hidden in zero-total-event studies about the common odds ratio parameter. Suppose we have two datasets, both of which include two non-zero-total-event studies and three zero-total-event studies: (a) (3/100, 2/100), (2/300, 1/300), (0/600, 0/300), (0/600, 0/300), (0/300, 0/300); and (b) (2/100, 2/100), (1/50, 1/50), (0/100, 0/300), (0/100, 0/300), and (0/100, 0/300). For each of the two datasets, we use our algorithm to obtain two level-\(95\%\) confidence intervals for the common log odds ratio \(\theta ^{(o)}\), one using all five studies and the other using only the two non-zero-total-event studies (excluding the three zero-total-event studies). Figure 1 depicts the comparisons of these two sets of intervals. From the figure, we can see that the confidence intervals obtained by excluding the three zero-total-event studies are substantially wider than the intervals obtained by including them. This set of results further affirms the conclusion that zero-total-event studies carry information about, and impact the inference of, the common odds ratio, as discussed in Xie et al. [10]. Overall, the repro samples method provides an approach to effectively include zero-total-event studies in the analysis of the common odds ratio parameter in meta-analysis.

The repro samples method uses a profiling technique to control all possible scenarios of the unknown nuisance parameters \({\varvec{\eta }}\). The intervals obtained by the method are therefore often conservative. In the third simulation study, we examine the conservativeness of the proposed intervals. We compare the proposed repro samples method (labeled “repro (profile)”) to an oracle method (labeled “repro (oracle)”) and a corresponding plug-in method (labeled “repro (plug-in)”). Here, the confidence set of the oracle method is \(\big \{\theta : \gamma _{(\theta , {\eta ^{(o)}}^T)}\big \{|W({\varvec{X}},{\varvec{Y}}; \theta )|\big \} \le \alpha \big \}\), where the unknown nuisance parameters \({\varvec{\eta }}\) are replaced by their true values \({{\varvec{\eta }}^{(o)}} = (\eta _1^{(o)}, \ldots , \eta _K^{(o)})^T\). For the plug-in method, the set constructed is \(\big \{\theta : \gamma _{(\theta , {{\widehat{\eta }}}^T)}\big \{|W({\varvec{X}},{\varvec{Y}};\theta )|\big \} \le \alpha \big \}\), where the unknown nuisance parameters \({\varvec{\eta }}\) are replaced by a set of point estimators \(\widehat{{\varvec{\eta }}} = ({{\hat{\eta }}}_1, \ldots , {{\hat{\eta }}}_K)^T\). Here, the point estimator \({{\hat{\eta }}}_k\) is computed as follows: when the kth trial has nonzero events in both groups, we set \({\hat{\eta }}_{k}=\log \Big (\frac{{\hat{\pi }}_{1k}}{1-{\hat{\pi }}_{1k}}\Big )+\log \Big (\frac{{\hat{\pi }}_{0k}}{1-{\hat{\pi }}_{0k}}\Big )\), where \({\hat{\pi }}_{1k}=x_k/n_k\) and \({\hat{\pi }}_{0k}=y_k/m_k\); otherwise, we set \({{\hat{\eta }}}_k = \min \{{\hat{\eta }}_{k'}: k'\)th trial has nonzero events in both groups, \(1\le k' \le K\}\). The plug-in method can be implemented in practice as an alternative repro samples method. However, when large-sample theories do not apply, the theoretical properties of the estimators \({{\hat{\eta }}}_k\) are not known, and thus the finite-sample theoretical performance of the plug-in confidence set is not known either. Table 3 summarizes the numerical results of the comparison. The sampling data in Table 3 were generated by mimicking the structure of the Avandia dataset, as in the first simulation study, but with the true common odds ratio ranging only from 1.0 to 1.9. To simulate the rare event case, the true values of \(\pi _{1k}^{(o)}\) in Table 3(a) were generated from U(0, 0.008). We also considered a non-rare-event case in Table 3(b), in which the true values of \(\pi _{1k}^{(o)}\) were generated from U(0.1, 0.3). In the rare event case, the intervals from the proposed profile method are about 10–13% longer than those from the oracle method using the true nuisance parameters, while in the non-rare-event case they are about 7–8% longer. The intervals from the plug-in method are only slightly (less than 3%) longer than those from the oracle method, demonstrating good efficiency, although on a few occasions the numerical coverage rates are slightly below the desired level of 0.95.
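
A short R sketch of the plug-in variant described above, building on the gamma_mc, W_centered, and eta_init_values sketches from earlier sections (function names are ours), is as follows.

```r
# Plug-in confidence set: replace the profile minimization over eta~ with the
# point estimates eta_hat and keep theta with gamma_{(theta, eta_hat)} <= alpha.
plug_in_ci <- function(x_obs, y_obs, n, m, theta_grid, alpha = 0.95, M = 1000) {
  eta_hat <- eta_init_values(x_obs, y_obs, n, m)
  keep <- vapply(theta_grid, function(theta) {
    t_obs <- abs(W_centered(x_obs, y_obs, n, m, theta))
    gamma_mc(theta, eta_hat, t_obs, n, m, M) <= alpha
  }, logical(1))
  range(theta_grid[keep])
}
```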

In summary, in the first simulation study, designed to mirror the structure of the Avandia dataset and rare event trials, both the proposed repro samples method and the Mantel–Haenszel method displayed reasonable empirical performance, while the Peto method fails when the common odds ratio is away from 1. Theoretically, both the Mantel–Haenszel and Peto methods are valid in situations where the large-sample theories apply, but their performance is not guaranteed in situations where the large-sample theories are violated. The proposed repro samples method has theoretically guaranteed performance in any (both small- and large-sample) setting, although it is a little conservative due to a profiling step designed to ensure its theoretical validity in all scenarios. In the second simulation study, the majority of the studies are zero-total-event studies (three out of five) in both settings. We see that the proposed repro samples method can incorporate the information from these zero-total-event studies. The Mantel–Haenszel and Peto methods, on the other hand, can only utilize the non-zero-total-event studies in their respective computations, implicitly disregarding the information from the majority zero-total-event studies. This study underscores a situation where the repro samples method may be the only viable option. Finally, the third simulation study confirms that the proposed repro samples method, which utilizes the profiling technique to handle the nuisance parameters, is conservative in comparison with the oracle method using the true nuisance parameter values as well as the corresponding plug-in method using point estimates of the unknown nuisance parameters. The plug-in approach can possibly serve as a more efficient alternative repro samples approach in practice, although its finite-sample theoretical performance is not known, the same as for the Mantel–Haenszel and Peto methods.

Table 2 Comparisons of the 95% confidence intervals of the common log odds ratio from the Mantel–Haenszel, Peto, and repro samples methods. Sampling data are generated by mimicking the structure of the Avandia dataset, with different true common odds ratios and true incidence rates generated from the uniform distribution on (0, 0.008)
Fig. 1

Illustration of the impact of zero-total-event studies on level-\(95\%\) confidence intervals of the common log odds ratio: analysis using two studies (removing zero-total-event studies) versus analysis using all five studies. The two datasets used are: (a) (3/100, 2/100), (2/300, 1/300), (0/600, 0/300), (0/600, 0/300), and (0/300, 0/300); and (b) (2/100, 2/100), (1/50, 1/50), (0/100, 0/300), (0/100, 0/300), and (0/100, 0/300)

Table 3 Comparisons of the 95% confidence intervals of the common log odds ratio from the proposed repro samples (profile) method with the corresponding oracle and plug-in approaches. Sampling data are generated by mimicking the structure of the Avandia dataset, with the true common odds ratio ranging from 1.0 to 1.9

4 Real Data Analysis

Avandia is the trade name of the drug rosiglitazone, which was widely used for the treatment of type II diabetes mellitus. The Avandia dataset studied in Nissen and Wolski [1] includes data from \(K =48\) independent clinical trials that examined its effect on cardiovascular morbidity and mortality. Among the 48 trials, there are two large trials with sample sizes of at least 1456 in each group. The other 46 trials have sample sizes of at most 1172 in one arm. In the dataset, the events of myocardial infarction and cardiovascular death have very low incidence rates. Thus, many trials do not contain any, or contain only very few, events of interest, especially for the endpoint of cardiovascular death. Evidently, there exist many trials with zero events in one or both arms. In particular, among the 48 trials, 10 report no events of myocardial infarction and 25 report no events of cardiovascular death in both the treatment and control groups. The entire dataset can be found in Table I of the supplementary material of Tian et al. [13]. Xie et al. [10] used the likelihood principle to show that the zero-total-event trials carry information about the common odds ratio. However, it is a challenging task to effectively incorporate these studies in a meta-analysis [7, 10]. In this article, we re-analyze the Avandia dataset using the newly developed finite-sample repro samples method, along with the widely used Mantel–Haenszel and Peto methods, to construct confidence intervals for the common odds ratio.

Table 4 Analysis of the Avandia dataset: \(95\%\) confidence intervals of the common odds ratio

The 95% confidence intervals for the common odds ratios of myocardial infarction and cardiovascular death obtained by these three approaches, denoted MH, Peto, and Repro-1, respectively, are listed in Table 4. For the endpoint of cardiovascular death, the three methods output similar results. In particular, all three confidence intervals include the value 1. Thus, all three methods suggest that the drug rosiglitazone has no statistically significant effect on death from cardiovascular causes. As for myocardial infarction, the results of the three methods differ. The confidence intervals of the conventional Mantel–Haenszel and Peto methods exclude the value 1, while the one from the repro samples method includes it. According to the Mantel–Haenszel and Peto methods, the drug rosiglitazone has a statistically significant effect, although their lower bounds are only barely above 1. Using the repro samples method, we cannot conclude that the drug rosiglitazone has a statistically significant effect on myocardial infarction, although the repro samples method is conservative and is therefore more likely to produce a lower bound below 1.

Finally, we examine the impact of zero-total-event studies on the common odds ratio confidence intervals in the Avandia dataset. Specifically, we re-run our repro samples algorithm after deleting the zero-total-event studies and compare the confidence intervals obtained without the zero-total-event studies, denoted by Repro-2 in Table 4, with those previously obtained including these zero-total-event studies. From Table 4, we can see that the intervals with and without the zero-total-event studies are quite different. The intervals that include the zero-total-event studies are narrower than those that exclude them. This shows that utilizing zero-total-event studies in meta-analysis is important and beneficial for the inference of the common odds ratio in general. It affirms the conclusion in Xie et al. [10] that zero-total-event studies carry meaningful information and can impact the inference of the common odds ratio.

5 Further Remarks and Discussion

Questions on whether a zero-total-event study contains any information about the common odds ratio in meta-analysis of \(2\times 2\) tables, and how to incorporate such studies when making inference for the common odds ratio, have long been debated and remain open in statistics; cf. [7, 10]. The difficulty is due to the lack of a mathematical definition for 0/0 and also because most existing meta-analysis approaches rely on large-sample theories and normality assumptions, neither of which applies to zero-total-event studies. In this paper, using the recently developed repro samples inferential framework, we develop a finite-sample meta-analysis approach to make inference for the common odds ratio while incorporating information from zero-total-event studies. The developed inference procedure has guaranteed theoretical performance and is validated in numerical studies. It provides an affirmative answer to this set of open research questions.

The repro samples framework is developed based on the ideas of inversion, matching of artificial and observed samples, and simplifying uncertainty quantification through a Borel set concerning U. It does not need any regularity conditions, nor does it rely on any large-sample theories. It can provide finite-sample inference with few assumptions and is an ideal tool to address some difficult and complicated inference problems. Besides the development in this article, the repro samples method can also be used to develop new finite-sample procedures in other meta-analysis settings; for instance, developing a new finite-sample approach to perform meta-analysis and combine information in a random-effects model with only a few studies, a setting studied in Michael et al. [27]. Furthermore, the repro samples method is also very effective for other complex inference problems that involve discrete or non-numerical parameters. For instance, Xie and Wang [23] and Wang et al. [24] provided solutions for two highly nontrivial problems in statistics: (a) how to quantify the uncertainty in the estimation of the unknown number of components and make inference for the associated parameters in a Gaussian mixture model and (b) how to quantify the uncertainty in model estimation and construct confidence sets for the unknown true model, the regression coefficients, or both the true model and the coefficients jointly in high-dimensional regression models. We anticipate that these developments will stimulate further work to address more complicated and non-trivial inference problems in statistics and data science where a solution is currently unavailable or cannot easily be obtained.

Finally, the proposed repro samples method has theoretically guaranteed validity for any given number of studies K and any sample sizes \((n_k, m_k)\), \(k = 1, \ldots , K\). To achieve this strict finite-sample property, it adopts a profiling method to control for any possible values of the nuisance parameters. By design, the method is conservative. Our numerical results also demonstrate that the proposed method is valid but conservative. In practice, the plug-in method described in Sect. 3 can possibly serve as a more efficient (less conservative) alternative, although the theoretical finite-sample property of the plug-in method is not known. It remains of interest in future research to see whether we can improve the power of the proposed repro samples method while maintaining its strict finite-sample validity. Additionally, since the number of nuisance parameters equals the number of studies, the practical performance of the optimization (especially the use of the optim function) is likely to be impacted as the number of studies grows. In our simulation study with \(K = 48\) nuisance parameters, the numerical performance appears to be reasonable. Nevertheless, it is worth further investigating the optimization used in the profile step and exploring other, more effective optimization algorithms or procedures.