# Meta analysis of binary data with excessive zeros in two-arm trials

## Abstract

We present a novel Bayesian approach to random effects meta analysis of binary data with excessive zeros in two-arm trials. We discuss the development of the likelihood accounting for excessive zeros, the prior, and the posterior distributions of the parameters of interest. A Dirichlet process prior is used to account for the heterogeneity among studies. A zero inflated binomial model with excessive zero parameters is used to account for excessive zeros in the treatment and control arms. We then define a modified unconditional odds ratio accounting for excessive zeros in the two arms. Bayesian inference is carried out using Markov chain Monte Carlo (MCMC) sampling techniques. We illustrate the approach using data available in the published literature on myocardial infarction and death from cardiovascular causes. Unlike the commonly used frequentist approaches, the Bayesian approaches presented here use all the data, including the studies with zero events, capture heterogeneity among study effects, and produce interpretable estimates of the overall and study-level odds ratios. Results from the data analysis and the model selection also indicate that the proposed Bayesian method, while accounting for zero events, adjusts for excessive zeros and provides a better fit to the data, so that the estimates of the overall and study-level odds ratios are based on the totality of the information.

## Keywords

Dirichlet process, Model selection, Markov chain Monte Carlo, Simulation

## Introduction

An arm is a standard term in clinical trials; it represents a treatment group or a set of subjects. A two-arm study compares a drug with a placebo, or drug A with drug B. In some of these studies the outcome is binary, that is, each unit can take only one of two possible states, "0" and "1". For example, outcomes of clinical trials such as morbidity and mortality studies are often binary in nature.

The mean and variance for the binomial random variable are *E*(*Y*)=*np* and *Var*(*Y*)=*np*(1−*p*) respectively. In a two-arm trial with binary outcomes, it is typically assumed that \(Y_{T_{1}},...,Y_{T_{k}}\) and \(Y_{C_{1}},...,Y_{C_{k}}\) are random samples from \(Y_{T_{i}} \sim Bin\left (n_{T_{i}},P_{T_{i}}\right)\) and \(Y_{C_{i}} \sim Bin\left (n_{C_{i}},P_{C_{i}}\right)\) respectively, where *k* is the number of studies. In a random effects meta analysis of these types of data, the effect size is assumed to vary from study to study. Random effects meta analysis assumes that study effects are a random sample from an underlying relevant distribution of effects, and the combined effect estimates the mean effect of this distribution.

There are a variety of approaches to analyzing these types of data, as indicated by some recent literature. See Albert (1995) for various parametrizations of binomial models for discrete data within Bayesian settings. Chang et al. (2001) use a mixed effects model to investigate between- and within-study variation using rate difference and logit models. Gamalo et al. (2011) propose a Bayesian procedure for testing noninferiority in two-arm studies with a binary primary endpoint that allows the incorporation of historical data on an active control via informative priors, but they do not consider excessive zeros. Carlin (1992) considers a Bayesian meta-analysis approach for two-way contingency table data, while Smith et al. (1995) discuss how a full Bayesian analysis can be used to deal with issues in meta-analysis in a natural way using the BUGS language. In this paper, we consider a Bayesian approach for binary data with excessive zeros in two-arm trials. More specifically, we model the excessive zeros using the zero inflated binomial distribution and use the Dirichlet process (Ferguson 1974) to handle the heterogeneity among studies. There are various zero inflated methods available in the literature. Hall (2000) introduced the framework for count data with many zeros using Poisson and binomial models, and likelihood ratio test based inference for zero inflated Poisson models is discussed in Huang et al. (2014). A Bayesian inference framework for zero inflated Poisson regression models is discussed in Ghosh et al. (2006). A rich class of nonparametric Bayesian priors for study effects and a Bayesian nonparametric Polya tree mixture model are developed in Branscum and Hanson (2008) and Burr and Doss (2005).

In Section 2, we describe the Bayesian model specification used in the paper, including the likelihood function and the priors. Study effects have a Dirichlet process prior distribution for capturing heterogeneity among studies. We then obtain posterior summary statistics which describe key features of the model. In particular, posterior expectations are approximated through Markov chain Monte Carlo (MCMC) methods. In Section 3, the model is applied to a large dataset available in the literature (Nissen and Wolski 2007). We perform model selection using the log-pseudo marginal likelihood (LPML), comparing the Binomial and zero-inflated Binomial (ZIB) models. The results suggest that when the data have a high percentage of observed zeros, the ZIB model is the more appropriate model to use. Furthermore, the Dirichlet process has an advantage over the more commonly used random effects model with normally distributed random effects, whether based on the DerSimonian-Laird approach (DerSimonian and Laird 1986) or on a Bayesian approach using normal priors: its inherent clustering property causes studies with similar effects to cluster, and thus provides more robust estimates. We also test the approach using simulation studies in Section 4 and study the effect of excessive zeros in the ZIB models. We conclude with a short discussion in Section 5.

## Model development

*k* is the number of studies. Then the joint likelihood \(L=L\left (y_{T_{1}},\ldots,y_{T_{k}},y_{C_{1}},\ldots,y_{C_{k}}|\mu,P_{T},P_{C}\right)\) is

In random effects meta-analysis formulation, we assume that *P*_{T} and *P*_{C} follow logistic models, and define

\(P_{T_{i}} = \frac {exp\left \{\mu + Tr + \alpha _{i} + e_{i} \right \}}{1 + exp\left \{\mu + Tr + \alpha _{i} + e_{i} \right \}}\) and \(P_{C_{i}} = \frac {exp\left \{\mu + e_{i} \right \}}{1 + exp\left \{\mu + e_{i} \right \}}\).

*p*,

*μ* is the intercept, *Tr* is the treatment effect, and *α*_{i} and *e*_{i} are the study effects and error terms. As proposed by Muthukumarana and Tiwari (2016), we consider a Bayesian approach and assume that {*α*_{i}; *i*=1,…,*k*} is a sample from a Dirichlet process with concentration parameter *ρ* and baseline distribution *H*. We assume that the baseline distribution *H* is \(N\left (0,\sigma ^{2}_{H}\right)\). More specifically, we assume that

*a*, *b*, *c* and *d* are assumed to be known. We obtain the posterior characterizations of the parameters under the zero inflated binomial likelihood as follows.

where \(u_{j} = \left \{\begin {array}{ll} 1, & y_{T_{j}} = 0 \\ 0, & y_{T_{j}} = 1 \end {array}\right. \) and \(w_{j} = \left \{\begin {array}{ll} 1, & y_{C_{j}} = 0 \\ 0, & y_{C_{j}} = 1. \end {array}\right. \)

We investigate the suitability of the zero inflated binomial distribution using the log pseudo marginal likelihood (LPML; Gelfand et al. 1992) in Section 4.
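The logistic model and the zero inflated binomial distribution described in this section can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names and the numerical values are hypothetical, and the probability mass function is assumed to take the standard mixture form in which a structural zero occurs with probability *p*_{0} (treatment arm) and the count is otherwise binomial.

```python
import math


def inv_logit(x: float) -> float:
    """Inverse logit: exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def arm_probabilities(mu: float, tr: float, alpha_i: float, e_i: float) -> tuple[float, float]:
    """Event probabilities (P_T, P_C) for study i under the logistic model."""
    p_t = inv_logit(mu + tr + alpha_i + e_i)
    p_c = inv_logit(mu + e_i)
    return p_t, p_c


def zib_pmf(y: int, n: int, p: float, p_zero: float) -> float:
    """Zero inflated binomial pmf (assumed standard mixture form).

    With probability p_zero the arm is a structural zero; otherwise the
    count is Binomial(n, p).
    """
    binom = math.comb(n, y) * p**y * (1.0 - p) ** (n - y)
    if y == 0:
        return p_zero + (1.0 - p_zero) * binom
    return (1.0 - p_zero) * binom


# Hypothetical parameter values, chosen only for illustration.
p_t, p_c = arm_probabilities(mu=-3.4, tr=0.07, alpha_i=0.1, e_i=0.0)
study_or = (p_t / (1.0 - p_t)) / (p_c / (1.0 - p_c))
print(f"P_T={p_t:.4f}, P_C={p_c:.4f}, study odds ratio={study_or:.4f}")
print(f"P(Y_T = 0) with n=100, p_0=0.05: {zib_pmf(0, 100, p_t, 0.05):.4f}")
```

Under this parametrization the study-level odds ratio equals exp(*Tr* + *α*_{i}), because *μ* and *e*_{i} appear in both arms and cancel in the ratio of odds.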

## Data analysis

*y*_{−i} denotes the observation vector *y* with the *i*^{th} observation deleted. The model with larger value of LPML is preferred. The estimates of *μ*, *Tr* and the LPML values are given in Table 1. The LPML prefers binomial model over the ZIB model and the two models estimate the parameter *Tr* differently.
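As a rough sketch of how an LPML value can be obtained from MCMC output, the following uses one common Monte Carlo estimator: the conditional predictive ordinate of observation *i*, the density of *y*_{i} given *y*_{−i}, is approximated by the harmonic mean of the likelihood of *y*_{i} over the posterior draws, and LPML is the sum of the log ordinates. The text here does not show which estimator the authors used, so this choice, the function names and the toy numbers are all assumptions.

```python
import math


def log_cpo(log_liks: list[float]) -> float:
    """Log CPO for one observation from its log-likelihood at each MCMC draw.

    Harmonic-mean estimator: CPO_i ~= [ (1/M) * sum_m 1 / f(y_i | theta_m) ]^{-1},
    evaluated with a log-sum-exp shift for numerical stability.
    """
    m = len(log_liks)
    neg = [-v for v in log_liks]
    top = max(neg)
    log_mean_inverse = top + math.log(sum(math.exp(v - top) for v in neg)) - math.log(m)
    return -log_mean_inverse


def lpml(log_lik_matrix: list[list[float]]) -> float:
    """LPML = sum_i log CPO_i; rows are observations, columns are MCMC draws."""
    return sum(log_cpo(row) for row in log_lik_matrix)


# Toy log-likelihood values (hypothetical): 3 observations, 4 posterior draws.
toy = [
    [-1.2, -1.1, -1.3, -1.2],
    [-0.4, -0.5, -0.4, -0.6],
    [-2.0, -1.8, -2.2, -1.9],
]
print(f"LPML = {lpml(toy):.4f}")
```

Computing this quantity for two competing models on the same data gives the comparison described above, with the larger value preferred.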

Parameter estimates with each model along with LPML

Parameter | Myocardial infarction: Binomial model | Myocardial infarction: ZIB model | Cardiovascular causes: Binomial model | Cardiovascular causes: ZIB model |
---|---|---|---|---|
*Tr* | 0.0394 (0.0277) | 0.0709 (0.0503) | 0.0709 (0.0503) | 0.1235 (0.0876) |
*μ* | -1.1989 (0.3945) | -3.3612 (0.4870) | -3.3612 (0.4870) | -4.5339 (0.5707) |
LPML | -173.5474 | -179.7584 | -156.3964 | -125.0447 |

*μ*, *Tr* and the LPML values are given in Table 1. In this case, the LPML strongly prefers the ZIB model over the binomial model. This is in agreement with the fact that there is a large amount of excessive zeros in the data on death from cardiovascular causes relative to myocardial infarction.

*p*_{0} and *q*_{0} on the analysis. The graphical posterior summaries of *p*_{0} and *q*_{0} on myocardial infarction and cardiovascular causes are given in Figs. 7, 8, 9 and 10. In addition, the numerical posterior summaries of *p*_{0} and *q*_{0} are given in Table 2. It is clear that the posterior distributions of *p*_{0} and *q*_{0} and their numerical summaries for myocardial infarction and cardiovascular causes make sense with respect to the percentages of zeros in the data. We also consider a *Beta*(0.5,0.5) prior on *p*_{0} and *q*_{0} in order to investigate the prior sensitivity. The numerical posterior summaries of *p*_{0} and *q*_{0} under the *Beta*(0.5,0.5) prior are given in Table 3. We notice a change in the magnitude of the estimates of *p*_{0} and *q*_{0} in this case, but the estimates of the primary parameters *μ* and *Tr* are very close, indicating that the odds ratios are not sensitive to the choice of prior settings. This indicates that inference on *p*_{0} and *q*_{0} will be sensitive to the choice of priors, so one should select these priors carefully based on application-specific a priori knowledge of the zero inflated parameters.

Posterior mean and standard deviation (in parentheses) of *p*_{0} and *q*_{0}

Parameter | Myocardial infarction | Cardiovascular causes |
---|---|---|
*p*_{0} | 0.0495 (0.044) | 0.231 (0.117) |
*q*_{0} | 0.27 (0.118) | 0.566 (0.137) |

Posterior mean and standard deviation (in parentheses) of *p*_{0} and *q*_{0} under *Beta*(0.5,0.5) prior distribution

Parameter | Myocardial infarction | Cardiovascular causes |
---|---|---|
*p*_{0} | 0.0254 (0.036) | 0.179 (0.125) |
*q*_{0} | 0.247 (0.125) | 0.538 (0.161) |
*Tr* | 0.073 (0.051) | 0.118 (0.084) |
*μ* | -3.357 (0.444) | -4.707 (0.632) |

*μ* under binomial model on Myocardial Infarction are given in Fig. 11. The trace plot appears to stabilize immediately and hence provides no indication of lack of convergence in the Markov chain. The autocorrelation plot also appears to dampen quickly. Trace plots of study effects on Myocardial Infarction are given in Fig. 12. The trace plots of study effects on death from cardiovascular causes indicate similar behavior. Similar plots were obtained for all of the parameters under each model and provide the evidence of the convergence of the Markov chains.

*α*_{i} in place of the DP prior. We now re-analyze the data assuming that the study effects arise from a \(N\left (0,\sigma ^{2}_{H}\right)\) prior distribution. We remark that this is the baseline distribution of the DP prior in (2). In this case, forest plots of odds ratios for each model are given in Fig. 13. The estimates of the primary parameters of interest and the LPML values are given in Table 4. The LPML model selection criterion clearly indicates that the DP prior in (2) is superior to the conventional parametric prior.

Parameter estimates for various models under \(N\left (0,\sigma ^{2}_{H}\right)\) prior on study effects

Model | *Tr* | *μ* | LPML | Overall odds ratio with 95% C.I. |
---|---|---|---|---|
Myocardial - Bin | 0.039 (0.028) | -1.148 (0.369) | -179.5883 | 1.04 (0.979, 1.101) |
Myocardial - ZIB | 0.073 (0.052) | -3.311 (0.435) | -182.5765 | 1.07 (0.961, 1.194) |
Cardiovascular - Bin | 0.036 (0.025) | -1.892 (0.427) | -161.3744 | 1.04 (0.983, 1.094) |
Cardiovascular - ZIB | 0.123 (0.089) | -4.585 (0.568) | -125.5402 | 1.13 (0.917, 1.356) |

*p*_{0}, *q*_{0} and the overall odds ratio (OR). For example, the treatment can be declared to be safer than the control if *OR* ≤ 1 and *p*_{0} > *q*_{0}. Also notice that the estimates of (*p*_{0}, *q*_{0}) are independent of the odds ratio because the counts cannot be in "true" zero arms and "Binomial" arms at the same time. We combine the two metrics, the conditional OR and (*p*_{0}, *q*_{0}), to come up with an overall unconditional odds ratio. We define it to be modified odds ratio = OR × (1 − *p*_{0})/(1 − *q*_{0}). Note that when *p*_{0} = *q*_{0}, the modified odds ratio is the same as OR. If *p*_{0} > *q*_{0}, this adjusts OR by multiplying it by a factor less than 1, and if *p*_{0} < *q*_{0}, it adjusts OR by multiplying it by a factor greater than 1. This factor, *h*(*p*_{0}, *q*_{0}) = (1 − *p*_{0})/(1 − *q*_{0}), is the ratio of the probabilities of observing Bernoulli counts in the two arms, and can be considered as the odds for observing Bernoulli counts in the two arms.

In the frequentist setup, \(h(\hat {p}_{0},\hat {q}_{0})\) is independent of \(\hat {\mu }\), and hence independent of the conditional odds ratio. In fact, \(\hat {p}_{0}\) and \(\hat {q}_{0}\) converge to *p*_{0} and *q*_{0} with probability 1, and hence \(h(\hat {p}_{0},\hat {q}_{0})\) also converges to *h*(*p*_{0}, *q*_{0}) with probability 1, as *h* is a continuous function (by the continuous mapping theorem). So, the estimated modified odds ratio is a consistent estimator of the unconditional odds ratio defined as *OR* × (1 − *p*_{0})/(1 − *q*_{0}). We provide the estimates of the modified odds ratio for various models in Table 5. As the estimate of *p*_{0} is less than that of *q*_{0} for both examples (myocardial infarction and cardiovascular causes), the modified OR values are higher than the corresponding OR values.

Modified odds ratios, standard deviations (in parentheses) and credible intervals under DP and normal priors

Model | Modified odds ratio | 95% credible interval |
---|---|---|
Myocardial - DP Prior | 1.448 (0.277) | (1.05, 2.11) |
Myocardial - Normal Prior | 1.446 (0.275) | (1.05, 2.11) |
Cardiovascular - DP Prior | 2.209 (0.830) | (1.15, 4.27) |
Cardiovascular - Normal Prior | 2.224 (0.837) | (1.16, 4.33) |
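The modified odds ratio defined in this section, OR × (1 − *p*_{0})/(1 − *q*_{0}), can be computed as in the sketch below. The function names and the input values are hypothetical. The second function assumes that a posterior summary is formed by applying the formula to each MCMC draw and then averaging, which is one natural way to obtain a posterior mean and which generally differs from plugging posterior means into the formula; the text does not state how the entries of Table 5 were computed.

```python
def modified_odds_ratio(odds_ratio: float, p0: float, q0: float) -> float:
    """Modified odds ratio: OR * (1 - p0) / (1 - q0)."""
    if not (0.0 <= p0 < 1.0 and 0.0 <= q0 < 1.0):
        raise ValueError("p0 and q0 must lie in [0, 1)")
    return odds_ratio * (1.0 - p0) / (1.0 - q0)


def posterior_mean_modified_or(or_draws, p0_draws, q0_draws) -> float:
    """Average the modified odds ratio over paired MCMC draws (assumed workflow)."""
    values = [
        modified_odds_ratio(o, p, q)
        for o, p, q in zip(or_draws, p0_draws, q0_draws)
    ]
    return sum(values) / len(values)


# Hypothetical point values: p0 < q0, so the adjustment factor exceeds 1.
plug_in = modified_odds_ratio(odds_ratio=1.07, p0=0.05, q0=0.27)
print(f"plug-in modified OR = {plug_in:.3f}")

# Hypothetical posterior draws for the same three quantities.
draw_mean = posterior_mean_modified_or(
    or_draws=[1.00, 1.07, 1.15],
    p0_draws=[0.02, 0.05, 0.10],
    q0_draws=[0.20, 0.27, 0.40],
)
print(f"mean over draws     = {draw_mean:.3f}")
```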

## Results from simulation studies

*p*_{0} and *q*_{0} in the model, different simulation studies were carried out. For this purpose, we generate random ZIB values with empirical binomial parameters. We first generate 42 pairs of independent binary (0 and 1) variables from Bernoulli(*p*_{0}) and Bernoulli(*q*_{0}), where (*p*_{0}, *q*_{0}) is taken from the set of values {(0.1,0.1),...,(0.9,0.9)}. We then assign the true zeros at the places with 1s, and generate binomial outcomes from \(B(\bar {n}_{T}, \hat {P}_{T_{i}})\) and from \(B(\bar {n}_{C}, \hat {P}_{C_{i}})\), where \(\bar {n}_{T}, \bar {n}_{C}, \hat {P}_{T_{i}}\) and \(\hat {P}_{C_{i}}\) are empirical estimates. Then, the MCMC sampling scheme described in Section 2 was carried out using *R* to obtain the posterior estimates of *p*_{0} and *q*_{0}. This was done 1000 times for each pair to obtain the mean and standard error of each estimate. For various scenarios of excessive zeros, the results are given in Table 6.

The results indicate that when the true value of *p*_{0} is small and the observed proportion of zeros in the simulated data in the treatment arm (control arm) is also small (large), the estimated values of *p*_{0} and *q*_{0} are also small (large), whereas when the values of *p*_{0} and *q*_{0} are large, the simulated data have a large proportion of zeros in both arms, which results in large estimated values of *p*_{0} and *q*_{0}. In both situations, the estimated values of *p*_{0} and *q*_{0} are in conformity with the observed percentages of zeros in the simulated data. The estimates of *p*_{0} and *q*_{0} remain high in spite of their true choices from the parameter values. Note that our primary interest is in the study effects *α*_{i} and in the treatment arm, not in the control arm, so we may not need to investigate *q*_{0} in detail, as it can be treated as a nuisance parameter. In practice, one should have good a priori knowledge of *q*_{0}, which allows an informative prior to be assigned, as it reflects the zeros in the control arm. This indicates that the use of ZIB is more appropriate when there are excessive zeros in the data.

Simulation studies for the myocardial infarction data

Initial pair (*p*_{0}, *q*_{0}) | Mean of the posterior means of *p*_{0} | Standard error of *p*_{0} | Mean of the posterior means of *q*_{0} | Standard error of *q*_{0} |
---|---|---|---|---|
(0.1,0.1) | 0.19888206 | 0.0907763 | 0.52069774 | 0.08299897 |
(0.2,0.2) | 0.29206103 | 0.10530334 | 0.59059405 | 0.05131143 |
(0.3,0.3) | 0.33836514 | 0.09115085 | 0.62706227 | 0.06062431 |
(0.4,0.4) | 0.3587838 | 0.10839263 | 0.63652 | 0.06102938 |
(0.5,0.5) | 0.4977337 | 0.1267577 | 0.7200964 | 0.04900527 |
(0.6,0.6) | 0.6180761 | 0.1255134 | 0.7705962 | 0.06610487 |
(0.7,0.7) | 0.6525938 | 0.1885614 | 0.8096615 | 0.07434429 |
(0.8,0.8) | 0.796206 | 0.09240875 | 0.8696273 | 0.06536744 |
(0.9,0.9) | 0.8958989 | 0.0501579 | 0.935196 | 0.03310256 |
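The data-generating step of the simulation described in this section can be sketched as follows for a single arm. This is an illustration under stated assumptions, not the authors' R code: the arm size and the event probability are hypothetical stand-ins for the empirical estimates, the same event probability is used for every study for brevity, and the binomial draw is built from Bernoulli trials so that only the standard library is needed.

```python
import random


def simulate_zib_arm(n_studies: int, n_per_study: int, event_prob: float,
                     p_zero: float, rng: random.Random) -> list[int]:
    """Simulate event counts for one arm across studies under a ZIB model."""
    counts = []
    for _ in range(n_studies):
        if rng.random() < p_zero:
            # Structural ("true") zero: no event can occur in this arm.
            counts.append(0)
        else:
            # Binomial(n_per_study, event_prob) as a sum of Bernoulli draws.
            counts.append(sum(rng.random() < event_prob for _ in range(n_per_study)))
    return counts


rng = random.Random(2024)  # fixed seed so the run is reproducible
# Hypothetical settings: 42 studies, 300 subjects per arm, event probability 0.005.
treatment = simulate_zib_arm(42, 300, 0.005, p_zero=0.1, rng=rng)
share_of_zeros = sum(c == 0 for c in treatment) / len(treatment)
print(f"observed share of zero counts: {share_of_zeros:.3f}")
```

The observed share of zeros mixes structural zeros with zeros produced by the binomial part, which is the distinction the zero inflation parameters are meant to capture.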

## Discussion

Binary data naturally arise in clinical trials in health sciences. In some cases, they arise with excessive zeros. In this paper, we have provided a random effects meta analysis approach for binary data with excessive zeros in two-arm trials. The suitability of the binomial and zero inflated binomial models was assessed in the presence of a Dirichlet process as the prior for the study effects. The approach can be used as a template for meta analysis of binary data, and a user may choose the proper model using the log pseudo marginal likelihood. We have shown that our approach is superior to the DerSimonian-Laird random effects model when there is heterogeneity among studies, and that the LPML model selection criterion can be used to select the best model among the Bayesian models (not including the DerSimonian-Laird model) for a given data set.

The Bayesian approaches discussed in this paper allowed us to incorporate the zero-event studies in the likelihood, and we found that the point estimates of the overall odds ratio from these methods were lower than the estimates reported in the literature (Nissen and Wolski 2007). The purpose of the ZIB model was to separate the excessive zeros, that is, the studies where the events could not occur, from the (Binomially) modeled zeros where zero events occurred. Note that under ZIB, some zeros are observed with probability *p*_{0} and some arise from the Binomial model, making the probability of a zero event *p*_{0}+(1−*p*_{0})(1−*P*_{T})^{n_{T}} in the treatment arm. With the ZIB model, the Bayes estimates of the odds ratio were slightly higher than with the Binomial model, but they were still lower than the results from the DerSimonian-Laird random effects model and the estimates in Nissen and Wolski (2007). Note also that the DP model, being discrete with probability 1, has a clustering property whereby study effects that are alike fall in the same cluster. We also investigated the suitability of the DP prior relative to the conventional parametric normal prior on the study effects. The LPML model selection indicated that the DP prior is superior to the conventional parametric normal prior. Finally, as the results from the ZIB model on the parameters *p*_{0}, *q*_{0} and OR need to be interpreted together, a modified OR was introduced.
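The clustering property of the Dirichlet process mentioned above can be illustrated by drawing study effects from the prior with the Pólya urn (Blackwell-MacQueen) scheme. This is a sketch of the prior only, not of the posterior sampler used in the paper; the function name and the values of the concentration parameter *ρ* and of the baseline standard deviation are hypothetical.

```python
import random


def sample_dp_effects(k: int, rho: float, sigma_h: float,
                      rng: random.Random) -> list[float]:
    """Draw k study effects from a DP(rho, N(0, sigma_h^2)) prior via the Polya urn."""
    effects: list[float] = []
    for i in range(k):
        # With probability rho / (rho + i) draw a fresh value from the baseline H;
        # otherwise copy one of the i earlier effects chosen uniformly at random.
        if rng.random() < rho / (rho + i):
            effects.append(rng.gauss(0.0, sigma_h))
        else:
            effects.append(rng.choice(effects))
    return effects


rng = random.Random(7)  # fixed seed so the run is reproducible
alphas = sample_dp_effects(k=42, rho=1.0, sigma_h=1.0, rng=rng)
n_clusters = len(set(alphas))
print(f"{n_clusters} distinct effect values among {len(alphas)} studies")
```

Because earlier values are reused with positive probability, ties occur and studies sharing a value form a cluster; a larger *ρ* produces more distinct values, approaching independent draws from the baseline normal distribution.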

As a future direction of research, we would like to extend the approach discussed in this article for ordinal category data. For example, in some applications, the clinical trial end point could be a response variable in an ordinal scale with multiple categories such as Good/Moderate/Critical etc. This type of ordinal response data can be viewed as multivariate responses arising from continuous latent variables with cut-points. We assume that there is a continuous latent outcome behind these ordinal outcomes such that *X*_{i}=(*X*_{i1},…,*X*_{im})^{′}∼Normal(*μ*,*Σ*) where *X*’s are the latent outcomes and *m* is the number of ordinal categories. Then the latent variables *X*_{ij}’s can be converted to the observed *Y*_{ij} using a cut-point vector *λ*. However the choice of cut-points and their priors need to be carefully selected as there are two arms and the counts on categories could be sparse. In this case, one can consider an objective Bayes approach following the development in Bayarri et al. (2008). Yet another extension of the proposed model is where there are multinomial data with some particular cell(s) being observed excessively. This kind of data may arise from trials with patient reported outcomes.

## Notes

### Acknowledgments

The authors thank the Editor-in-Chief and three anonymous reviewers whose comments helped to improve the manuscript. This article reflects the views of the authors and should not be attributed to FDA’s views or policies.

### Authors’ contributions

All authors have contributed equally to the work and approved the final version of the paper.

### Funding

Muthukumarana’s research has been partially supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada. Martell’s research internship was funded by Mitacs Globalink program.

### Competing interests

The authors declare that they have no competing interests.

## References

- Albert, J.: Teaching Inference about Proportions Using Bayes and Discrete Models. J. Stat. Educ. 3 (1995). https://doi.org/10.1080/10691898.1995.11910494.
- Bayarri, M. J., Berger, J. O., Datta, G. S.: Objective Bayes testing of Poisson versus inflated Poisson models. Inst. Math. Stat. 3, 105–121 (2008).
- Branscum, A. J., Hanson, T. E.: Bayesian nonparametric meta-analysis using Polya tree mixture models. Biometrics. 64, 825–833 (2008).
- Burr, D., Doss, H.: A Bayesian semiparametric model for random-effects meta-analysis. J. Am. Stat. Assoc. 100, 242–251 (2005).
- Carlin, J. B.: Meta-analysis for 2 × 2 tables: A Bayesian approach. Stat. Med. 11, 141–158 (1992).
- Chang, B. H., Waternaux, C., Lipsitz, S.: Meta-analysis of binary data: which within study variance estimate to use? Stat. Med. 20, 1947–1956 (2001).
- DerSimonian, R., Laird, N.: Meta-analysis in clinical trials. Control. Clin. Trials. 7, 177–188 (1986).
- Ferguson, T. S.: Prior distributions on spaces of probability measures. Ann. Stat. 2, 615–629 (1974).
- Gamalo, M., Wu, R., Tiwari, R.: Bayesian approach to noninferiority trials for proportions. J. Biopharm. Stat. 21, 902–919 (2011).
- Geisser, S.: Predictive Inference: An Introduction. Chapman and Hall, London (1993).
- Gelfand, A. E., Dey, D. K.: Bayesian Model Choice: Asymptotics and Exact Calculations. J. R. Stat. Soc. Ser. B. 56, 501–514 (1994).
- Gelfand, A. E., Dey, D. K., Chang, H.: Model determination using predictive distributions with implementation via sampling-based methods (with discussion). *Bayesian Statistics 4* (Bernardo, J. M., Berger, J. O., Dawid, A. P., Smith, A. F. M., eds.). Oxford University Press (1992).
- Ghosh, S. K., Mukhopadhyay, P., Lu, J. C.: Bayesian analysis of zero-inflated regression models. J. Stat. Plan. Infer. 136(4), 1360–1375 (2006).
- Hall, D. B.: Zero-Inflated Poisson and Binomial Regression with Random Effects: A Case Study. Biometrics. 56, 1030–1039 (2000).
- Huang, L., Zheng, D., Zalkikar, J., Tiwari, R.: Zero-inflated Poisson model based likelihood ratio test for drug safety signal detection. Stat. Methods Med. Res. (2014). https://doi.org/10.1177/0962280214549590.
- Muthukumarana, S., Tiwari, R.: Meta-analysis using Dirichlet process. Stat. Methods Med. Res. 25(1), 352–365 (2016).
- Neal, R. M.: Markov Chain Sampling Methods for Dirichlet Process Mixture Models. J. Comput. Graph. Stat. 9(2), 249–265 (2000).
- Nissen, S. E., Wolski, K.: Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. New Eng. J. Med. 356, 2457–2471 (2007).
- Smith, T. C., Spiegelhalter, D. J., Thomas, A.: Bayesian approaches to random-effects meta-analysis: a comparative study. Stat. Med. 14, 2685–2699 (1995).

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.