Continuity Corrected Wilson Interval for the Difference of Two Independent Proportions

Confidence interval for the difference of two proportions has been studied for decades. Many methods were developed to improve the approximation of the limiting distribution of test statistics, such as the profile likelihood method, the score method, and the Wilson method. For the Wilson interval developed by Beal (Biometrics 43:941, 1987), the approximation of the Z test statistic to the standard normal distribution may be further improved by utilizing the continuity correction, in the observation of anti-conservative intervals from the Wilson interval. We theoretically prove that the Wilson interval is nested in the continuity corrected Wilson interval under mild conditions. We compare the continuity corrected Wilson interval with the commonly used methods with regards to coverage probability, interval width, and mean squared error of coverage probability. The proposed interval has good performance in many configurations. An example from a Phase II cancer trial is used to illustrate the application of these methods.


Introduction
In a randomized parallel study to compare two treatments, sample sizes for each group are often pre-specified. When the outcome is binary (e.g., response or not), risk difference (RD) is frequently used as the parameter of interest to evaluate the effectiveness of a new treatment as compared to the gold standard. In addition to p-value, confidence interval for RD is always required in the report [1,2]. This classical and important statistical problem has been studied for decades [3,4] and many methods were developed to construct confidence intervals for RD between two independent groups [5,6].
Newcombe [7] evaluated 11 methods to construct confidence intervals for RD between two independent groups. The traditional Wald method utilizes the Z test statistic with a traditional variance estimate of RD. This approach is easy to implement, but its performance highly relies on whether the true distribution of the Z test statistic is close to the normal distribution, which may not be the case with small sample sizes or unbalanced sample sizes or extreme rates. To improve the performance of intervals for RD, other methods were developed, including the profile likelihood method based on the asymptotic limiting distribution of the log-likelihood ratio test [8]. Miettinen and Nurminen [8] developed a score interval by using the estimated rates under the null hypothesis. Beal [9] proposed a Wilson interval by solving the confidence interval from a quadratic equation. Newcombe [7] utilized the individual Wilson intervals from each group to construct a hybrid interval for RD.
When sample size is small, exact methods could be alternatively used to control for the coverage probability. Exact methods compute the confidence interval by using a test statistic to order the sample space which is the collection of all possible samples. For a study with sample sizes of n 1 and n 2 in the first group and the second group, the size of the sample space is (n 1 + 1)(n 2 + 1) . The estimated RD was used by Santner and Snell [10] to order the sample space. Later, Chan and Zhang [11] proposed using the standardized Z test with a constrained maximum likelihood estimation (MLE) of the variance to order the sample space, and found that their exact approach outperforms the exact interval based on the estimated RD. In addition to RD and the Z test statistic, other measures could be used, such as the asymptotic lower and upper limits. Recently, Wang proposed the exact limit for RD based on a stochastic ordering of the sample size to improve the exact approaches based on existing test statistics. That limit is computationally intensive, although it is associated with good statistical properties. Exact interval is often criticized for its conservativeness with regards to coverage probability and computational resources needed in calculation.
The rest of the article is organized as follows. In Sect. 2, we first give a brief introduction of the existing intervals for the difference of two independent proportions. We then propose the continuity corrected Wilson interval. In Sect. 3, we compare the performance of these intervals using extensive simulations, with regards to coverage probability, interval width, and mean squared error of coverage probabilities. A Phase II study from a cancer trial is used to illustrate the application of these intervals. Lastly, we provide some comments.

Methods
Many statistical methods have been developed for confidence intervals for RD between two independent groups with the parameter of interest: = p 2 − p 1 , where p i is the rate of the i-th group (i=1, 2). We first introduce five commonly used and recommended confidence interval methods to construct a 100(1 − )% confidence interval, and then propose a new interval based on the Wilson approach with continuity correction for RD [12]. Suppose x i is the observed number of responses in the i-th group ( i = 1, 2 ). Then, the response rate in the i− th group can be estimated as The Wald confidence interval is calculated from the estimated variance of ̂ as: where ̂=p 2 −p 1 , and z ∕2 is the (1 − ∕2) th quantile of the standard normal distribution (e.g., z 0.05∕2 = 1.96 when = 0.05 ). This Wald confidence interval can be computed by using the R function, prop.test. Newcombe [7] developed a new confidence interval for RD based on the onesample Wilson confidence intervals using data from each group. Suppose ( l i , u i ) is the Wilson confidence interval for p i in the i-th group (i=1, 2). These individual Wilson intervals for each group can be computed by using the R function scoreci from the R package PropCIs. We refer to this as the Newcombe's hybrid (NH) interval, which is calculated as: The third interval is the one proposed by Miettinen and Nurminen [8] (referred to be as the MN interval). The MN method computes interval by inverting the score test as where = (n 1 + n 2 )∕(n 1 + n 2 − 1) and the detailed formulas for p 1 and p 2 may be found in Miettinen and Nurminen [8] and Newcombe [7]. Under the null hypothesis with = 0 , p 2 is equal to p 1 . The MN interval can be computed by using the R function scoreci from the R package ratesci.
Profile likelihood (PL) confidence interval is also commonly used in practice. It is based on the log-likelihood function as the collection of satisfying where p i is a function of , and p i can be obtained from the MN method. We name this interval as the PL interval. This is a one-parameter search problem for each limit, and the R function uniroot is used to calculate the lower limit and the upper limit of the PL interval. It should be noted that p i and 1 −p i are in the denominator of the likelihood function. For that reason, the left part of the equation will not be defined when p i = 0 or 1. To overcome that challenge, Newcombe [7] presented the detailed solutions for these special situations.
Beal [9] proposed a Wilson interval by solving the confidence interval for from a quadratic equation (referred to be as the WI interval). The confidence interval for is the collection of satisfying: The two parameters p 1 and p 2 can be re-parametrized with and ( = p 2 + p 1 ), where p 1 = ( − )∕2 and p 2 = ( + )∕2 . Here, is treated as the nuisance parameter. Equation (1) can be rewritten as: After some straightforward algebra from a quadratic equation ( u 2 + v + w = 0 ), the 100(1-)% WI interval for is where and where ̂=p 2 +p 1 .
The performance of the WI interval depends on how close is the sampling distribution of the statistic on the left side of Equation (1) to the standard normal distribution. To improve that approximation, we propose adapting the continuity correction [13] to construct a new confidence interval for .

3
The Wilson confidence interval with continuity correction is defined as the collection of values satisfying where N = n 1 + n 2 is the total sample size in a study. Similar to the WI interval, the WIC interval is calculated as We present the following theorem for the relationship between the WI interval and the WIC interval. (4). Specifically, when p 1 and p 2 do not take boundary values of 0 and We start the proof of this theorem with the following Lemma.

Lemma 2.1 Under conditions of Theorem
Proof The Wilson lower limit L and its upper limit U are two solutions of the equation u 2 + v + w = 0 , which is equivalent to . By proof of contradiction, we are going to show that It is easy to show the bounds of ̂+L 2 and ̂−L 2 as By using the non-boundary value assumption for p 1 and p 2 , we have It is always true that 1 using nearly the same argument as above, which implies that U >p 2 −p 1 + 1 N . ◻ The results from Lemma 2.1. are used to prove Theorem 2.1.
where the last inequality follows from the Lemma result. Therefore, f ( ) has a zero between −∞ and L. Since L c is the smaller solution of f ( ) = 0 , we have L c < L. Similarly, we let g( ) = u 2 + v 2 + w 2 . Because uU 2 + vU + w = 0 , we have Therefore, g( ) has a zero between U and ∞ . Since U c is the larger solution of g( ) = 0 , we have U < U c . ◻ Theorem 2.1 is true under the condition that x i is not on the boundary ( x i = 0 or n i ). When x 1 = 0 and x 2 = n 2 , the estimated U c could be less than U from the WI interval. Similar results are observed when x 1 = n 1 and x 2 = 0.
When x 1 = x 2 = 0 , the WI interval can be computed with the lower limit or the upper limit being zero for an unbalanced study, and zeros for both limits when n 1 = n 2 . In such case, the WIC interval can't estimated or at least one of the limits is not estimable because a negative value of v 2 1 − 4uw 1 or v 2 2 − 4uw 2 . In such cases, the WIC limits are assumed to be the same as the WI limits. Similar arrangements will be made when x 1 = n 1 and x 2 = n 2 . It should be noted that these four cases are very rare in practice.

Results
We conduct extensive simulation studies to evaluate the performance of these intervals for a study with sample sizes from 40 to 5000 in each group n i =(40, For a study with design parameters, ( n 1 , n 2 , p 1 , p 2 ), we simulate S =20,000 data points X s = (x 1s , x 2s ) , where x is follows a binomial distribution with parameters n i and p i , and s = 1, 2, ⋯ , S . The aforementioned six methods are then used to calculate confidence intervals, CI(X s ) , for each simulated data.

Coverage probability
We first compare the coverage probability (CP) from S = 20, 000 simulations for each given design parameter ( n 1 , n 2 , p 1 , p 2 ): where I() is the index function with I() = 1 when ∈ CI(X s ) , and 0 otherwise. The computed CP can be viewed as the weighted CP because data are simulated from binomial distributions, not enumerated all possible data with equal probability. The coverage probability comparison is presented in Fig. 1 when p 1 = 10% and 80%. The Wald interval does not have satisfactory coverage probability when sample size is small, or a study with very unbalanced sample sizes (e.g., n 2 ≫ n 1 ), or is very large. All other methods have good coverages when is small. It should be noted that the proposed WIC method often has a higher coverage probability as compared to others when sample size is not too large (e.g., 5000) or too small (e.g. 40). In the majority of these cases, the average coverage probabilities of the WIC method are slightly over 95%. When is increased to 85% for the cases in the second column of the figure, the MN method and the NH method have better coverages than others when n 2 is small. The PL method could have the coverage probability below 90% for a study with very unbalanced sample sizes. The WIC method could be slightly conservative as compared with other when sample sizes are small, with the coverage probability being higher than the nominal level. When p 1 = 80% , the findings are similar to the cases when p 1 = 10% and = 25% . The proposed WIC method has better coverage probability CP(n 1 , n 2 , p 1 , Fig. 1 Coverage probability at the nominal level of 95% when p 1 is small and large than the WI interval, except the cases with small sample sizes in both groups where the WIC interval is conservative with the average coverage probability close to 96% while the WI interval has the coverage close to 95%. We also present the coverage probability in Figure 2 when p 1 is between 20% and 60%. The coverage probability of the Wald interval is closer to 95% as compared the Wald interval in Fig. 1 when p 1 is small, but the Wald interval is still anti-conservative with the actual coverage below the nominal level in many configurations (e.g., a study with unbalanced sample sizes). The proposed WIC interval often has the coverage probability higher than others when sample size is not too large, although it could be slightly anti-conservative. When sample sizes in both groups are large enough, all methods have similar coverage probabilities.
For a given set of ( p 1 , p 2 ), there are a total of 81 combinations of sample sizes (n 1 , n 2 ) . Let Ω be the space of all combinations of n 1 and n 2 . The coverage probabilities of these cases, CP(n 1 , n 2 |p 1 , p 2 ) , are used to calculate the proportion of guaranteed coverage probability from these 81 cases: where |Ω| = 81 is the size of Ω. A method with a high proportion of guaranteed coverage probability is preferable. In Fig. 3, we show that proportion for each given (p 1 , ) . The Wald interval has the lowest proportion, followed by the PL method and the WI method. The remaining three methods (WIC, MN, and NH) often have higher guaranteed coverage probability than others. The NH method often performs the best when p 1 is small (e.g., 1% and 5%). As p 1 goes up, the proposed WIC method has the highest guaranteed coverage probability. When p 1 is large (e.g., 80%), the NH method could be slightly better than the other two methods.
We plot detailed coverage probability comparison between the NH method and the WIC method in Fig. 4 when p 1 = 0.05 and = 0.1 . The average difference of coverage probability between these two methods is 17%. We add the label of the minimum sample size ( min(n 1 , n 2 ) ) in the figure, and find that the 17% difference is mainly due to the cases with small sample sizes (e.g., min(n 1 , n 2 ) = 40 ). After studies with min(n 1 , n 2 ) = 40 are removed, their difference is reduced to 4% (73.5% for the WIC interval VS 77.6% for the NH interval). On the right side of the figure, we compare the coverage probability between the WIC method and the MN method when p 1 = 40% and = 25% . As p 1 is medium to large, the WIC method is often the best method with the highest guaranteed coverage probability.

Mean squared error
Mean squared error (MSE) is also used to compare the performance of these methods: When a method has a small MSE, its coverage probabilities are generally close to the nominal level. We present the computed MSE in Fig. 5. It can be seen that MSE is much larger for the cases when p 1 = 0.1% . The WI method has the largest MSE when p 1 is very small. As p 1 is increased to 5%, the PL method is the worst. When p 1 is medium to large (e.g., between 20% and 80%), the WIC method often has the largest MSE, followed by the PL method, although the difference in MSE between these methods is very small. In such cases, the WIC method has the highest guaranteed coverage probability which leads to conservative confidence intervals in some settings. These conservative cases increase the guaranteed coverage probability, meanwhile they increase the MSE.

Interval width
We then compare the interval width from S simulations for each ( n 1 , n 2 , p 1 , p 2 ): where L() and U() are the lower limit and the upper limit of a confidence interval. In Fig. 6, we show the interval width of the six methods when p 1 = 10% , 40%, and 80%, and n 1 = 2000 , 500, and 40. It can be seen that these methods have similar interval widths when sample sizes are not too small in both group. When p 1 is small (e.g., 10%) and n 2 is fewer than n 1 , the MN method and the NH method have wider intervals than others. When the sample size in the control group is small (e.g., n 1 = 40 ), as n 2 increases, the WIC interval becomes shorter than the NH interval the MN interval. As p 1 is increased to 40%, the NH method and the MN method are better than others when n 1 is small. When p 1 is very large (e.g., 80%), all methods have similar interval width, except the configurations with small sample sizes in both groups. In such cases, the WI method and the WIC method have slight shorter width than others.

Examples
A randomized phase II clinical trial is used to illustrate the application of the methods considered in this article. That was a trial to test whether pazopanib has efficacy comparable to doxorubicin in elderly patients with soft tissue sarcoma (STS), where doxorubicin is a standard of care [14]. In that trial, 39 patients were treated with doxorubicin n 1 = 39 , while there were 81 patients in the pazopanib group ( n 2 = 81 ). The observed responses were x 1 = 6 and x 2 = 10 in the doxorubicin group and the pazopanib group, respectively. The estimated response rates and the computed confidence intervals for = p 2 − p 1 are presented in Table 1.
As expected, the WIC interval is slightly wider than the WI interval. The PL interval is similar to the WIC interval. The NH interval has the longest width among these intervals, although the difference is small. The WI interval and the WIC interval have shorter width than the MN interval, the NH interval, and the PL interval. If this was a non-inferiority test problem with the pre-specified margin of -10%, the Wald method is the only method that fails to conclude non-inferiority as its lower limit is less than the non-inferiority margin.

Discussion
In the observation of the anti-conservative coverage probability from the WI interval for RD, we propose the continuity corrected Wilson interval to improve the coverage probability. In addition to the coverage probability improvement, the guaranteed coverage probability using the WIC interval is almost doubled as compared to that of the WI interval [15][16][17]. The proposed WIC interval is recommended to be used when a study has the rates from medium to large.
When sample size is small, exact approaches could be used to have guaranteed coverage probability. For exact intervals, the exact lower limit and the exact upper limit are computed separately by using either one test statistic or two different test statistics in the sample space ordering [18][19][20][21]. For a 100(1-)% two sided interval, the nominal level for a one-sided limit is 100(1 − ∕2)% . When both lower and upper limits control for the nominal level, the final two-sided intervals could be conservative with the actual coverage probability much above 100(1-)%. That leads to a high proportion of guaranteed coverage, but the MSE would be large as expected. Alternatively, Bayesian credible intervals may be considered [22][23][24][25]. Under the Table 1 A Phase II clinical trial to compare the response rates between two treatmentŝ Bayesian setting, the observed number of responses is assumed to be a random variable that follows a pre-specified prior distribution.
We will extend the WIC method to other test statistics for comparing two treatments with binary outcome: relative risk (RR) and odds ratio (OR) [15,[26][27][28]. In the computation of RR and OR, the test statistic could be undefined when the estimated value in the denominator is zero [16,29,30]. In such cases, special arrangements should be utilized to properly compute the intervals for these data points.