Abstract
In clinical trials, two or more binary responses obtained by dichotomizing continuous responses are often employed as multiple primary endpoints. Testing procedures for multiple binary variables with a latent distribution have not yet been adequately discussed. Based on the association measure among the latent variables, we provide a statistic for testing the superiority of at least one binary endpoint. In addition, we propose a testing procedure within a framework in which trial efficacy is confirmed only when there is superiority of at least one endpoint and noninferiority of the remaining endpoints. The performance of the proposed procedure is evaluated through simulations.
Introduction
In confirmatory clinical trials, several correlated binary response variables are used to assess the efficacy and safety of new treatments. The ICH E9 guideline [1] recommends that the primary endpoint should consist of only one variable that provides strong scientific evidence of treatment efficacy. However, in clinical trials for a variety of diseases, it is often useful to evaluate efficacy using multiple primary endpoints. For example, in clinical trials of patients with rheumatoid arthritis, the percentage of patients achieving short-term improvement of 20 percent in the American College of Rheumatology criteria (ACR20) and the percentage achieving long-term low disease activity [Disease Activity Score (DAS28-ESR) ≤ 3.2] are often used as primary endpoints (e.g., [2]). In clinical trials of patients with psoriasis, short- and long-term improvements are simultaneously assessed based on the percentage of patients with at least 75 percent improvement in the psoriasis area-and-severity index (PASI) score (e.g., [3]). In particular, binary endpoints are often used when it is more meaningful to diagnose improvement against a clear standard than to assess the disease state using continuous variables. In such trials, we can consider that all primary endpoints are binary and often have a continuous latent distribution.
Most trials use multiple endpoints only to evaluate noninferiority or superiority, but some trials have been conducted to confirm both the noninferiority and superiority of all endpoints. For example, a clinical trial to confirm the efficacy of four-factor prothrombin complex concentrate (4F-PCC) included two primary endpoints [4], namely the percentage of patients with a hemostatic effect and the percentage with a decrease in the international normalized ratio (INR). In that trial, superiority was only evaluated if noninferiority was shown for both endpoints. When we confirm not only noninferiority but also superiority, the use of the closed testing procedure [5] for the primary analysis is reasonable, and in this case, no adjustment is needed to control the type I error rate. Sozu et al. [6,7,8] have already proposed a testing method for dealing with several endpoints in a trial and a method for calculating the sample size. Their theory is based on the framework of recognizing a treatment effect only when the superiority of all primary endpoints is confirmed. Such a setting of endpoints is called “coprimary endpoints”. In general, however, it is difficult to demonstrate the superiority of two or more endpoints because the power decreases as the number of endpoints increases. On the other hand, “multiple endpoints” are used in trials that recognize a treatment effect if superiority is shown for at least one of the endpoints. Developing a procedure that can confirm the superiority of at least one binary endpoint with a latent distribution is a challenge for statisticians in the design and analysis of clinical trials. Thus, the aim of this paper is to define a testing procedure within a framework in which the efficacy of a test treatment is confirmed only when the superiority of the treatment relative to the control is evidenced for at least one endpoint, and noninferiority is demonstrated for the remaining endpoints.
For multiple continuous endpoints, Perlman and Wu [9] proposed a testing procedure that is applicable to the framework mentioned above. Nakazuru et al. [10] proposed a more powerful testing procedure using the approximate likelihood ratio test (ALRT) defined by Glimm et al. [11]. However, there has been inadequate development of methods for multiple binary endpoints. In the same framework as this study, Ishihara and Yamamoto [12] proposed a method using multiple binary endpoints; however, their method does not assume a latent distribution for the binary variables. A statistic for testing the superiority of binary coprimary endpoints assuming latent variables has been developed by Sozu et al. [6]. Therefore, we herein propose a testing procedure that is appropriate when all endpoints are binary and have a latent distribution. Our procedure is based on the intersection-union test (IUT) proposed by Nakazuru et al. [10] and a method by Sozu et al. [6] for estimating correlations between binary endpoints that have latent variables. In particular, we consider two statistics estimated under the null and alternative hypotheses for the test of superiority. Since there has not yet been any discussion of whether the statistic obtained under the null hypothesis or that obtained under the alternative hypothesis is better in practical terms, another purpose of this study is to address this issue.
This article is structured as follows. In Sect. 2, we define several assumptions, including the hypotheses regarding the testing procedure and the association between correlated binary endpoints when taking latent variables into consideration. In Sect. 3, we give two IUT statistics for the superiority of at least one endpoint and the noninferiority of the remaining endpoints when all endpoints are binary and have a latent distribution. In Sect. 4, we provide a numerical experiment using Monte Carlo simulation to illustrate the behavior of the power and type I error rate of the proposed test. Regarding the power, the proposed statistics were compared with the closed testing procedure. In Sect. 5, we provide the results of the IUT applied to an actual clinical trial. Finally, in Sect. 6, we summarize our findings and present concluding remarks. Based on the proposed statistics and the conducted simulations, we suggest how to use the statistics in a real clinical trial.
Assumption and Hypotheses
Statistical Setting
In this article, we focus on a randomized clinical trial comparing \(p \ (\ge 2)\) endpoints between two treatment groups. There are \(n_1\) subjects in the test group and \(n_2\) subjects in the control group. Let \(Y_{ijk} \ (i=1,2; j=1,\ldots ,p; k=1,\ldots ,n_i)\) denote the binary response variable of the jth primary endpoint of the ith treatment in the kth subject. Suppose that the vectors of binary response variables \(\varvec{Y}_{ik} = (Y_{i1k},\ldots ,Y_{ipk})^{t} \ (i=1,2; k=1,\ldots ,n_i)\) are independently distributed as a p-variate Bernoulli distribution with \(\mathrm {E}(Y_{ijk}) = \pi _{ij}\), \(\mathrm {V}(Y_{ijk}) = \pi _{ij}(1-\pi _{ij})\), and \(\mathrm {Corr}(Y_{ijk},Y_{ij'k}) = \rho _{(i)jj'}\) for all \(j \ne j'\), where the superscript \(``t''\) denotes transpose. In this setting, the correlation coefficient \(\rho _{(i)jj'}\) of the multivariate Bernoulli distribution is expressed as
\[ \rho _{(i)jj'} = \frac{\phi _{(i)jj'} - \pi _{ij}\pi _{ij'}}{\sqrt{\pi _{ij}(1-\pi _{ij})\pi _{ij'}(1-\pi _{ij'})}}, \qquad (1) \]
where \(\phi _{(i)jj'}\) is the joint probability of the two response variables \((Y_{ijk}, Y_{ij'k})\). Note that the range of \(\rho _{(i)jj'}\) is narrower than \((-1, 1)\) depending on the values of \(\pi _{ij}\) and \(\pi _{ij'}\) (Bahadur [13]). That is, \(\rho _{(i)jj'}\) is bounded below by
\[ \max \left\{ -\sqrt{\frac{\pi _{ij}\pi _{ij'}}{(1-\pi _{ij})(1-\pi _{ij'})}},\ -\sqrt{\frac{(1-\pi _{ij})(1-\pi _{ij'})}{\pi _{ij}\pi _{ij'}}} \right\} \]
and above by
\[ \min \left\{ \sqrt{\frac{\pi _{ij}(1-\pi _{ij'})}{\pi _{ij'}(1-\pi _{ij})}},\ \sqrt{\frac{\pi _{ij'}(1-\pi _{ij})}{\pi _{ij}(1-\pi _{ij'})}} \right\}. \]
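As a quick numerical companion to the Bahadur restriction, the following sketch (our own illustration, not part of the paper's procedure; the function name is hypothetical) computes the attainable correlation range for a pair of marginal probabilities:

```python
import math

def bahadur_bounds(p1: float, p2: float) -> tuple:
    """Attainable range of the correlation between two Bernoulli
    variables with success probabilities p1 and p2 (Bahadur bounds)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    lower = max(-math.sqrt(p1 * p2 / (q1 * q2)),
                -math.sqrt(q1 * q2 / (p1 * p2)))
    upper = min(math.sqrt(p1 * q2 / (p2 * q1)),
                math.sqrt(p2 * q1 / (p1 * q2)))
    return lower, upper

# Equal marginals of 0.5 allow the full range (-1, 1);
# unequal marginals shrink the attainable range.
print(bahadur_bounds(0.5, 0.5))  # (-1.0, 1.0)
print(bahadur_bounds(0.2, 0.5))  # (-0.5, 0.5)
```

This makes concrete why the binary correlation \(\rho\) cannot be chosen freely once the marginal probabilities are fixed.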
Furthermore, we assume that \(\varvec{Y}_{ik}\) are dichotomized random variables of continuous unobservable responses \(\varvec{Z}_{ik} = (Z_{i1k},\ldots ,Z_{ipk})^{t} \ (i=1,2; k=1,\ldots ,n_i)\). We also assume that \(\varvec{Z}_{ik}\) are independently distributed as a standardized p-variate normal distribution with \(\mathrm {Corr}(Z_{ijk},Z_{ij'k}) = \gamma _{(i)jj'}\) for all \(j \ne j'\). For each variable \(\varvec{Z}_{ik}\), there is a single threshold \(g_{ij} = \Phi ^{-1}(1-\pi _{ij}) \ (i=1,2; j=1,\ldots ,p)\) that partitions the latent distribution, where \(\Phi ^{-1}\) is the inverse of the standard normal cumulative distribution function. Then, the binary response \(Y_{ijk} \ (i =1,2; j=1,\ldots ,p; k=1,\ldots ,n_i)\) can be defined as
\[ Y_{ijk} = \begin{cases} 1 & (Z_{ijk} > g_{ij}) \\ 0 & (Z_{ijk} \le g_{ij}). \end{cases} \]
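The latent-threshold model above can be sketched as an illustrative simulation (our own; the function name and parameter values are hypothetical), assuming the standardized multivariate normal latent distribution:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_binary(n, pi, gamma):
    """Draw n correlated binary response vectors by thresholding a
    standardized multivariate normal latent vector.
    pi    : marginal response probabilities (length p)
    gamma : p x p latent correlation matrix
    """
    pi = np.asarray(pi)
    g = norm.ppf(1.0 - pi)           # cut-off points g_ij = Phi^{-1}(1 - pi_ij)
    z = rng.multivariate_normal(np.zeros(len(pi)), gamma, size=n)
    return (z > g).astype(int)       # Y = 1 exactly when the latent Z exceeds g

y = simulate_binary(100000, [0.6, 0.5], [[1.0, 0.4], [0.4, 1.0]])
print(y.mean(axis=0))                # sample proportions close to [0.6, 0.5]
```

The empirical proportions recover the targeted marginals because \(P(Z_{ijk} > g_{ij}) = 1 - \Phi(g_{ij}) = \pi_{ij}\).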
Set \(\varvec{X} = (X_1,\ldots ,X_p)^{t}\) with \(X_j = \overline{Y}_{1j}-\overline{Y}_{2j} \ (j=1,\ldots ,p)\), where \(\overline{Y}_{ij} \ (i =1,2; j=1,\ldots ,p)\) is the sample proportion for the jth endpoint of the ith treatment. Let the true proportion vector for the ith treatment be \(\varvec{\pi }_i = (\pi _{i1},\ldots ,\pi _{ip})^t\), with difference of proportions \(\varvec{\Delta } = (\delta _1,\ldots ,\delta _p )^t = \varvec{\pi }_1 - \varvec{\pi }_2\) and covariance matrix \(\varvec{\Sigma }\) defined as
\[ \varvec{\Sigma } = \varvec{\Sigma }^{(1)} + \varvec{\Sigma }^{(2)}, \]
where \(\varvec{\Sigma }^{(i)} \ (i=1,2)\) is the covariance matrix of \((\overline{Y}_{i1},\ldots ,\overline{Y}_{ip})^{t}\). Note that \(\varvec{X}\) is approximately normally distributed with mean \(\varvec{\Delta }\) and covariance matrix \(\varvec{\Sigma }\).
Hypotheses
Without loss of generality, we assume that test treatment superiority is recognized when the proportion of responses to the test treatment is greater than that to the control treatment. That is, a maximum value of \(\delta _j \ (j=1,\ldots ,p)\) greater than 0 indicates an improvement in at least one endpoint of the test treatment compared with the control treatment. In the framework dealt with in this study, a test treatment effect is recognized only when the null hypothesis for the superiority of at least one endpoint and the null hypotheses for all noninferiorities are rejected simultaneously. In such a framework, the null hypotheses of superiority and noninferiority are represented by a union. Therefore, we consider the combined hypothesis for the superiority of at least one endpoint and the noninferiority of the remaining endpoints. We consider a null hypothesis \(H_0\) and an alternative hypothesis expressed by
\[ H_{0}: \max _{1 \le j \le p} \delta _{j} \le 0 \ \ \text {or} \ \ \min _{1 \le j \le p} (\delta _{j} + \epsilon _{j}) \le 0, \qquad H_{1}: \mathrm {not} \ H_{0}, \]
where \(\epsilon _{j} > 0 \ (j=1,\ldots ,p)\) is the noninferiority margin of the jth endpoint, a prespecified positive constant. Furthermore, the noninferiority part of \(H_0\) can be expressed as a union of null hypotheses of noninferiority for the individual endpoints, since it means that \(\delta _{j} + \epsilon _{j} \le 0\) for at least one \(j \ (j=1,\ldots ,p)\). Therefore, \(H_{0}\) is also expressed as
\[ H_{0} = H^{(0)}_{0} \cup H^{(1)}_{0} \cup \cdots \cup H^{(p)}_{0}, \]
which defines the sub-hypothesis of superiority “\(H^{(0)}_{0} : \displaystyle \max _{1 \le j \le p} \delta _{j} \le 0\)” and the sub-hypotheses of noninferiority “\(H^{(j)}_{0} : \delta _{j} \le -\epsilon _{j}\)”, for \(j=1,\ldots ,p\). \(H^{(0)}_{0}\) is amenable to the one-sided ALRT, and the IUT (Berger [14]) can be applied to test \(H_{0}\).
New Test Statistics
To determine the IUT statistics, we need to estimate \({g}_{ij}\), \(\pi _{ij}\), and \(\phi _{(i)jj'}\) considering the latent variable underlying \(Y_{ijk}\). In Subsect. 3.1 below, we propose an estimation procedure for these parameters. While Sozu et al. [6] used a sample proportion to obtain an estimator of \(g_{ij}\), we propose a new procedure for estimating \(g_{ij}\) under the sub-null hypothesis \(H^{(0)}_{0}\). In Subsect. 3.2, we propose a new testing procedure that extends the procedure of Nakazuru et al. [10] using the parameters estimated in Subsect. 3.1.
Proposed Estimating Procedure
For the sake of simplicity, the process of estimating parameters is divided into the following two steps.
Step 1. Estimating the Cut-Off Point \(\varvec{g_{ij}}\)
We assume that \(\hat{g}_{ij}\) is the estimator of the latent cut-off point \(g_{ij}\), estimated as \(\hat{g}_{ij} = \Phi ^{-1}(1-\widetilde{\pi }_{ij})\), where \(\widetilde{\pi }_{ij}\) is the maximum likelihood estimator (MLE) of the marginal probability derived from the p-variate Bernoulli distribution. Let the probability mass function of the p-variate Bernoulli distribution be
\[ \Pr (\varvec{Y}_{ik} = (s_1,\ldots ,s_p)^{t}) = \theta _{(i) s_1, s_2,\ldots , s_{p}}, \qquad (s_1,\ldots ,s_p) \in S, \]
where \(\theta _{(i) 0, 0,\ldots , 0},\ldots , \theta _{(i) 1, 1,\ldots , 1}\) are the joint probabilities when \(\varvec{Y}_{ik}\) takes the values \((0,\ldots , 0), \ldots ,(1,\ldots ,1)\), respectively, \(\theta _{(i) 0, 0,\ldots , 0}+ \cdots +\theta _{(i) 1, 1,\ldots , 1} = 1\), and \(S = \{ (s_1, s_2, \ldots, s_p) \mid s_j = 0, 1;\ j=1,\ldots ,p \}\) is the set whose elements consist of all possible vectors of response values. \(\widetilde{\pi }_{ij}\) can then be expressed as \(\widetilde{\pi }_{ij} = \sum _{(s_1, s_2,\ldots , s_{p}) \in S,\, s_{j}=1}\hat{\theta }_{(i) s_1, s_2,\ldots , s_{p}}\) using the estimator of \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\).
In addition, \(\widetilde{\pi }_{ij}\) (and hence \(\hat{g}_{ij}\)) can be given in two ways, depending on whether \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\) is estimated under the sub-null hypothesis \(H^{(0)}_{0}\) or the sub-alternative hypothesis \(H^{(0)}_{1}:\mathrm {not} \ H^{(0)}_{0}\). In particular, under the sub-null hypothesis \(H^{(0)}_{0}\), the Lagrange multiplier method is useful for obtaining the MLE. On the other hand, under the sub-alternative hypothesis \(H^{(0)}_{1}\), the estimator \(\hat{\theta }_{(i) s_1, s_2,\ldots , s_{p}}\) is obtained in closed form as a sample proportion.
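As an illustration of the closed-form case under the sub-alternative hypothesis, the sketch below (helper names are ours) estimates the cell probabilities as sample proportions and recovers the marginal MLE by summing the cells with s_j = 1; as expected, this reduces to the ordinary sample proportion:

```python
import numpy as np
from collections import Counter

def cell_probabilities(y):
    """Sample-proportion estimates of the p-variate Bernoulli cell
    probabilities theta (the closed-form MLE under the sub-alternative)."""
    n = len(y)
    counts = Counter(map(tuple, y))
    return {cell: c / n for cell, c in counts.items()}

def marginal_from_cells(theta, j):
    """pi_tilde_ij: sum of the cell probabilities with s_j = 1."""
    return sum(p for cell, p in theta.items() if cell[j] == 1)

# Toy data set with p = 2 endpoints and n = 5 subjects.
y = np.array([[1, 1], [1, 0], [0, 1], [1, 1], [0, 0]])
theta = cell_probabilities(y)
# Summing the cells with s_j = 1 recovers the ordinary sample proportion.
print(marginal_from_cells(theta, 0), y[:, 0].mean())  # 0.6 0.6
```

Under the sub-null hypothesis the cells would instead be estimated by constrained maximization (e.g., via Lagrange multipliers), and the two estimates no longer coincide.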
Step 2. Estimating the Joint and Marginal Probabilities
The estimator of the joint probability \(\phi _{(i)jj'}\) in (1) is also given in two ways, depending on \(\hat{g}_{ij}\), which is obtained from the estimate of \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\) constructing \(\widetilde{\pi }_{ij}\). \(\hat{\phi }_{(i)jj'}\) can be given by
\[ \hat{\phi }_{(i)jj'} = \int _{\hat{g}_{ij}}^{\infty } \int _{\hat{g}_{ij'}}^{\infty } f(z_j, z_{j'}; \hat{\gamma }_{(i)jj'}) \, dz_{j'} \, dz_{j}, \]
where \(f(z_j, z_{j'}; \hat{\gamma }_{(i)jj'})\) is the density of the standardized bivariate normal marginal of \(\varvec{Z}_{ik}\) for the pair \((j, j')\), and \(\hat{\gamma }_{(i)jj'}\) is Pearson's tetrachoric correlation (Pearson [15]) calculated from \((Y_{ij1},\ldots , Y_{ijn_i})\) and \((Y_{ij'1},\ldots , Y_{ij'n_i})\). Therefore, if the latent variables underlying the binary responses are assumed to follow a standardized multivariate normal distribution, \(\hat{\phi }_{(i)jj'}\) is determined by \(\hat{\gamma }_{(i)jj'}\) and the cut-off points given in Step 1. Furthermore, the estimator of the marginal probability \(\pi _{ij}\) constructing \(\Sigma \) and \(\rho _{(i)jj'}\) in (1) should not be \(\widetilde{\pi }_{ij}\), which is obtained from the p-variate Bernoulli distribution, so as to take the latent distribution into account. Let \(\hat{\pi }_{ij}\) denote the estimator of \(\pi _{ij}\), given by
\[ \hat{\pi }_{ij} = \int _{\hat{g}_{ij}}^{\infty } \varphi (z) \, dz = 1 - \Phi (\hat{g}_{ij}), \]
where \(\varphi \) is the standard normal density function. For example, with \(p=2\) endpoints, the estimator of \(\phi _{(i)12}\) is written as
\[ \hat{\phi }_{(i)12} = \int _{\hat{g}_{i1}}^{\infty } \int _{\hat{g}_{i2}}^{\infty } f(z_1, z_2; \hat{\gamma }_{(i)12}) \, dz_2 \, dz_1, \]
and the marginal probabilities \(\pi _{i1}\) and \(\pi _{i2}\) are estimated as
\[ \hat{\pi }_{i1} = \int _{\hat{g}_{i1}}^{\infty } \int _{-\infty }^{\infty } f(z_1, z_2; \hat{\gamma }_{(i)12}) \, dz_2 \, dz_1, \qquad \hat{\pi }_{i2} = \int _{-\infty }^{\infty } \int _{\hat{g}_{i2}}^{\infty } f(z_1, z_2; \hat{\gamma }_{(i)12}) \, dz_2 \, dz_1. \]
Along with the estimation of the joint and marginal probabilities, the estimated covariance matrix \(\hat{\varvec{\Sigma }}\) is defined as
\[ \hat{\varvec{\Sigma }} = \hat{\varvec{\Sigma }}^{(1)} + \hat{\varvec{\Sigma }}^{(2)}, \]
where the \((j, j')\) element of \(\hat{\varvec{\Sigma }}^{(i)}\) is \(\hat{\rho }_{(i)jj'} \sqrt{\hat{\pi }_{ij}(1-\hat{\pi }_{ij}) \hat{\pi }_{ij'}(1-\hat{\pi }_{ij'})}/n_i\) with \(\hat{\rho }_{(i)jj} = 1\), and \(\hat{\rho }_{(i)jj'}\) is obtained by substituting \(\hat{\phi }_{(i)jj'}\) and \(\hat{\pi }_{ij}\) into (1).
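Step 2 can be sketched numerically: under the latent normal assumption, the joint probability is an upper-orthant probability of a bivariate normal distribution, and substituting it into (1) yields the implied binary correlation. The following is an illustrative sketch of our own (function names are hypothetical), using SciPy's bivariate normal CDF:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_prob(pi1, pi2, gamma):
    """P(Z1 > g1, Z2 > g2) under a standardized bivariate normal latent
    distribution with correlation gamma, where g = Phi^{-1}(1 - pi)."""
    g1, g2 = norm.ppf(1.0 - pi1), norm.ppf(1.0 - pi2)
    f = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, gamma], [gamma, 1.0]])
    # Upper-orthant probability via inclusion-exclusion on the CDF.
    return 1.0 - norm.cdf(g1) - norm.cdf(g2) + f.cdf([g1, g2])

def binary_corr(pi1, pi2, gamma):
    """Binary correlation implied by equation (1)."""
    phi = joint_prob(pi1, pi2, gamma)
    return (phi - pi1 * pi2) / np.sqrt(pi1 * (1 - pi1) * pi2 * (1 - pi2))

# Independence check: with gamma = 0 the joint probability factorizes.
print(joint_prob(0.2, 0.3, 0.0))   # ~ 0.2 * 0.3 = 0.06
print(binary_corr(0.2, 0.3, 0.4))
```

In the paper's procedure, `gamma` would be replaced by the tetrachoric correlation estimate and the cut-offs by those from Step 1.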
Proposed Test Statistics
We consider the following new IUT statistics to test hypothesis \(H_{0}\) versus \(H_{1}\).
where \(\acute{\pi }_{ij}\) is the MLE of \(\pi _{ij}\) derived under the sub-null hypothesis of noninferiority \(H^{(j)}_{0}\), and \(T^{(j)}\) is a statistic commonly used in noninferiority tests of binary endpoints (Farrington and Manning [16]). \(T^{(0)}\) and \(T^{(j)}\) are test statistics corresponding to the null hypotheses for superiority \(H_{0}^{(0)}\) and noninferiority \(H_{0}^{(1)},\ldots ,H_{0}^{(p)}\), respectively. The proposed IUT rejects \(H_{0}\) if and only if \(T^{(0)} > c\) and \(T^{(j)} > z_\alpha \) for all \(j = 1,\ldots ,p\), where c is a constant determined by
Here, \(\chi _{j}^2\) denotes the \(\chi ^2\) distribution with j degrees of freedom, and \(\chi ^2_{0}\) is defined as the constant zero. \(\alpha \) is the nominal significance level, and \(z_{\alpha }\) is the upper 100\(\alpha \)th percentile of the standard normal distribution. See the Appendix for the derivation of \(\overline{u}_{A}\) and \(\overline{u}_{B}\).
Note that the test statistic \(T^{(0)}\) can be constructed in two ways. One uses \(\hat{\varvec{\Sigma }}\) derived under the sub-null hypothesis \(H^{(0)}_{0}\), and the other uses \(\hat{\varvec{\Sigma }}\) derived under the sub-alternative hypothesis \(H^{(0)}_{1}\). Thus, there are also two types of IUT statistics. Let the IUT statistics using \(T^{(0)}\) estimated under the sub-null hypothesis \(H^{(0)}_{0}\) be the \(T^{(0)}_{0}\) test type, and those using \(T^{(0)}\) estimated under the sub-alternative hypothesis \(H^{(0)}_{1}\) be the \(T^{(0)}_{1}\) test type.
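The noninferiority statistics \(T^{(j)}\) and the IUT decision rule can be sketched as follows. The restricted MLEs use the published closed form of Farrington and Manning (1990) for the risk-difference score test; the superiority statistic \(T^{(0)}\) and the critical value c are taken as given, since they depend on the ALRT machinery in the Appendix, and all function names are ours:

```python
import math

def fm_restricted_mle(p1, p2, n1, n2, d0):
    """Restricted MLEs of (pi_1j, pi_2j) under pi_1j - pi_2j = d0
    (closed-form cubic solution of Farrington and Manning, 1990)."""
    theta = n2 / n1
    a = 1.0 + theta
    b = -(1.0 + theta + p1 + theta * p2 + d0 * (theta + 2.0))
    c = d0 * d0 + d0 * (2.0 * p1 + theta + 1.0) + p1 + theta * p2
    d = -p1 * d0 * (1.0 + d0)
    v = b**3 / (3.0 * a)**3 - b * c / (6.0 * a * a) + d / (2.0 * a)
    u = math.copysign(math.sqrt(b * b / (3.0 * a)**2 - c / (3.0 * a)), v)
    w = (math.pi + math.acos(v / u**3)) / 3.0
    p1t = 2.0 * u * math.cos(w) - b / (3.0 * a)
    return p1t, p1t - d0

def fm_statistic(p1, p2, n1, n2, eps):
    """T^{(j)} for the noninferiority null H0^{(j)}: delta_j <= -eps."""
    p1t, p2t = fm_restricted_mle(p1, p2, n1, n2, -eps)
    se = math.sqrt(p1t * (1 - p1t) / n1 + p2t * (1 - p2t) / n2)
    return (p1 - p2 + eps) / se

def iut_reject(t0, t_noninf, c, z_alpha):
    """Reject H0 iff T^{(0)} > c and every T^{(j)} > z_alpha."""
    return t0 > c and all(t > z_alpha for t in t_noninf)

t = fm_statistic(0.8, 0.7, 100, 100, 0.1)
print(round(t, 3))
```

Note that the restricted MLEs satisfy the constraint \(\delta_j = -\epsilon_j\) exactly, which is what makes the variance estimate in the denominator valid under the boundary of \(H^{(j)}_{0}\).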
Simulation Study
In order to evaluate the performance of the proposed IUT, we use Monte Carlo simulation to calculate the type I error rate and the powers of the \(T_{0}^{(0)}\) and \(T_{1}^{(0)}\) test types. In all simulations, we consider \(n_{1}=n_{2}=50, 100, 200\), \(\epsilon _{1}=\epsilon _{2}=0.2\), and \(\alpha =0.05\). The random numbers are generated from a standardized p-variate normal distribution, and the response variables are obtained by dichotomizing the random numbers using \(\varvec{\pi }_{i}=({\pi }_{i1},\ldots , {\pi }_{ip})^t\). The correlation between the latent variables is set to \(\rho =0, 0.4, 0.8\).
Type I Error Rate
We compare the type I error rate of the \(T_{0}^{(0)}\) test type and the \(T_{1}^{(0)}\) test type in the case \(p=2\). The generation of simulated data is repeated 1,000,000 times.
Table 1 shows the type I error rates for the two test types in the case \(p=2\). The type I error rate is greater than the nominal significance level for the \(T^{(0)}_{1}\) test type when the correlation between the endpoints is zero with a large sample size. The \(T^{(0)}_{0}\) test type is more conservative than the \(T^{(0)}_{1}\) test type. The type I error rate is lower when \(\pi _{ij}\) is close to zero than when \(\pi _{ij}\) is 0.5. In the case where at least one of the differences in \(\pi _{ij}\) is less than zero and is within the noninferiority margin, the type I error rate is less than when the difference in \(\pi _{ij}\) is zero, and it markedly decreases as the difference in \(\pi _{ij}\) becomes closer to the noninferiority margin. The type I error rate is much smaller when there are inferior endpoints.
In addition, focusing on the scenarios in which inflation was likely to occur in the case \(p=2\), we generated simulated data 100,000 times and also examined the type I error rate in the case \(p=3\).
Table 2 shows the type I error rates for the two test types in the case \(p=3\). When the sample size is large and the correlations among the three endpoints are high, the type I error rate for both \(T^{(0)}_{0}\) and \(T^{(0)}_{1}\) is greater than the nominal significance level.
Power
We compare the powers of the \(T_{0}^{(0)}\) and \(T_{1}^{(0)}\) test types for the proposed IUT in the case \(p=2\). We also compare the power of the proposed IUT with that of a closed testing procedure that confirms the superiority of at least one of the two endpoints after the noninferiority of both endpoints is confirmed. The Bonferroni-corrected p-value (Bonferroni [17]) is used to test for superiority in the closed testing procedure. The generation of simulated data is repeated 100,000 times.
Table 3 shows the empirical powers of the proposed IUT and the closed testing procedure. The power of the \(T^{(0)}_{1}\) test type is greater than that of the \(T^{(0)}_{0}\) test type, and this difference becomes larger as the correlation between the endpoints increases when the sample size is small. Even when the difference between the endpoints increases, the relationship between the powers of the \(T^{(0)}_{0}\) and \(T^{(0)}_{1}\) test types does not change much. On the other hand, as the sample size increases, the power of the \(T^{(0)}_{0}\) test type becomes similar to that of the \(T^{(0)}_{1}\) test type. Furthermore, the power of the proposed IUT is always greater than that of the closed testing procedure.
Even if the differences in \(\pi _{ij}\) are identical to each other, the power tends to be large when \(\pi _{ij}\) is close to zero, except when the correlation is zero and the sample size is small. On the other hand, if at least one endpoint is superior and the differences of all remaining endpoints are less than zero and within the noninferiority margin, the power is lower than when all of the differences in \(\pi _{ij}\) are greater than zero. Furthermore, when there is an endpoint that is inferior, the power becomes quite small. The power does not monotonically increase or decrease depending on the correlation coefficient. The results for \(p=2\) and \(p=3\) in Subsect. 4.3 below show that as the number of endpoints that differ between the two groups increases, the power of the closed testing procedure is noticeably lower than that of the proposed IUT.
Power Reduction when a Noninferiority Test is Added to a Superiority Test
We also compare the performance of the proposed IUT and a test excluding the noninferiority component in the cases of \(p=2\) and \(p=3\). The generation of simulated data is repeated 100,000 times.
Table 4 shows a power comparison between the proposed IUT and the superiority test alone for the case of \(p=2\). If all the differences in \(\pi _{ij}\) are greater than or equal to zero, the powers of the superiority test alone and the IUT remain similar as the sample size increases, regardless of the value of the correlation coefficient. On the other hand, when the differences in \(\pi _{ij}\) between the two groups are partially within the noninferiority margin, the power of the IUT is low compared with that of the superiority test alone. When \(\pi _{ij}\) is close to zero, the powers of the superiority test alone and the IUT are more similar than when \(\pi _{ij}\) is close to 0.5.
Table 5 shows a power comparison between the proposed IUT and the superiority test alone for the case of \(p=3\). As in the case of \(p=2\), if all the differences in \(\pi _{ij}\) are greater than or equal to zero, adding a noninferiority test does not reduce the power much when the sample size is large. Conversely, the larger the number of endpoints for which the differences in \(\pi _{ij}\) are partially within the noninferiority margin, the greater the reduction in the power of the IUT.
Numerical Example
We present the results of applying the proposed IUT to an actual trial that confirmed the efficacy of 4F-PCC [4]. The clinical trial is a multicentre, randomized, open-label, phase III trial in patients aged 18 years or older needing rapid vitamin K antagonist reversal before an urgent surgical or invasive procedure. As mentioned in Sect. 1, this study includes two primary endpoints, that is, (i) the percentage of patients with a hemostatic effect, defined as a binary variable based on predicted blood loss, and (ii) the percentage with a decrease in the INR. The analyses were intended to evaluate, in a hierarchical fashion, first noninferiority for both endpoints, then superiority if noninferiority was achieved. Based on the results of the study (for details, see Fig. 3 in [4]), we consider the case of \(\hat{\pi }_{11} = 0.9\) and \(\hat{\pi }_{21} = 0.55\) for the percentage with a hemostatic effect, and \(\hat{\pi }_{12} = 0.75\) and \(\hat{\pi }_{22} = 0.1\) for the percentage with a decrease in the INR. The noninferiority margin is set at 0.1 for both endpoints, and the sample sizes of the two groups are \(n_1 = 87\) and \(n_2 = 81\). Since the correlations between the two variables cannot be derived from the reported results, we consider the cases \(\hat{\gamma }_{(1)12}=\hat{\gamma }_{(2)12}=0, 0.2, 0.4, 0.6, 0.8\). Since only the statistics calculated under the sub-alternative hypothesis \(H^{(0)}_{1}\) can be obtained from the reported information, Table 6 provides the values of \(T^{(0)}_{1}\), \(T^{(1)}\), and \(T^{(2)}\) together with the results of the IUT (rejected or accepted) at significance level \(\alpha \) = 0.05.
Although showing noninferiority for all endpoints was the first step in the actual trial, even if the test had simultaneously taken into account the superiority of at least one endpoint, the same conclusion would have been obtained for any assumed value of the correlation coefficient.
Concluding Remarks
In this article, we developed a testing procedure for studies with multiple binary endpoints and a latent distribution. This was performed within a framework in which the efficacy of a test treatment is recognized when at least one endpoint demonstrates superiority and the remaining endpoints demonstrate noninferiority. We derived two types of test statistics using cut-off points estimated under the sub-null hypothesis \(H^{(0)}_{0}\) and the sub-alternative hypothesis \(H^{(0)}_{1}\), and these procedures were compared in a numerical experiment using Monte Carlo simulation.
The numerical experiment clearly demonstrated that the \(T^{(0)}_{1}\) test type was always more powerful than the \(T^{(0)}_{0}\) test type. However, \(\alpha \)-violation occurred in the \(T^{(0)}_{1}\) test type when the sample size was large and the correlation coefficient was zero in the case \(p=2\). As the number of endpoints increased, we also found that \(\alpha \)-violation was more likely to occur in the scenario where all differences of the endpoints were zero. On the other hand, \(\alpha \)-violation did not occur when any of the endpoints was inferior. We believe that this will not be a fatal problem in practice because the framework of this study allows for the case in which some of the endpoints are inferior. However, since \(\alpha \)-violation is a serious issue in confirmatory clinical trials, it is necessary to develop a method that does not cause \(\alpha \)-violation, or we should choose a non-inflationary test when the number of endpoints is large. Since there was not a large difference in power between the \(T^{(0)}_{0}\) and \(T^{(0)}_{1}\) test types, it may be reasonable to preferentially use the \(T^{(0)}_{0}\) test type, especially if the correlation coefficients between the endpoints have not been investigated. Furthermore, this study showed a marked decrease in power as the number of differences in endpoints within the noninferiority margin increased. Even if the number of superior endpoints was the same, power did not decrease when the number of endpoints was increased.
The power does not monotonically increase or decrease depending on the correlation coefficient between the endpoints. When determining the sample size, it is difficult to know whether the power will increase if the correlation is changed under a fixed difference in endpoints. A further non-negligible result is that type I error rates tend to increase when the correlation is low and the sample size is large. We also found that as the number of endpoints increases, the higher the correlation, the more likely the type I error rate is to inflate. For a less problematic sample size determination, the correlations between endpoints should be accurately investigated before trial planning. However, in practice, it is difficult to estimate the correlation coefficients between endpoints during trial planning. Therefore, we recommend that power be simulated assuming several correlation coefficients under fixed proportions for each endpoint, and that the most conservative sample size among all scenarios be used in the trial. According to Offen et al. [18] and Sankoh et al. [19], the correlation coefficients between multiple endpoints in clinical trials are approximately equal to 0.4 and range from 0.2 to 0.8, which may help in setting up the simulation scenarios. For example, if we assume that the proportions of the two responses are 0.6 for the treatment group and 0.5 for the control group, the power will exceed 0.8 if the correlation coefficient is zero, but will fall below 0.8 if the correlation coefficient is 0.4 to 0.8. Therefore, assuming a correlation of 0.8 in this case would be a conservative design.
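The recommended practice of scanning several correlation coefficients at fixed proportions can be sketched as follows. For brevity, a Bonferroni max-z superiority test is used here as a simplified stand-in for the proposed IUT (so the absolute power values are only indicative), and all names and parameter values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_power(pi1, pi2, n, gamma, alpha=0.05, reps=2000):
    """Empirical power of a Bonferroni max-z superiority test under a
    latent normal model with exchangeable latent correlation gamma."""
    p = len(pi1)
    cov = np.full((p, p), gamma)
    np.fill_diagonal(cov, 1.0)
    g1 = norm.ppf(1 - np.asarray(pi1))   # cut-offs for the test group
    g2 = norm.ppf(1 - np.asarray(pi2))   # cut-offs for the control group
    crit = norm.ppf(1 - alpha / p)       # Bonferroni critical value
    rejections = 0
    for _ in range(reps):
        y1 = (rng.multivariate_normal(np.zeros(p), cov, size=n) > g1).mean(axis=0)
        y2 = (rng.multivariate_normal(np.zeros(p), cov, size=n) > g2).mean(axis=0)
        se = np.sqrt(y1 * (1 - y1) / n + y2 * (1 - y2) / n)
        rejections += ((y1 - y2) / se).max() > crit
    return rejections / reps

# Scan the correlation grid at fixed proportions, then take the most
# conservative (lowest-power) scenario for sample size planning.
for gamma in (0.0, 0.4, 0.8):
    print(gamma, simulate_power([0.6, 0.6], [0.5, 0.5], 200, gamma))
```

A full planning exercise would replace the stand-in test with the proposed IUT statistics and repeat the scan over candidate sample sizes.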
Incidentally, like the proposed testing procedure, a closed testing procedure can be used for multiple endpoints where the familywise error rate is kept below the nominal significance level. In the framework of this study, the proposed IUT was shown to be more powerful than the closed testing procedure regardless of the correlation coefficient between endpoints, the difference between endpoints, the number of noticeably different endpoints, the sample size, and whether or not the proportions are close to zero. Although the closed testing procedure has a significant advantage in that it does not require control of the type I error rate in individual tests when there are inclusion relationships between null hypotheses, it may be more reasonable to use the proposed IUT in the framework of this study, where the superiority of at least one endpoint and the noninferiorities of the remaining endpoints are confirmed simultaneously.
We also demonstrated the power reduction that occurs when the noninferiority test is added to the superiority test. Our simulations showed that there was only a minimal decrease in power when the proportions of responses to the test treatment were all, or mostly, higher than those to the control treatment. When the proportions were close to zero, the power was almost the same for the superiority test alone and the proposed IUT. By contrast, the power was reduced when the treatment effects were partially within the noninferiority margin for a small sample size. In particular, the power decreased remarkably with the number of differences in the primary endpoints between the two groups that fell within the noninferiority margin. Note that in such a situation, a large sample size is needed to detect the difference. The development of more efficient methods with higher power in such cases is required in the future.
Furthermore, the smaller the correlation coefficient, the lower the power of the proposed method in comparison with a procedure that tests only superiority. Nevertheless, when all endpoints are binary and have a continuous latent distribution, it is ideal in practice to confirm not only the superiority of at least one endpoint but also the noninferiority of all remaining endpoints in a primary analysis using the proposed testing procedure, with the sample size determined by assuming differences in proportions and correlations between endpoints.
In the fields of economics, business, and education, a paradigm shift toward methods that do not use p-values to establish statistical evidence continues to be proposed (Bhatti and Kim [20]). Currently, under drug approval regulations, it is unavoidable that the efficacy of a drug is determined by the p-value of the confirmatory trial. However, in the future, we will also need to develop methods that do not rely on p-values in order to draw conclusions, even in the exploratory phase.
Abbreviations
ACR: American College of Rheumatology criteria
ACR20: 20% improvement in American College of Rheumatology criteria
ALRT: approximate likelihood ratio test
DAS28: Disease Activity Score modified to include the 28 diarthrodial joint count
ESR: erythrocyte sedimentation rate
4F-PCC: four-factor prothrombin complex concentrate
ICH: International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use
INR: international normalized ratio
IUT: intersection-union test
MLE: maximum likelihood estimator
References
 1.
International Council for Harmonisation (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use: ICH Harmonised Tripartite Guideline E9: Statistical Principles for Clinical Trials (1998)
 2.
Smolen, J.S., Burmester, G.R., Combe, B., Curtis, J.R., Hall, S., Haraoui, B., van Vollenhoven, R., Cioffi, C., Ecoffet, C., Gervitz, L., Ionescu, L., Peterson, L., Fleischmann, R.: Head-to-head comparison of certolizumab pegol versus adalimumab in rheumatoid arthritis: 2-year efficacy and safety results from the randomised EXXELERATE study. Lancet 388(10061), 2763–2774 (2016). https://doi.org/10.1016/S0140-6736(16)31651-8
 3.
Reich, K., Langlay, R.G., Papp, K.A., Ortonne, J.P., Unnebrink, K., Kaul, M., Valdes, J.M.: A 52week trial comparing briakinumab with methotrexate in patients with psoriasis. N. Engl. J. Med. 365(17), 1586–1596 (2011). https://doi.org/10.1056/NEJMoa1010858
 4.
Goldstein, J.N., Refaai, M.A., Jr., Milling, T.J., Lewis, B., GoldbergAlberts, R., Hug, B.A., Sarode, R.: Fourfactor prothrombin complex concentrate versus plasma for rapid vitamin K antagonist reversal in patients needing urgent surgical or invasive interventions: a phase 3b, openlabel, noninferiority, randomised trial. Lancet 385(9982), 2077–2087 (2015). https://doi.org/10.1016/S01406736(14)616858
 5.
Marcus, R., Peritz, E., Gabriel, K.R.: On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976). https://doi.org/10.2307/2335748
 6.
Perlman, M.D., Wu, L.: A note on onesided tests with multiple endpoints. Biometrics 60, 276–280 (2004). https://doi.org/10.1111/j.0006341X.2004.00159.x
 7.
Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in clinical trials with multiple coprimary binary endpoints. Stat. Med. 29(21), 2169–2179 (2010). https://doi.org/10.1002/sim.3972
 8.
Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in superiority clinical trials with multiple coprimary correlated endpoints. J. Biopharm. Stat. 21(4), 650–668 (2011). https://doi.org/10.1080/10543406.2011.551329
 9.
Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in clinical trials with multiple coprimary endpoints including mixed continuous and binary variables. Biom. J. 54(5), 716–729 (2012). https://doi.org/10.1002/bimj.201100221
 10.
Nakazuru, Y., Sozu, T., Hamada, C., Yoshimura, I.: A new procedure of onesided test in clinical trials with multiple endpoints. Jpn. J. Biom. 35, 17–35 (2014). https://doi.org/10.5691/jjb.35.17
 11.
Glimm, E., Srivastava, M., Lauter, J.: Multivariate tests of normal mean vectors with restricted alternatives. Commun. Stat. B: Simul. Comput. 31, 589–604 (2002)
 12.
Ishihara, T., Yamamoto, K.: A testing procedure in clinical trials with multiple binary endpoints. Commun. Stat. Theory Methods. Advance Online Publication (2021)
 13.
Bahadur, R.R.: In studies in item analysis and prediction, Stanford Mathematical Studies in the Social Sciences. In: Solomon H (ed) Stanford University Press, Stanford, pp 158–168 (1961)
 14.
Berger, R.L.: Multiparameter hypothesis testing and acceptance sampling. Technometrics 24, 295–300 (1982). https://doi.org/10.1080/00401706.1982.10487790
 15.
Bonferroni, C.E.: Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936)
 16.
Farrington, C.P., Manning, G.: Test statistics and sample size formulae for comparative binomial trials with null hypothesis of nonzero risk difference or nonunity relative risk. Stat. Med. 9, 1447–1454 (1990). https://doi.org/10.1002/sim.4780091208
 17.
Pearson, K., III.: Mathematical contributions to the theory of evolution.—VIII. On the inheritance of characters not capable of exact quantitative measurement.—Part I. Introductory. Part II. On the inheritance of coatcolour in horses. Part III. On the inheritance of eyecolour in man. Philos. Trans. R. Soc. A. 195, 1–47 (1900). https://doi.org/10.1098/rsta.1900.0024
 18.
...Offen, W., ChuangStein, C., Dmitrienko, A., Littman, G., Maca, J., Meyerson, L., Muirhead, R., Stryszak, P., Boddy, A., Chen, K., CopleyMerriman, K., Dere, W., Givens, S., Hall, D., Henry, D., Jackson, J.D., Krishen, A., Liu, T., Ryder, S., Sankoh, A.J., Wang, J., Yeh, C.H.: Multiple coprimary endpoints: medical and statistical solutions. Drug Inform. J. 41, 31–46 (2007). https://doi.org/10.1177/009286150704100105
 19.
Sankoh, A.J., Huque, M.F., Russell, H.K., D’Agostino, R.B.: Global twogroup multiple endpoint adjustment methods applied to clinical trials. Ther. Innov. Regul. Sci. 33, 119–140 (1999). https://doi.org/10.1177/009286159903300115
 20.
Bhatti, M.I., Kim, J.H.: Towards a new paradigm for statistical evidence in the use of pvalue. Econometrics 9(1), 2 (2021). https://doi.org/10.3390/econometrics9010002
Acknowledgements
The authors would like to sincerely thank the editor-in-chief, associate editor and referees for their valuable comments about our paper.
Funding
The authors funded this research themselves.
Ethics declarations
Conflict of interest
The authors have declared no conflict of interest.
Appendix
We consider an ALRT for \(H^{(0)}_{0}\). Let \(\varvec{A}\) be the positive definite matrix such that \(\varvec{A}^{t}\varvec{A} = \hat{\varvec{\Sigma}}^{-1}\), where \(\hat{\varvec{\Sigma}}^{-1}\) is the inverse matrix of \(\hat{\varvec{\Sigma}}\). The statistic \(\varvec{u}_A = (u_{A1},\ldots ,u_{Ap})^{t} = \varvec{AX}\) is approximately distributed as a p-variate normal distribution with mean \(\varvec{A\Delta }\) and covariance matrix \(\varvec{I}\) (the identity matrix). Because \(\varvec{A}\) is not uniquely determined, for simplicity we represent \(\varvec{A}\) by the set of eigenvectors, each multiplied by the square root of the corresponding eigenvalue. Furthermore, following the procedure of Nakazuru et al. (2014), \(\varvec{B}\) is defined as the matrix obtained by substituting the off-diagonal elements of \(\varvec{A}\) with their absolute values. Consider the two transformations such that
Under these assumptions, the ALRT rejects \(H^{(0)}_{0}\) if and only if
where \(\overline{u}_{A}^{2}\) and \(\overline{u}_{B}^{2}\) are defined by
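The construction of \(\varvec{A}\) and \(\varvec{B}\) described above can be sketched numerically. The following is a minimal illustration in Python, assuming a hypothetical 2-by-2 estimated covariance matrix; the eigendecomposition representation of \(\varvec{A}\) follows the description in this appendix.

```python
import numpy as np

def transform_matrices(sigma_hat):
    """Return (A, B): A satisfies A^t A = Sigma_hat^{-1}, represented by the
    eigenvectors of Sigma_hat^{-1} scaled by the square roots of the
    corresponding eigenvalues; B replaces the off-diagonal elements of A
    with their absolute values (Nakazuru et al., 2014)."""
    eigvals, eigvecs = np.linalg.eigh(np.linalg.inv(sigma_hat))
    A = np.diag(np.sqrt(eigvals)) @ eigvecs.T        # rows: sqrt(lambda_i) v_i^t
    mask = np.eye(A.shape[0], dtype=bool)
    B = np.where(mask, A, np.abs(A))                 # |a_ij| off the diagonal
    return A, B

# Hypothetical estimated covariance matrix for two correlated endpoints
sigma_hat = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
A, B = transform_matrices(sigma_hat)
assert np.allclose(A.T @ A, np.linalg.inv(sigma_hat))  # A^t A = Sigma_hat^{-1}
```

Any matrix of the form \(\varvec{QA}\) with \(\varvec{Q}\) orthogonal would also satisfy \(\varvec{A}^{t}\varvec{A} = \hat{\varvec{\Sigma}}^{-1}\); the eigendecomposition choice simply fixes one convenient representative.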
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ishihara, T., Yamamoto, K. A Test for Multiple Binary Endpoints with Continuous Latent Distribution in Clinical Trials. J Stat Theory Appl 20, 463–480 (2021). https://doi.org/10.1007/s44199-021-00003-3
Keywords
 Latent variable
 Noninferiority
 Superiority