A Test for Multiple Binary Endpoints with Continuous Latent Distribution in Clinical Trials

In clinical trials, two or more binary responses obtained by dichotomizing continuous responses are often employed as multiple primary endpoints. Testing procedures for multiple binary variables with latent distribution have not yet been adequately discussed. Based on the association measure among latent variables, we provide a statistic for testing the superiority of at least one binary endpoint. In addition, we propose a testing procedure with a framework in which the trial efficacy is confirmed only when there is superiority of at least one endpoint and non-inferiority of the remaining endpoints. The performance of the proposed procedure is evaluated through simulations.


Introduction
In confirmatory clinical trials, several correlated binary response variables are used to assess the efficacy and safety of new treatments. The ICH E9 guideline [1] recommends that the primary endpoint should consist of only one variable that provides strong scientific evidence of treatment efficacy. However, in clinical trials for a variety of diseases, it is often useful to evaluate efficacy using multiple primary endpoints. For example, in clinical trials of patients with rheumatoid arthritis, the percentage of patients achieving short-term improvement of 20 percent in the American College of Rheumatology criteria (ACR20) and the percentage achieving long-term low disease activity [Disease Activity Score (DAS28-ESR) ≤ 3.2] are often used as primary endpoints (e.g., [2]). In clinical trials of patients with psoriasis, short- and long-term improvements are simultaneously assessed based on the percentage of patients with at least 75 percent improvement in the psoriasis area-and-severity index (PASI) score (e.g., [3]). In particular, binary endpoints are often used when it is more meaningful to diagnose improvement beyond clear standards than to assess the disease state using continuous variables. In such trials, we can consider that all primary endpoints are binary and often have a continuous latent distribution.
Most trials use multiple endpoints only to evaluate non-inferiority or superiority, but some trials have been conducted to confirm both the non-inferiority and the superiority of all endpoints. For example, a clinical trial to confirm the efficacy of four-factor prothrombin complex concentrate (4F-PCC) included two primary endpoints [4], namely the percentage of patients with a hemostatic effect and the percentage with a decrease in the international normalized ratio (INR). In that trial, superiority was evaluated only if there was non-inferiority for both endpoints. When we confirm not only non-inferiority but also superiority, the use of the closed testing procedure [5] for the primary analysis is reasonable, and in this case, no adjustment is needed to control the type I error rate. Sozu et al. [6][7][8] have already proposed a testing method for dealing with several endpoints in a trial and a method for calculating the sample size. Their theory is based on the framework of recognizing a treatment effect when the superiority of all primary endpoints is confirmed. Such a setting of endpoints is called "co-primary endpoints". In general, however, it is difficult to demonstrate the superiority of two or more endpoints because the power decreases as the number of endpoints increases. On the other hand, "multiple endpoints" are used in trials that recognize a treatment effect if superiority is shown for at least one of the endpoints. Developing a procedure that can confirm the superiority of at least one binary endpoint with a latent distribution is a challenge for statisticians in the design and analysis of clinical trials. Thus, the aim of this paper is to define a testing procedure within a framework in which the efficacy of a test treatment is confirmed only when the superiority of the treatment relative to control is evidenced for at least one endpoint, and non-inferiority is demonstrated for the remaining endpoints.
For multiple continuous endpoints, Perlman and Wu [9] proposed a testing procedure that is applicable to the framework mentioned above. Nakazuru et al. [10] proposed a more powerful testing procedure using the approximate likelihood ratio test (ALRT) defined by Glimm et al. [11]. However, methods for multiple binary endpoints remain underdeveloped. In the same framework as this study, Ishihara and Yamamoto [12] proposed a method using multiple binary endpoints; however, they did not assume a latent distribution for the binary variables. A statistic for testing the superiority of binary co-primary endpoints assuming latent variables has been developed by Sozu et al. [6]. Therefore, we herein propose a testing procedure that is appropriate when all endpoints are binary and have a latent distribution. Our procedure is based on the intersection-union test (IUT) procedure of Nakazuru et al. [10] and the method of Sozu et al. [6] for estimating correlations between binary endpoints that have latent variables. In particular, we consider two statistics, estimated under the null and the alternative hypothesis of the test of superiority, respectively. Since there has not yet been any discussion on whether the statistic obtained under the null hypothesis or the one obtained under the alternative hypothesis is better in practical terms, another purpose of this study is to discuss this issue.
This article is structured as follows. In Sect. 2, we define several assumptions, including the hypotheses of the testing procedure and the association between correlated binary endpoints when taking latent variables into consideration. In Sect. 3, we give two IUT statistics for the superiority of at least one endpoint and the non-inferiority of the remaining endpoints when all endpoints are binary and have a latent distribution. In Sect. 4, we provide a numerical experiment using Monte Carlo simulation to illustrate the behavior of the power and type I error rate of the proposed test. Regarding the power, the proposed statistics were compared with the closed testing procedure. In Sect. 5, we provide the results of the IUT based on an actual clinical trial. Finally, in Sect. 6, we summarize our findings and present concluding remarks. Based on the proposed statistics and the conducted simulations, we suggest how to use the statistics in a real clinical trial.

Statistical Setting
In this article, we focus on a randomized clinical trial comparing two treatment groups on p (≥ 2) endpoints. There are n_1 subjects in the test group and n_2 subjects in the control group. Let Y_ijk (i = 1, 2; j = 1, …, p; k = 1, …, n_i) denote the binary response variable for the jth primary endpoint of the ith treatment in the kth subject. Suppose that the vectors of binary response variables Y_ik = (Y_i1k, …, Y_ipk)^t follow a p-variate Bernoulli distribution with marginal probabilities π_ij = Pr(Y_ijk = 1), where the superscript "t" denotes transpose. In this setting, the correlation coefficient ρ_(i)jj' of the multivariate Bernoulli distribution is expressed as

ρ_(i)jj' = (π_(i)jj' − π_ij π_ij') / √{π_ij(1 − π_ij) π_ij'(1 − π_ij')},   (1)

where π_(i)jj' = Pr(Y_ijk = 1, Y_ij'k = 1) is the joint probability of the two response variables (Y_ijk, Y_ij'k). Note that the range of ρ_(i)jj' is equal to or narrower than (−1, 1) depending on the values of π_ij and π_ij' (Bahadur [13]). That is, writing q_ij = 1 − π_ij, ρ_(i)jj' is bounded below by max{−√(π_ij π_ij' / (q_ij q_ij')), −√(q_ij q_ij' / (π_ij π_ij'))} and above by min{√(π_ij q_ij' / (q_ij π_ij')), √(q_ij π_ij' / (π_ij q_ij'))}. Furthermore, we assume that the Y_ik are dichotomized versions of continuous unobservable responses Z_ik = (Z_i1k, …, Z_ipk)^t (i = 1, 2; k = 1, …, n_i). We also assume that the Z_ik are independently distributed as a standardized p-variate normal distribution whose correlations Corr(Z_ijk, Z_ij'k) for all j ≠ j' are the latent (tetrachoric) correlations corresponding to ρ_(i)jj'. For each latent variable Z_ijk, there is a single threshold g_ij = Φ^{-1}(1 − π_ij) (i = 1, 2; j = 1, …, p) that partitions the latent distribution, where Φ^{-1} is the inverse of the standard normal cumulative distribution function. Then, the binary response Y_ijk (i = 1, 2; j = 1, …, p; k = 1, …, n_i) can be defined as Y_ijk = 1 if Z_ijk ≥ g_ij and Y_ijk = 0 otherwise, so that Pr(Y_ijk = 1) = 1 − Φ(g_ij) = π_ij. Let Ȳ_ij (i = 1, 2; j = 1, …, p) be the sample proportion for the jth endpoint of the ith treatment. Let the true proportion vector of the ith treatment be π_i = (π_i1, …, π_ip)^t, with difference of proportions δ = (δ_1, …, δ_p)^t = π_1 − π_2, and let X = (Ȳ_11 − Ȳ_21, …, Ȳ_1p − Ȳ_2p)^t with covariance matrix Σ = Σ_1/n_1 + Σ_2/n_2, where the (j, j') element of Σ_i is π_ij(1 − π_ij) for j = j' and π_(i)jj' − π_ij π_ij' for j ≠ j'. Note that X is approximately normally distributed with mean δ and covariance matrix Σ.
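The data-generating mechanism described above is straightforward to sketch in code. The following is a minimal illustration, not taken from the paper (function and variable names are ours): a standardized multivariate normal latent vector is drawn and each component is dichotomized at the cut-off g_ij = Φ^{-1}(1 − π_ij).

```python
import numpy as np
from scipy.stats import norm

def simulate_binary_endpoints(n, pi, latent_corr, rng):
    """Draw n subjects with p correlated binary endpoints by
    dichotomizing a standardized multivariate normal latent vector.
    pi          : length-p array of marginal probabilities pi_ij
    latent_corr : p x p latent correlation matrix of Z_ik
    """
    p = len(pi)
    z = rng.multivariate_normal(np.zeros(p), latent_corr, size=n)  # latent Z_ik
    g = norm.ppf(1.0 - np.asarray(pi))                             # cut-offs g_ij
    return (z >= g).astype(int)                                    # Y_ijk = 1{Z_ijk >= g_ij}

rng = np.random.default_rng(1)
corr = np.array([[1.0, 0.4], [0.4, 1.0]])
y = simulate_binary_endpoints(100_000, [0.6, 0.5], corr, rng)
print(y.mean(axis=0))  # empirical marginals, close to [0.6, 0.5]
```

With 100,000 draws the empirical marginal proportions are close to the targets, while the correlation of the resulting binary variables is smaller than the latent correlation 0.4, in line with the restricted Bahadur range noted above.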

Hypotheses
Without loss of generality, we assume that test treatment superiority is recognized when the proportion of responses to the test treatment is greater than that to the control treatment. That is, a maximum value of δ_j (j = 1, …, p) greater than 0 indicates an improvement in at least one endpoint of the test treatment compared with the control treatment. In the framework dealt with in this study, a test treatment effect is recognized only when the null hypothesis for the superiority of at least one endpoint and the null hypotheses for all the non-inferiorities are rejected simultaneously. In such a framework, the null hypotheses of superiority and non-inferiority are represented by a union. Therefore, we consider the combined hypothesis for the superiority of at least one endpoint and the non-inferiority of the remaining endpoints. We consider a null hypothesis H_0 and an alternative hypothesis H_1 expressed by

H_0: max_{1≤j≤p} δ_j ≤ 0 or δ_j + Δ_j ≤ 0 for some j,
H_1: max_{1≤j≤p} δ_j > 0 and δ_j + Δ_j > 0 for all j = 1, …, p,

where Δ_j > 0 (j = 1, …, p) is the non-inferiority margin of the jth endpoint, a prespecified positive constant. Furthermore, the non-inferiority part of H_0 can be expressed as a union of null hypotheses of non-inferiority for the individual endpoints, since it means that at least one δ_j + Δ_j (j = 1, …, p) is less than or equal to zero. Therefore, H_0 is also expressed as the union H^(0)_0 ∪ H^(1)_0 ∪ ⋯ ∪ H^(p)_0, which comprises the sub-hypothesis of superiority "H^(0)_0: max_{1≤j≤p} δ_j ≤ 0" and the sub-hypotheses of non-inferiority "H^(j)_0: δ_j + Δ_j ≤ 0" (j = 1, …, p). H^(0)_0 is adaptable to the one-sided ALRT, and the IUT (Berger [14]) can be applied to test H_0.

New Test Statistics
To determine the IUT statistics, we need to estimate g_ij, π_ij, and ρ_(i)jj', taking the latent variable underlying Y_ijk into account. In Subsect. 3.1 below, we propose an estimation procedure for these parameters.

Proposed Estimating Procedure
For the sake of simplicity, the process of estimating parameters is divided into the following two steps.

Step 1. Estimating the Cut-Off Point g ij
We assume that ĝ_ij is the estimator of the latent cut-off point g_ij, estimated as ĝ_ij = Φ^{-1}(1 − π̃_ij), where π̃_ij is the maximum likelihood estimator (MLE) of the marginal probability derived from the p-variate Bernoulli distribution. The probability mass function of the p-variate Bernoulli distribution is parameterized by the cell probabilities π_(i)s_1,s_2,…,s_p, where (s_1, s_2, …, s_p) ranges over {0, 1}^p, the set whose elements consist of all patterns of response values.
In addition, π̃_ij (and hence ĝ_ij) can be obtained in two ways, depending on whether π_(i)s_1,s_2,…,s_p is estimated under the sub-null hypothesis H^(0)_0 or the sub-alternative hypothesis H^(0)_1. In particular, under the sub-null hypothesis H^(0)_0, the Lagrange multiplier method is useful for obtaining the MLE. On the other hand, under the sub-alternative hypothesis H^(0)_1, the estimator π̂_(i)s_1,s_2,…,s_p is obtained in closed form as a sample proportion.

Step 2. Estimating the Joint and Marginal Probabilities
The estimator of the joint probability π_(i)jj' in (1) is also given in two ways, depending on ĝ_ij, which is obtained from the estimate of π_(i)s_1,s_2,…,s_p used to construct π̃_ij. The estimator π̂_(i)jj' can be given by

π̂_(i)jj' = ∫_{ĝ_ij}^{∞} ∫_{ĝ_ij'}^{∞} φ_2(z_j, z_j'; ρ̂_(i)jj') dz_j' dz_j,

where φ_2(·, ·; ρ̂_(i)jj') is the joint density function of the corresponding components of Z_ik, z_1, …, z_p are random variables following the standard p-variate normal distribution, and ρ̂_(i)jj' is Pearson's tetrachoric correlation (Pearson [15]). Therefore, if the latent distribution of the binary responses is assumed to be a standardized multivariate normal distribution, π̂_(i)jj' is determined by ρ̂_(i)jj' and the cut-off points given in Step 1. Furthermore, the estimator of the marginal probability π_ij used to construct Σ and ρ_(i)jj' in (1) should not be π̃_ij, which is obtained from the p-variate Bernoulli distribution; it should instead take the latent distribution function into account. Let π̂_ij denote this estimator of π_ij, given as the probability that the corresponding latent variable exceeds the cut-off ĝ_ij under the fitted latent distribution. For example, with p = 2 endpoints, the estimator of π_(i)12 is written as

π̂_(i)12 = ∫_{ĝ_i1}^{∞} ∫_{ĝ_i2}^{∞} φ_2(z_1, z_2; ρ̂_(i)12) dz_2 dz_1,

and the marginal probabilities π_i1 and π_i2 are estimated analogously as π̂_i1 = Pr(Z_i1k ≥ ĝ_i1) and π̂_i2 = Pr(Z_i2k ≥ ĝ_i2) under the fitted latent distribution. Along with the estimation of the joint and marginal probabilities, the estimated covariance matrix Σ̂ is obtained by substituting π̂_(i)jj' and π̂_ij into the definition of Σ.
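The tetrachoric part of Step 2 can be sketched numerically: given the Step 1 cut-offs, we solve for the latent correlation whose bivariate normal upper-orthant probability matches the observed joint proportion. This is a minimal illustration under that assumption (function names are ours; the paper estimates these quantities within the likelihood rather than by root-finding):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(p1, p2, p12):
    """Solve for the latent correlation rho such that
    P(Z1 >= g1, Z2 >= g2) under a standard bivariate normal with
    correlation rho equals the observed joint proportion p12,
    where g_j = Phi^{-1}(1 - p_j) are the Step 1 cut-offs."""
    g1, g2 = norm.ppf(1 - p1), norm.ppf(1 - p2)

    def orthant(rho):
        # P(Z1>=g1, Z2>=g2) = 1 - Phi(g1) - Phi(g2) + Phi2(g1, g2; rho)
        cov = np.array([[1.0, rho], [rho, 1.0]])
        return 1 - norm.cdf(g1) - norm.cdf(g2) + multivariate_normal(cov=cov).cdf([g1, g2])

    return brentq(lambda r: orthant(r) - p12, -0.999, 0.999)

# independence check: joint proportion = product of marginals (0.6 * 0.5)
print(tetrachoric(0.6, 0.5, 0.30))  # approximately 0
```

When the joint proportion equals the product of the marginals, the solved correlation is approximately zero; a joint proportion above the product yields a positive tetrachoric correlation.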

Proposed Test Statistics
We consider the following new IUT statistics to test hypothesis H 0 versus H 1 .
Here, T^(0) is the one-sided ALRT statistic for H^(0)_0 based on X and Σ̂, and the non-inferiority statistic is

T^(j) = (Ȳ_1j − Ȳ_2j + Δ_j) / √{π́_1j(1 − π́_1j)/n_1 + π́_2j(1 − π́_2j)/n_2},

where π́_ij is the MLE of π_ij derived under the sub-null hypothesis of non-inferiority H^(j)_0, and T^(j) is the statistic commonly used in non-inferiority tests of binary endpoints (Farrington and Manning [16]). T^(0) and T^(j) are the test statistics corresponding to the null hypotheses for superiority H^(0)_0 and non-inferiority H^(j)_0 (j = 1, …, p), respectively. The proposed IUT rejects H_0 if and only if T^(0) > c and T^(j) > z_α for all j = 1, …, p, where c is a constant determined so that the mixture of χ² tail probabilities with weights u_A and u_B equals α; here, χ²_j denotes the χ² distribution with j degrees of freedom, and χ²_0 is defined as the constant zero. α is the nominal significance level, and z_α is the upper 100α-th percentile of the standard normal distribution. See the Appendix for the derivation of u_A and u_B.
Note that the test statistic T^(0) can be constructed in two ways. One uses Σ̂ derived under the sub-null hypothesis H^(0)_0, and the other uses Σ̂ derived under the sub-alternative hypothesis H^(0)_1. Thus, there are also two types of IUT statistics. Let the IUT using T^(0) estimated under the sub-null hypothesis H^(0)_0 be the T^(0)_0 test type, and that using T^(0) estimated under the sub-alternative hypothesis H^(0)_1 be the T^(0)_1 test type.

Simulation Study
In order to evaluate the performance of the proposed IUT, we use Monte Carlo simulation to calculate the type I error rate and the power of the T^(0)_0 and T^(0)_1 test types. In all simulations, we consider n_1 = n_2 = 50, 100, 200, non-inferiority margins Δ_1 = Δ_2 = 0.2, and α = 0.05. Random numbers are generated from a standardized p-variate normal distribution, and the response variables are obtained by dichotomizing the random numbers using π_i = (π_i1, …, π_ip)^t. The correlation between the latent variables is set to ρ = 0, 0.4, 0.8.

Type I Error Rate
We compare the type I error rates of the T^(0)_0 test type and the T^(0)_1 test type in the case p = 2. The generation of simulated data is repeated 1,000,000 times. Table 1 shows the type I error rates for the two test types in the case p = 2. The type I error rate is greater than the nominal significance level for the T^(0)_1 test type when the correlation between the endpoints is zero and the sample size is large. The T^(0)_0 test type is more conservative than the T^(0)_1 test type. The type I error rate is lower when π_ij is close to zero than when π_ij is 0.5. In the case where at least one of the differences in π_ij is less than zero and within the non-inferiority margin, the type I error rate is less than when the difference in π_ij is zero, and it decreases markedly as the difference in π_ij approaches the non-inferiority margin. The type I error rate is much smaller when there are inferior endpoints.
In particular, for the scenario where inflation is likely to occur in the case p = 2 , we generated simulation data 100,000 times and also checked the type I error rate in the case p = 3.
Journal of Statistical Theory and Applications (2021) 20:463-480

Table 2 shows the type I error rates for the two test types in the case p = 3. When the sample size is large and the correlations among the three endpoints are high, the type I error rates for both the T^(0)_0 and T^(0)_1 test types are greater than the nominal significance level.

Power
We compare the powers of the T^(0)_0 and T^(0)_1 test types for the proposed IUT in the case p = 2. We also compare the power of the proposed IUT with that of a closed testing procedure that confirms the superiority of at least one of the two endpoints after the non-inferiority of both endpoints is confirmed. The Bonferroni-corrected p-value (Bonferroni [17]) is used to test for superiority in the closed testing procedure. The generation of simulated data is repeated 100,000 times. Table 3 shows the empirical powers of the proposed IUT and the closed testing procedure. The power of the T^(0)_1 test type is greater than that of the T^(0)_0 test type, and the difference becomes larger as the correlation between the endpoints increases when the sample size is small. Even when the difference between the endpoints increases, the relationship between the power of the T^(0)_0 test type and that of the T^(0)_1 test type does not change much. On the other hand, as the sample size increases, the power of the T^(0)_0 test type becomes similar to that of the T^(0)_1 test type. Furthermore, the power of the proposed IUT is always greater than that of the closed testing procedure.
Even if the differences in π_ij are identical to each other, the power tends to be large when π_ij is close to zero, except when the correlation is zero and the sample size is small. On the other hand, if at least one endpoint is superior and the differences for all remaining endpoints are less than zero but within the non-inferiority margin, the power is lower than when all of the differences in π_ij are greater than zero. Furthermore, when there is an endpoint that is inferior, the power becomes quite small. The power does not monotonically increase or decrease depending on the correlation coefficient. The results for p = 2 and p = 3 in Subsect. 4.3 below show that as the number of endpoints that differ between the two groups increases, the power of the closed testing procedure becomes noticeably lower than that of the proposed IUT.
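The closed-testing comparator used above can be sketched as follows. This is a simplified version with Wald standard errors and a Bonferroni-corrected superiority step (the paper's simulations use the Farrington-Manning statistic for the non-inferiority part), so it is illustrative only; all names are ours.

```python
import numpy as np
from scipy.stats import norm

def closed_testing(p1_hat, p2_hat, n1, n2, margin, alpha=0.05):
    """Declare efficacy only if all endpoints are non-inferior at
    one-sided level alpha and at least one endpoint is superior at
    the Bonferroni-corrected level alpha / p (Wald approximations)."""
    p1_hat, p2_hat = np.asarray(p1_hat, float), np.asarray(p2_hat, float)
    se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_noninf = (p1_hat - p2_hat + margin) / se   # non-inferiority statistics
    z_sup = (p1_hat - p2_hat) / se               # superiority statistics
    p = len(p1_hat)
    noninf_all = np.all(z_noninf > norm.ppf(1 - alpha))
    sup_any = np.any(z_sup > norm.ppf(1 - alpha / p))
    return bool(noninf_all and sup_any)

print(closed_testing([0.70, 0.65], [0.40, 0.50], 200, 200, 0.15))  # True
```

Testing superiority only after all non-inferiority hypotheses are rejected is what keeps the familywise error rate controlled without further adjustment of the non-inferiority step.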

Power is Reduced when Non-Inferiority Test is Added to a Superiority Test
We also compare the performance of the proposed IUT and a test excluding non-inferiority in the cases p = 2 and p = 3. The generation of simulated data is repeated 100,000 times. Table 4 shows a power comparison between the proposed IUT and the superiority test alone for the case p = 2. If all the differences in π_ij are greater than or equal to zero, then as the sample size increases, the powers of the superiority test alone and the IUT remain similar regardless of the value of the correlation coefficient. On the other hand, when the differences in π_ij between the two groups are partially within the non-inferiority margin, the power of the IUT is low compared with the superiority test alone. When π_ij is close to zero, the powers of the superiority test alone and the IUT are more similar than when π_ij is close to 0.5. Table 5 shows a power comparison between the proposed IUT and the superiority test alone for the case p = 3. As in the case p = 2, if all the differences in π_ij are greater than or equal to zero, adding a non-inferiority test does not reduce the power much when the sample size is large. Conversely, the larger the number of endpoints for which the differences in π_ij are partially within the non-inferiority margin, the greater the reduction in the power of the IUT.

Numerical Example
We present the results of applying the proposed IUT to an actual trial that confirmed the efficacy of 4F-PCC [4]. The clinical trial is a multicentre, randomized, open-label, phase III trial in patients aged 18 years or older needing rapid vitamin K antagonist reversal before an urgent surgical or invasive procedure. As mentioned in Sect. 1, this study includes two primary endpoints, namely (i) the percentage of patients with a hemostatic effect, defined as a binary variable based on predicted blood loss, and (ii) the percentage with a decrease in the INR. The analyses were intended to evaluate, in a hierarchical fashion, first non-inferiority for both endpoints, then superiority if non-inferiority was achieved. Based on the results of the study (for details, see [4]), Table 6 provides the values of T^(0)_1, T^(1), and T^(2), together with the results of the IUT (rejected or accepted) at significance level α = 0.05.
Although showing non-inferiority for all endpoints was the first step in the actual trial, even if the test had simultaneously taken the superiority of at least one endpoint into account, the same conclusion would have been obtained for any estimate of the correlation coefficient.

Concluding Remarks
In this article, we developed a testing procedure for studies with multiple binary endpoints that have a latent distribution. This was done within a framework in which the efficacy of a test treatment is recognized when at least one endpoint demonstrates superiority and the remaining endpoints demonstrate non-inferiority. We derived two types of test statistics using cut-off points estimated under the sub-null hypothesis H^(0)_0 and the sub-alternative hypothesis H^(0)_1, and these procedures were compared in a numerical experiment using Monte Carlo simulation.
The numerical experiment clearly demonstrated that the T^(0)_1 test type was always more powerful than the T^(0)_0 test type. However, α-violation occurred in the T^(0)_1 test type when the sample size was large and the correlation coefficient was zero in the case p = 2. As the number of endpoints increased, we also found that α-violation was more likely to occur in the scenario where all differences between the endpoints were zero. On the other hand, α-violation did not occur when any of the endpoints was inferior. We believe that this will not be a fatal problem in practice because the framework of this study assumes that none of the endpoints is inferior. However, since α-violation is a serious issue in confirmatory clinical trials, it is necessary to develop a method that does not cause α-violation, or we should choose a non-inflationary test when the number of endpoints is large. Since there was not a large difference in power between the T^(0)_0 and T^(0)_1 test types, it may be reasonable to preferentially use the T^(0)_0 test type, especially if the correlation coefficients between the endpoints have not been investigated. Furthermore, this study showed a marked decrease in power as the number of endpoint differences lying within the non-inferiority margin increased. When the number of superior endpoints was held fixed, the power did not decrease as the total number of endpoints was increased.
The power does not monotonically increase or decrease with the correlation coefficient between the endpoints. When determining the sample size, it is therefore difficult to know whether the power will increase if the correlation is changed under a fixed difference in the endpoints. A further non-negligible result is that type I error rates tend to increase when the correlation is low and the sample size is large. We also found that, as the number of endpoints increases, type I error inflation becomes more likely when the correlation is high. For a less problematic sample size determination, the correlations between endpoints should be accurately investigated before trial planning. However, in practice, it is difficult to estimate the correlation coefficients between endpoints during trial planning. Therefore, we recommend that power be simulated assuming several correlation coefficients under fixed proportions for each endpoint, and that the most conservative sample size among all scenarios be used in the trial. According to Offen et al. [18] and Sankoh et al. [19], the correlation coefficients between multiple endpoints in clinical trials are approximately equal to 0.4 and range from 0.2 to 0.8, which may help in setting up the simulation scenarios. For example, if we assume that the proportions of the two responses are 0.6 for the treatment group and 0.5 for the control group, the power will exceed 0.8 if the correlation coefficient is zero, but will fall below 0.8 if the correlation coefficient is 0.4 to 0.8. Therefore, assuming a correlation of 0.8 in this case would be a conservative design.
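The recommended workflow, simulating power over several plausible correlations and adopting the most conservative scenario, can be sketched as follows. For brevity this uses a superiority-only Bonferroni test rather than the full IUT, so the numbers are purely illustrative; for such a union test the power declines as the latent correlation grows, consistent with the example above (the full IUT need not be monotone in ρ). All names and parameter values are ours.

```python
import numpy as np
from scipy.stats import norm

def power_at_rho(rho, n=250, pi1=(0.6, 0.6), pi2=(0.5, 0.5),
                 alpha=0.05, reps=2000, seed=0):
    """Monte Carlo power of a simplified Bonferroni test for the
    superiority of at least one of two endpoints, with binary data
    produced by dichotomizing latent bivariate normal variables."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    g1 = norm.ppf(1 - np.asarray(pi1))   # cut-offs, test group
    g2 = norm.ppf(1 - np.asarray(pi2))   # cut-offs, control group
    crit = norm.ppf(1 - alpha / 2)       # Bonferroni over p = 2 endpoints
    hits = 0
    for _ in range(reps):
        y1 = (rng.multivariate_normal([0, 0], cov, n) >= g1).mean(axis=0)
        y2 = (rng.multivariate_normal([0, 0], cov, n) >= g2).mean(axis=0)
        se = np.sqrt(y1 * (1 - y1) / n + y2 * (1 - y2) / n)
        hits += np.any((y1 - y2) / se > crit)
    return hits / reps

powers = {rho: power_at_rho(rho) for rho in (0.0, 0.4, 0.8)}
print(powers)  # power shrinks as the latent correlation grows
```

Picking the sample size that achieves the target power under the least favorable correlation in such a grid gives the conservative design recommended above.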
Incidentally, like the proposed testing procedure, a closed testing procedure can be used for multiple endpoints where the familywise error rate is kept below the nominal significance level. In the framework of this study, the proposed IUT was shown to be more powerful than the closed testing procedure regardless of the correlation coefficient between endpoints, the difference between endpoints, the number of noticeably different endpoints, the sample size, and whether or not the proportions are close to zero. Although the closed testing procedure has a significant advantage in that it does not require control of the type I error rate in individual tests when there are inclusion relationships between null hypotheses, it may be more reasonable to use the proposed IUT in the framework of this study, where the superiority of at least one endpoint and the non-inferiorities of the remaining endpoints are confirmed simultaneously.
We also demonstrated a power reduction when the non-inferiority test was added to the superiority test. Our simulations showed that there was only a minimal decrease in power when the proportions of responses to the test treatment were all, or mostly, higher than those to the control treatment. When the proportions were close to zero, the power was almost the same for the superiority test alone and the proposed IUT. By contrast, the power was reduced when the treatment effects were partially within the non-inferiority margin and the sample size was small. In particular, the power decreased remarkably with the number of differences in the primary endpoints between the two groups that fell within the non-inferiority margin. Note that in such a situation, a large sample size is needed to detect the difference. The development of more efficient methods with higher power in such cases is required in the future.
Furthermore, the smaller the correlation coefficient, the lower the power of the proposed method in comparison with a procedure that tests only superiority. Nevertheless, when all endpoints are binary and have a continuous latent distribution, it is ideal in practice for the primary analysis to confirm not only the superiority of at least one endpoint but also the non-inferiority of all remaining endpoints, using the proposed testing procedure with a sample size determined under assumed differences in proportions and correlations between the endpoints.
In the fields of economics, business, and education, a paradigm shift toward methods that do not use p-values to establish statistical evidence continues to be proposed (Bhatti and Kim [20]). Currently, owing to drug approval regulations, it is unavoidable that the efficacy of a drug is determined by the p-value of the confirmatory trial. However, in the future, we will also need to develop methods that do not rely on p-values in order to draw conclusions even in the exploratory phase.