
A Test for Multiple Binary Endpoints with Continuous Latent Distribution in Clinical Trials

Abstract

In clinical trials, two or more binary responses obtained by dichotomizing continuous responses are often employed as multiple primary endpoints. Testing procedures for multiple binary variables with a continuous latent distribution have not yet been adequately discussed. Based on the association measure among the latent variables, we provide a statistic for testing the superiority of at least one binary endpoint. In addition, we propose a testing procedure within a framework in which trial efficacy is confirmed only when superiority is shown for at least one endpoint and non-inferiority is shown for the remaining endpoints. The performance of the proposed procedure is evaluated through simulations.

Introduction

In confirmatory clinical trials, several correlated binary response variables are used to assess the efficacy and safety of new treatments. The ICH E9 guideline [1] recommends that the primary endpoint should consist of only one variable that provides strong scientific evidence of treatment efficacy. However, in clinical trials for a variety of diseases, it is often useful to evaluate efficacy using multiple primary endpoints. For example, in clinical trials of patients with rheumatoid arthritis, the percentage of patients achieving a short-term 20 percent improvement in the American College of Rheumatology criteria (ACR20) and the percentage achieving long-term low disease activity [Disease Activity Score (DAS28-ESR) ≤ 3.2] are often used as primary endpoints (e.g., [2]). In clinical trials of patients with psoriasis, short- and long-term improvements are simultaneously assessed based on the percentage of patients with at least 75 percent improvement in the Psoriasis Area and Severity Index (PASI) score (e.g., [3]). In particular, binary endpoints are often used when it is more meaningful to judge whether improvement exceeds a clearly defined threshold than to assess the disease state on a continuous scale. In such trials, we can consider that all primary endpoints are binary and often have a continuous latent distribution.

Most trials use multiple endpoints only to evaluate non-inferiority or superiority, but some trials have been conducted to confirm both the non-inferiority and the superiority of all endpoints. For example, a clinical trial to confirm the efficacy of four-factor prothrombin complex concentrate (4F-PCC) included two primary endpoints [4], namely the percentage of patients with a hemostatic effect and the percentage with a decrease in the international normalized ratio (INR). In that trial, superiority was evaluated only if non-inferiority was shown for both endpoints. When we confirm not only non-inferiority but also superiority, the use of the closed testing procedure [5] for the primary analysis is reasonable, and in this case, no adjustment is needed to control the type I error rate. Sozu et al. [6,7,8] have already proposed a testing method for dealing with several endpoints in a trial and a method for calculating the sample size. Their theory is based on the framework of recognizing a treatment effect only when the superiority of all primary endpoints is confirmed. Such a setting of endpoints is called “co-primary endpoints”. In general, however, it is difficult to demonstrate the superiority of two or more endpoints because the power decreases as the number of endpoints increases. On the other hand, “multiple endpoints” are used in trials that recognize a treatment effect if the treatment is superior for at least one of the endpoints. Developing a procedure that can confirm the superiority of at least one binary endpoint with a latent distribution is a challenge for statisticians in the design and analysis of clinical trials. Thus, the aim of this paper is to define a testing procedure within a framework in which the efficacy of a test treatment is confirmed only when the superiority of the treatment relative to control is evidenced for at least one endpoint and non-inferiority is demonstrated for the remaining endpoints.

For multiple continuous endpoints, Perlman and Wu [9] proposed a testing procedure that is applicable to the framework mentioned above. Nakazuru et al. [10] proposed a more powerful testing procedure using the approximate likelihood ratio test (ALRT) defined by Glimm et al. [11]. However, there has been inadequate development of methods for multiple binary endpoints. In the same framework as this study, Ishihara and Yamamoto [12] proposed a method using multiple binary endpoints; however, they did not assume a latent distribution for the binary variables. A statistic for testing the superiority of binary co-primary endpoints assuming latent variables has been developed by Sozu et al. [6]. Therefore, we herein propose a testing procedure that is appropriate when all endpoints are binary and have a latent distribution. Our procedure is based on the intersection-union test (IUT) procedure proposed by Nakazuru et al. [10] and on the method of Sozu et al. [6] for estimating correlations between binary endpoints that have latent variables. In particular, we consider two statistics, estimated under the null and the alternative hypothesis, for the test of superiority. Since there has not yet been any discussion of whether the statistic obtained under the null hypothesis or that obtained under the alternative hypothesis is preferable in practice, another purpose of this study is to discuss this issue.

This article is structured as follows. In Sect. 2, we define several assumptions, including the hypotheses regarding the testing procedure and the association between correlated binary endpoints when latent variables are taken into consideration. In Sect. 3, we give two IUT statistics for the superiority of at least one endpoint and the non-inferiority of the remaining endpoints when all endpoints are binary and have a latent distribution. In Sect. 4, we provide a numerical experiment using Monte Carlo simulation to illustrate the behavior of the power and type I error rate of the proposed test. Regarding the power, the proposed statistics were compared with the closed testing procedure. In Sect. 5, we provide the results of applying the IUT to an actual clinical trial. Finally, in Sect. 6, we summarize our findings and present concluding remarks. Based on the proposed statistics and the conducted simulations, we suggest how to use the statistics in a real clinical trial.

Assumption and Hypotheses

Statistical Setting

In this article, we focus on a randomized clinical trial comparing two treatment groups with respect to \(p \ (\ge 2)\) endpoints. There are \(n_1\) subjects in the test group and \(n_2\) subjects in the control group. Let \(Y_{ijk} \ (i=1,2; j=1,\ldots ,p; k=1,\ldots ,n_i)\) denote the binary response variable of the jth primary endpoint of the ith treatment in the kth subject. Suppose that the vectors of binary response variables \(\varvec{Y}_{ik} = (Y_{i1k},\ldots ,Y_{ipk})^{t} \ (i=1,2; k=1,\ldots ,n_i)\) are independently distributed as a p-variate Bernoulli distribution with \(\mathrm {E}(Y_{ijk}) = \pi _{ij}\), \(\mathrm {V}(Y_{ijk}) = \pi _{ij}(1-\pi _{ij})\), and \(\mathrm {Corr}(Y_{ijk},Y_{ij'k}) = \rho _{(i)jj'}\) for all \(j \ne j'\), where the superscript \(``t''\) denotes transpose. In this setting, the correlation coefficient \(\rho _{(i)jj'}\) of the multivariate Bernoulli distribution is expressed as

$$\begin{aligned} \rho _{(i)jj'} = \frac{\phi _{(i)jj'} - \pi _{ij} \pi _{ij'}}{\sqrt{\pi _{ij}(1-\pi _{ij})} \sqrt{\pi _{ij'}(1-\pi _{ij'})}}, \end{aligned}$$
(1)

where \(\phi _{(i)jj'}\) is the joint probability of the two response variables \((Y_{ijk}, Y_{ij'k})\). Note that the attainable range of \(\rho _{(i)jj'}\) is equal to or narrower than \((-1, 1)\), depending on the values of \(\pi _{ij}\) and \(\pi _{ij'}\) (Bahadur [13]). That is, \(\rho _{(i)jj'}\) is bounded below by

$$\begin{aligned} \displaystyle \max \left( -\sqrt{\frac{\pi _{ij}\pi _{ij'}}{(1-\pi _{ij})(1-\pi _{ij'})}}, -\sqrt{\frac{(1-\pi _{ij})(1-\pi _{ij'})}{\pi _{ij}\pi _{ij'}}} \right) , \end{aligned}$$
(2)

and above by

$$\begin{aligned} \displaystyle \min \left( \sqrt{\frac{\pi _{ij}(1-\pi _{ij'})}{\pi _{ij'}(1-\pi _{ij})}}, \sqrt{\frac{\pi _{ij'}(1-\pi _{ij})}{\pi _{ij}(1-\pi _{ij'})}} \right) . \end{aligned}$$
(3)
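As a quick numerical check of the bounds in (2) and (3), the following minimal sketch computes the attainable correlation range for given marginal probabilities (the function name and example values are illustrative and not part of the proposed method).

```python
import numpy as np

def bahadur_bounds(pi_j, pi_jp):
    """Lower and upper limits (2)-(3) of the Bernoulli correlation for given marginals."""
    q_j, q_jp = 1.0 - pi_j, 1.0 - pi_jp
    lower = max(-np.sqrt(pi_j * pi_jp / (q_j * q_jp)),
                -np.sqrt(q_j * q_jp / (pi_j * pi_jp)))
    upper = min(np.sqrt(pi_j * q_jp / (pi_jp * q_j)),
                np.sqrt(pi_jp * q_j / (pi_j * q_jp)))
    return lower, upper

# e.g. pi_j = 0.6, pi_j' = 0.3 yields approximately (-0.80, 0.53), narrower than (-1, 1)
print(bahadur_bounds(0.6, 0.3))
```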

Furthermore, we assume that \(\varvec{Y}_{ik}\) are dichotomized random variables of continuous unobservable responses \(\varvec{Z}_{ik} = (Z_{i1k},\ldots ,Z_{ipk})^{t} \ (i=1,2; k=1,\ldots ,n_i)\). We also assume that \(\varvec{Z}_{ik}\) are independently distributed as a standardized p-variate normal distribution with \(\mathrm {Corr}(Z_{ijk},Z_{ij'k}) = \gamma _{(i)jj'}\) for all \(j \ne j'\). For each component \(Z_{ijk}\) of \(\varvec{Z}_{ik}\), there is a single threshold \(g_{ij} = \Phi ^{-1}(1-\pi _{ij}) \ (i=1,2; j=1,\ldots ,p)\) that partitions the latent distribution, where \(\Phi ^{-1}\) is the inverse function of the standard normal cumulative distribution function. Then, the binary response \(Y_{ijk} \ (i =1,2; j=1,\ldots ,p; k=1,\ldots ,n_i)\) can be defined as

$$\begin{aligned} Y_{ijk}=\left\{ \begin{array}{ll} 1, &{} Z_{ijk} \ge g_{ij} \\ 0, &{} Z_{ijk} < g_{ij}. \\ \end{array} \right. \end{aligned}$$

Set \(\varvec{X} = (X_1,\ldots ,X_p)^{t}\) with \(X_j = (\overline{Y}_{1j}-\overline{Y}_{2j}) \ (j=1,\ldots ,p)\), where \(\overline{Y}_{ij} \ (i =1,2; j=1,\ldots ,p)\) is the sample proportion for the jth endpoint of the ith treatment. Let the true proportion vector for the ith treatment be \(\varvec{\pi }_i = (\pi _{i1},\ldots ,\pi _{ip})^t\), with difference of proportions \(\varvec{\Delta } = (\delta _1,\ldots ,\delta _p )^t = \varvec{\pi }_1 - \varvec{\pi }_2\) and covariance matrix \(\varvec{\Sigma }\) defined as follows:

$$\begin{aligned} \varvec{\Sigma }= & {} \varvec{\Sigma }^{(1)} + \varvec{\Sigma }^{(2)} \\= & {} \frac{1}{n_{1}}\left( \begin{array}{ccc} \pi _{11}(1-\pi _{11}) &{} \cdots &{} \rho _{(1)1p} \sqrt{\pi _{11}(1-\pi _{11})} \sqrt{\pi _{1p}(1-\pi _{1p})}\\ \vdots &{} \ddots &{} \vdots \\ \rho _{(1)p1} \sqrt{\pi _{11}(1-\pi _{11})} \sqrt{\pi _{1p}(1-\pi _{1p})} &{} \cdots &{} \pi _{1p}(1-\pi _{1p}) \\ \end{array} \right) \\+ & {} \frac{1}{n_{2}}\left( \begin{array}{ccc} \pi _{21}(1-\pi _{21}) &{} \cdots &{} \rho _{(2)1p} \sqrt{\pi _{21}(1-\pi _{21})} \sqrt{\pi _{2p}(1-\pi _{2p})}\\ \vdots &{} \ddots &{} \vdots \\ \rho _{(2)p1} \sqrt{\pi _{21}(1-\pi _{21})} \sqrt{\pi _{2p}(1-\pi _{2p})} &{} \cdots &{} \pi _{2p}(1-\pi _{2p}) \\ \end{array} \right) , \end{aligned}$$

where \(\varvec{\Sigma }^{(i)} (i=1,2)\) is the covariance matrix of \((\overline{Y}_{i1},\ldots ,\overline{Y}_{ip})^{t}\). Note that \(\varvec{X}\) is approximately normally distributed with mean \(\varvec{\Delta }\) and covariance matrix \(\varvec{\Sigma }\).
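The following minimal simulation sketch illustrates this data-generating mechanism for the two groups; the exchangeable latent correlation and all parameter values are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_group(n, pi, gamma):
    """Draw n subjects: a standardized p-variate normal latent response dichotomized
    at the cut-off points g_j = Phi^{-1}(1 - pi_j)."""
    p = len(pi)
    corr = np.full((p, p), gamma)
    np.fill_diagonal(corr, 1.0)
    z = rng.multivariate_normal(np.zeros(p), corr, size=n)   # latent responses Z_ik
    g = norm.ppf(1.0 - np.asarray(pi))                       # cut-off points g_ij
    return (z >= g).astype(int)                              # binary responses Y_ik

y1 = simulate_group(100, [0.6, 0.5], 0.4)    # test group
y2 = simulate_group(100, [0.5, 0.4], 0.4)    # control group
x = y1.mean(axis=0) - y2.mean(axis=0)        # difference of sample proportions X
```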

Hypotheses

Without loss of generality, we assume that test treatment superiority is recognized when the proportion of responses to the test treatment is greater than that to the control treatment. That is, a maximum value of \(\delta _j \ (j=1,\ldots ,p)\) greater than 0 indicates an improvement of at least one endpoint of the test treatment compared to the control treatment. In the framework dealt with in this study, a test treatment effect is recognized only when the null hypothesis for the superiority of at least one endpoint and the null hypotheses for the non-inferiority of all endpoints are rejected simultaneously. In such a framework, the overall null hypothesis is the union of the null hypotheses of superiority and non-inferiority. Therefore, we consider the combined hypothesis for the superiority of at least one endpoint and the non-inferiority of the remaining endpoints. We consider a null hypothesis \(H_0\) and an alternative hypothesis expressed by

$$\begin{aligned} H_{0} : \left\{ \max _{1 \le j \le p} \delta _{j} \le 0 \right\} \cup \left\{ \min _{1 \le j \le p} (\delta _{j} + \epsilon _{j}) \le 0 \right\} \ \mathrm {versus} \ H_{1} : \mathrm {not} \ H_{0}, \end{aligned}$$

where \(\epsilon _{j} > 0 \ (j=1,\ldots ,p)\) is the non-inferiority margin of the jth endpoint, a prespecified positive constant. Furthermore, the non-inferiority part of \(H_0\) can be expressed as a union of null hypotheses of non-inferiority for the individual endpoints, since it states that at least one of the \(\delta _{j} + \epsilon _{j} \ (j=1,\ldots ,p)\) is less than or equal to zero. Therefore, \(H_{0}\) is also expressed as

$$\begin{aligned} H_{0} \equiv H^{(0)}_{0} \cup \left\{ H^{(1)}_{0} \cup \cdots \cup H^{(p)}_{0} \right\} , \end{aligned}$$

which defines the sub-hypothesis of superiority “\(H^{(0)}_{0} : \displaystyle \max _{1 \le j \le p} \delta _{j} \le 0\)” and the sub-hypotheses of non-inferiority “\(H^{(j)}_{0} : \delta _{j} \le -\epsilon _{j}\)”, for \(j=1,\ldots ,p\). \(H^{(0)}_{0}\) can be tested with the one-sided ALRT, and the IUT (Berger [14]) can be applied to test \(H_{0}\).
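For instance, with \(p=2\) endpoints the decomposition reads

$$\begin{aligned} H_{0} : \{\delta _{1} \le 0 \ \mathrm {and} \ \delta _{2} \le 0\} \cup \{\delta _{1} \le -\epsilon _{1}\} \cup \{\delta _{2} \le -\epsilon _{2}\}, \end{aligned}$$

where the first component is \(H^{(0)}_{0}\) and the last two are \(H^{(1)}_{0}\) and \(H^{(2)}_{0}\); \(H_{0}\) is rejected only if superiority is shown for at least one endpoint and non-inferiority is shown for both.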

New Test Statistics

To determine the IUT statistics, we need to estimate \({g}_{ij}\), \(\pi _{ij}\) and \(\phi _{(i)jj'}\) considering the latent variable underlying \(Y_{ijk}\). In Subsect. 3.1 below, we propose an estimation procedure for those parameters. While Sozu et al. [6] used a sample proportion to obtain an estimator of \(g_{ij}\), we propose a new procedure for estimating \(g_{ij}\) under the sub-null hypothesis \(H^{(0)}_{0}\). In Subsect. 3.2, we propose a new testing procedure that extends the procedure proposed by Nakazuru et al. [10] using the parameters estimated in Subsect. 3.1.

Proposed Estimating Procedure

For the sake of simplicity, the process of estimating parameters is divided into the following two steps.

Step 1. Estimating the Cut-Off Point \(\varvec{g_{ij}}\)

We assume that \(\hat{g}_{ij}\) is the estimator of the latent cut-off point \(g_{ij}\), estimated as \(\hat{g}_{ij} = \Phi ^{-1}(1-\widetilde{\pi }_{ij})\), where \(\widetilde{\pi }_{ij}\) is the maximum likelihood estimator (MLE) of the marginal probability derived from the p-variate Bernoulli distribution. Let the probability mass function of the p-variate Bernoulli distribution be

$$\begin{aligned} \mathrm {P}(Y_{i1k}=y_{i1k},\ldots , Y_{ipk}=y_{ipk})&= \theta ^{\prod _{j=1}^p (1-y_{ijk})}_{(i) 0, 0,\ldots , 0} \times \theta ^{y_{i1k}\prod _{j=2}^p (1-y_{ijk})}_{(i) 1, 0,\ldots , 0}\\&\quad \times \theta ^{(1-y_{i1k})y_{i2k}\prod _{j=3}^p (1-y_{ijk})}_{(i) 0, 1,\ldots , 0} \times \quad \cdots \quad \times \theta ^{\prod _{j=1}^p y_{ijk}}_{(i) 1, 1,\ldots , 1}, \end{aligned}$$

where \(\theta _{(i) 0, 0,\ldots , 0},\ldots , \theta _{(i) 1, 1,\ldots , 1}\) are the joint probabilities when \(\varvec{Y}_{ik}\) takes the values \((0,\ldots , 0), \ldots ,(1,\ldots ,1)\), respectively, and \(\theta _{(i) 0, 0,\ldots , 0}+ \cdots +\theta _{(i) 1, 1,\ldots , 1} = 1\). \(\widetilde{\pi }_{ij}\) can be expressed as \(\widetilde{\pi }_{ij} = \sum _{\begin{array}{c} (s_1, s_2,\ldots , s_{p}) \in S, s_{j}=1 \end{array}}\hat{\theta }_{(i) s_1, s_2,\ldots , s_{p}}\) using the estimators of \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\), where \(S = \{ (s_1, s_2, \ldots , s_p) \mid s_j = 0, 1, \ j=1,\ldots ,p \}\) is the set of all possible response patterns.

In addition, \(\widetilde{\pi }_{ij}\) (and \(\hat{g}_{ij}\)) can be given in two ways, depending on whether \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\) is estimated under the sub-null hypothesis \(H^{(0)}_{0}\) or the sub-alternative hypothesis \(H^{(0)}_{1}:\mathrm {not} \ H^{(0)}_{0}\). Under the sub-null hypothesis \(H^{(0)}_{0}\), the Lagrange multiplier method is useful for obtaining the MLE. On the other hand, under the sub-alternative hypothesis \(H^{(0)}_{1}\), the estimator \(\hat{\theta }_{(i) s_1, s_2,\ldots , s_{p}}\) is obtained in closed form as a sample proportion.
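As a minimal sketch of Step 1 under the sub-alternative hypothesis, where the cell probabilities are simply sample proportions, the following illustrative code computes \(\widetilde{\pi }_{ij}\) and \(\hat{g}_{ij}\) for one group; the constrained estimation under \(H^{(0)}_{0}\) via Lagrange multipliers is not shown.

```python
import numpy as np
from itertools import product
from scipy.stats import norm

def step1_cutoffs_h1(y):
    """y: (n x p) 0/1 array for one group.
    Cell probabilities theta as sample proportions, marginals pi~ by summing the
    cells with s_j = 1, then the cut-off estimates g^ = Phi^{-1}(1 - pi~)."""
    n, p = y.shape
    theta = {s: np.mean(np.all(y == np.array(s), axis=1))
             for s in product((0, 1), repeat=p)}
    pi_tilde = np.array([sum(t for s, t in theta.items() if s[j] == 1)
                         for j in range(p)])
    return norm.ppf(1.0 - pi_tilde), pi_tilde

# usage: g_hat, pi_tilde = step1_cutoffs_h1(y1), with y1 as in the earlier sketch
```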

Step 2. Estimating the Joint and Marginal Probabilities

The estimator of the joint probability \(\phi _{(i)jj'}\) in (1) is also given in two ways, depending on \(\hat{g}_{ij}\), that is, on whether the \(\theta _{(i) s_1, s_2,\ldots , s_{p}}\) constructing \(\widetilde{\pi }_{ij}\) are estimated under the sub-null or the sub-alternative hypothesis. \(\hat{\phi }_{(i)jj'}\) can be given by

$$\begin{aligned} \hat{\phi }_{(i)jj'}= & {} \mathrm {Prob}(Z_{ij} \ge \hat{g}_{ij}, Z_{ij'} \ge \hat{g}_{ij'}) \\= & {} \int _{-\infty }^\infty \cdots \int _{\hat{g}_{ij}}^\infty \cdots \int _{\hat{g}_{ij'}}^\infty \cdots \int _{-\infty }^\infty f(z_1,\ldots , z_p; \hat{\gamma }_{(i)jj'}) \mathrm {d}z_1 \cdots \mathrm {d}z_p \ \mathrm {for} \ \mathrm {all} \ j \ne j', \end{aligned}$$

where \(f(z_1,\ldots , z_p; \hat{\gamma }_{(i)jj'})\) is the joint density function of \(\varvec{Z}_{ik}\), \(z_1,\ldots , z_p\) are random variables following the standard p-variate normal distribution, and \(\hat{\gamma }_{(i)jj'}\) is Pearson’s tetrachoric correlation (Pearson [15]) calculated from \((Y_{ij1},\ldots , Y_{ijn_i})\) and \((Y_{ij'1},\ldots , Y_{ij'n_i})\). Therefore, if the latent variables underlying the binary responses are assumed to follow a standardized multivariate normal distribution, \(\hat{\phi }_{(i)jj'}\) is determined by \(\hat{\gamma }_{(i)jj'}\) and the cut-off points given in Step 1. Furthermore, the estimator of the marginal probability \(\pi _{ij}\) entering \(\Sigma \) and \(\rho _{(i)jj'}\) in (1) should not be \(\widetilde{\pi }_{ij}\), which is obtained from the p-variate Bernoulli distribution, because the latent distribution must be taken into account. Let \(\hat{\pi }_{ij}\) denote the estimator of \(\pi _{ij}\), given by

$$\begin{aligned} \hat{\pi }_{ij}= & {} \mathrm {Prob}(Z_{ij} \ge \hat{g}_{ij}) \\= & {} \int _{-\infty }^\infty \cdots \int _{\hat{g}_{ij}}^\infty \cdots \int _{-\infty }^\infty f(z_1,\ldots , z_p; \hat{\gamma }_{(i)jj'}) \mathrm {d}z_1 \cdots \mathrm {d}z_p \ \mathrm {for} \ \mathrm {all} \ j. \end{aligned}$$

For example, with \(p=2\) endpoints, the estimator of \(\phi _{(i)12}\) is written as

$$\begin{aligned} \hat{\phi }_{(i)12}= & {} \int _{\hat{g}_{i2}}^\infty \int _{\hat{g}_{i1}}^\infty f(z_1, z_2; \hat{\gamma }_{(i)12}) \mathrm {d}z_1 \mathrm {d}z_2. \end{aligned}$$

Furthermore, the marginal probabilities \(\pi _{i1}\) and \(\pi _{i2}\) are described as follows:

$$\begin{aligned} \hat{\pi }_{i1}\,= \,& {} \mathrm {Prob}(Z_{i1} \ge \hat{g}_{i1}) = \int _{-\infty }^\infty \int _{\hat{g}_{i1}}^\infty f(z_1, z_2; \hat{\gamma }_{(i)12}) \mathrm {d}z_1 \mathrm {d}z_2, \\ \hat{\pi }_{i2} = \, & {} \mathrm {Prob}(Z_{i2} \ge \hat{g}_{i2}) = \int _{-\infty }^\infty \int _{\hat{g}_{i2}}^\infty f(z_1, z_2; \hat{\gamma }_{(i)12}) \mathrm {d}z_2 \mathrm {d}z_1. \end{aligned}$$
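For \(p=2\), these orthant probabilities reduce to bivariate normal probabilities and can be evaluated numerically. The sketch below assumes the tetrachoric correlation \(\hat{\gamma }_{(i)12}\) has already been estimated (e.g., by maximum likelihood on the observed 2 × 2 table); the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def joint_and_marginals(g1, g2, gamma):
    """phi^ = P(Z1 >= g1, Z2 >= g2) under a standard bivariate normal with tetrachoric
    correlation gamma; by symmetry of the normal this equals the CDF at (-g1, -g2)."""
    cov = np.array([[1.0, gamma], [gamma, 1.0]])
    phi12 = multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([-g1, -g2])
    pi1 = 1.0 - norm.cdf(g1)     # marginal probability P(Z1 >= g1)
    pi2 = 1.0 - norm.cdf(g2)     # marginal probability P(Z2 >= g2)
    return phi12, pi1, pi2
```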

Along with the estimation of the joint and marginal probabilities, the estimated covariance matrix \(\hat{\varvec{\Sigma }}\) is defined as follows:

$$\begin{aligned} \hat{\varvec{\Sigma }}= & {} \hat{\varvec{\Sigma }}^{(1)} + \hat{\varvec{\Sigma }}^{(2)} \\= & {} \frac{1}{n_{1}}\left(\begin{array}{ccc} \hat{\pi }_{11}(1-\hat{\pi }_{11}) &{} \cdots &{} \hat{\phi }_{(1)1p} - \hat{\pi }_{11}\hat{\pi }_{1p}\\ \vdots &{} \ddots &{} \vdots \\ \hat{\phi }_{(1)p1} - \hat{\pi }_{11}\hat{\pi }_{1p} &{} \cdots &{} \hat{\pi }_{1p}(1-\hat{\pi }_{1p}) \\ \end{array} \right) \\&+ \frac{1}{n_{2}}\left( \begin{array}{ccc} \hat{\pi }_{21}(1-\hat{\pi }_{21}) &{} \cdots &{} \hat{\phi }_{(2)1p} - \hat{\pi }_{21}\hat{\pi }_{2p}\\ \vdots &{} \ddots &{} \vdots \\ \hat{\phi }_{(2)p1} - \hat{\pi }_{21}\hat{\pi }_{2p} &{} \cdots &{} \hat{\pi }_{2p}(1-\hat{\pi }_{2p}) \\ \end{array} \right) . \end{aligned}$$
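Putting the pieces together for \(p=2\), a sketch of one group's block of \(\hat{\varvec{\Sigma }}\) follows (an illustrative helper; \(\hat{\varvec{\Sigma }}\) itself is the sum of the two group blocks).

```python
import numpy as np

def sigma_hat_group(pi_hat, phi12_hat, n):
    """One group's block of Sigma^ for p = 2, following the matrix displayed above:
    diagonal pi^(1 - pi^)/n and off-diagonal (phi^ - pi^_1 pi^_2)/n."""
    v1 = pi_hat[0] * (1.0 - pi_hat[0])
    v2 = pi_hat[1] * (1.0 - pi_hat[1])
    c12 = phi12_hat - pi_hat[0] * pi_hat[1]
    return np.array([[v1, c12], [c12, v2]]) / n

# Sigma^ = Sigma^(1) + Sigma^(2):
# sigma_hat = sigma_hat_group(pi_hat_1, phi_hat_1, n1) + sigma_hat_group(pi_hat_2, phi_hat_2, n2)
```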

Proposed Test Statistics

We consider the following new IUT statistics to test hypothesis \(H_{0}\) versus \(H_{1}\).

$$\begin{aligned} T^{(0)}&:\,\,\mathrm {min}(\overline{u}_{A}^{2},\overline{u}_{B}^{2}) \ \ \ \mathrm {and} \\ T^{(j)}&:\,\,\frac{X_{j} + \epsilon _{j}}{\sqrt{\frac{\acute{\pi }_{1j}(1 - \acute{\pi }_{1j})}{n_{1}} + \frac{\acute{\pi }_{2j}(1 - \acute{\pi }_{2j})}{n_{2}}}} \ \ \mathrm {for} \ \ \ j = 1,\ldots ,p, \end{aligned}$$

where \(\acute{\pi }_{ij}\) is the MLE of \(\pi _{ij}\) derived under the sub-null hypothesis of non-inferiority \(H^{(j)}_{0}\), and \(T^{(j)}\) is a statistic commonly used in non-inferiority tests of binary endpoints (Farrington and Manning [16]). \(T^{(0)}\) and \(T^{(j)}\) are test statistics corresponding to the null hypotheses for superiority \(H_{0}^{(0)}\) and non-inferiority \(H_{0}^{(1)},\ldots ,H_{0}^{(p)}\), respectively. The proposed IUT rejects \(H_{0}\) if and only if \(T^{(0)} > c\) and \(T^{(j)} > z_\alpha \) for all \(j = 1,\ldots ,p\), where c is a constant determined by

$$\sum _{j=0}^{p}\frac{p!}{j!(p-j)!}\frac{1}{2^p}Pr(\chi ^2_{j} > c) = \alpha .$$

Here, \(\chi _{j}^2\) denotes the \(\chi ^2\) distribution with j degrees of freedom, and \(\chi ^2_{0}\) is defined as the constant zero. \(\alpha \) is the nominal significance level, and \(z_{\alpha }\) is the upper 100\(\alpha \)th percentile of the standard normal distribution. See the Appendix for the derivation of \(\overline{u}_{A}\) and \(\overline{u}_{B}\).
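The critical value c has no closed form but is easily obtained numerically; a minimal sketch follows, which simply solves the displayed equation by root finding.

```python
from math import comb
from scipy.stats import chi2
from scipy.optimize import brentq

def critical_value(p, alpha=0.05):
    """Solve sum_{j=0}^{p} C(p, j) 2^{-p} Pr(chi2_j > c) = alpha for c;
    the j = 0 term vanishes because chi2_0 is the constant zero."""
    def exceedance(c):
        return sum(comb(p, j) * chi2.sf(c, df=j) for j in range(1, p + 1)) / 2 ** p
    return brentq(lambda c: exceedance(c) - alpha, 1e-8, 100.0)

print(critical_value(2))   # approximately 4.23 for p = 2 and alpha = 0.05
```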

Note that the test statistic \(T^{(0)}\) can be computed in two ways. One is provided by \(\hat{\varvec{\Sigma }}\) derived under the sub-null hypothesis \(H^{(0)}_{0}\), and the other is provided by \(\hat{\varvec{\Sigma }}\) derived under the sub-alternative hypothesis \(H^{(0)}_{1}\). Thus, we consider that there are also two types of IUT statistics. Let the IUT statistics using \(T^{(0)}\) estimated under the sub-null hypothesis \(H^{(0)}_{0}\) be the \(T^{(0)}_{0}\) test type, and those using \(T^{(0)}\) estimated under the sub-alternative hypothesis \(H^{(0)}_{1}\) be the \(T^{(0)}_{1}\) test type.
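For the non-inferiority component \(T^{(j)}\) defined above, the restricted MLE \(\acute{\pi }_{ij}\) under \(\delta _{j} = -\epsilon _{j}\) can be obtained from the closed-form solution of Farrington and Manning [16]; the sketch below instead finds it by direct numerical maximization of the binomial likelihood, which is an illustrative simplification rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def t_noninferiority(x1, n1, x2, n2, eps):
    """Sketch of T^(j): the restricted MLEs are obtained under the boundary constraint
    pi_1 = pi_2 - eps (i.e., delta = -eps) by numerical likelihood maximization."""
    def negloglik(p2):
        p1 = p2 - eps
        if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
            return np.inf
        return -(x1 * np.log(p1) + (n1 - x1) * np.log(1.0 - p1)
                 + x2 * np.log(p2) + (n2 - x2) * np.log(1.0 - p2))
    p2_r = minimize_scalar(negloglik, bounds=(eps + 1e-9, 1.0 - 1e-9),
                           method="bounded").x
    p1_r = p2_r - eps
    diff = x1 / n1 - x2 / n2                      # X_j: difference of sample proportions
    se = np.sqrt(p1_r * (1.0 - p1_r) / n1 + p2_r * (1.0 - p2_r) / n2)
    return (diff + eps) / se
```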

Simulation Study

In order to evaluate the performance of the proposed IUT, we use Monte Carlo simulation to calculate the type I error rate and the powers of the \(T_{0}^{(0)}\) and \(T_{1}^{(0)}\) test types. In all simulations, we consider \(n_{1}=n_{2}=50, 100, 200\), \(\epsilon _{1}=\epsilon _{2}=0.2\), and \(\alpha =0.05\). The random numbers are generated from a standardized p-variate normal distribution, and the binary response variables are obtained by dichotomizing them at the cut-off points determined by \(\varvec{\pi }_{i}=({\pi }_{i1},\ldots , {\pi }_{ip})^t\). The correlation between the latent variables is set to \(\rho =0, 0.4, 0.8\).
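The overall simulation loop has the following simple skeleton (a hedged sketch: `simulate` stands for the latent-normal data generator illustrated in Sect. 2 and `reject` for the full IUT decision rule of Sect. 3; both are placeholders, not functions defined in the paper).

```python
def empirical_rate(n_rep, simulate, reject):
    """Monte Carlo skeleton: `simulate()` returns one replicate (y1, y2) and
    `reject(y1, y2)` returns True when the IUT of Sect. 3 rejects H0."""
    hits = sum(reject(*simulate()) for _ in range(n_rep))
    return hits / n_rep

# Under H0 the returned rate estimates the type I error; under H1 it estimates the power.
```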

Type I Error Rate

We compare the type I error rate of the \(T_{0}^{(0)}\) test type and the \(T_{1}^{(0)}\) test type in the case \(p=2\). The generation of simulated data is repeated 1,000,000 times.

Table 1 shows the type I error rates for the two test types in the case \(p=2\). The type I error rate is greater than the nominal significance level for the \(T^{(0)}_{1}\) test type when the correlation between the endpoints is zero with a large sample size. The \(T^{(0)}_{0}\) test type is more conservative than the \(T^{(0)}_{1}\) test type. The type I error rate is lower when \(\pi _{ij}\) is close to zero than when \(\pi _{ij}\) is 0.5. In the case where at least one of the differences in \(\pi _{ij}\) is less than zero and is within the non-inferiority margin, the type I error rate is less than when the difference in \(\pi _{ij}\) is zero, and it markedly decreases as the difference in \(\pi _{ij}\) becomes closer to the non-inferiority margin. The type I error rate is much smaller when there are inferior endpoints.

Table 1 Type I error rates for \(p=2\)

In addition, focusing on the scenarios in which inflation was likely to occur in the case \(p=2\), we generated simulation data 100,000 times and also checked the type I error rate in the case \(p=3\).

Table 2 shows the type I error rates for the two test types in the case \(p=3\). When the sample size is large and the correlations among the three endpoints are high, the type I error rate for both \(T^{(0)}_{0}\) and \(T^{(0)}_{1}\) is greater than the nominal significance level.

Table 2 Type I error rates for \(p=3\)

Power

We compare the powers of the \(T_{0}^{(0)}\) and \(T_{1}^{(0)}\) test types for the proposed IUT in the case \(p=2\). We also compare the power of the proposed IUT with that of a closed testing procedure that confirms the superiority of at least one of the two endpoints after the non-inferiority of both endpoints is confirmed. The Bonferroni-corrected p-value (Bonferroni [17]) is used to test for superiority in the closed testing procedure. The generation of simulated data is repeated 100,000 times.

Table 3 shows the empirical powers of the proposed IUT and the closed testing procedure. The power of the \(T^{(0)}_{1}\) test type is greater than that of the \(T^{(0)}_{0}\) test type, and the difference becomes larger as the correlation between the endpoints increases when the sample size is small. Even when the difference between the endpoints increases, the relationship between the power of the \(T^{(0)}_{0}\) test type and that of the \(T^{(0)}_{1}\) test type does not change much. On the other hand, as the sample size increases, the power of the \(T^{(0)}_{0}\) test type becomes similar to that of the \(T^{(0)}_{1}\) test type. Furthermore, the power of the proposed IUT is always greater than that of the closed testing procedure.

Table 3 Empirical powers

Even if the differences in \(\pi _{ij}\) are identical to each other, the power tends to be large when \(\pi _{ij}\) is close to zero, except when the correlation is zero and the sample size is small. On the other hand, if at least one endpoint is superior and the differences of all remaining endpoints are less than zero and within the non-inferiority margin, the power is lower than when all of the differences in \(\pi _{ij}\) are greater than zero. Furthermore, when there is an endpoint that is inferior, the power becomes quite small. The power does not monotonically increase or decrease depending on the correlation coefficient. The results for \(p=2\) and \(p=3\) in Subsect. 4.3 below show that as the number of endpoints that differ between the two groups increases, the power of the closed testing procedure is noticeably lower than that of the proposed IUT.

Power is Reduced when Non-Inferiority Test is Added to a Superiority Test

We also compare the performance of the proposed IUT with that of the superiority test alone (i.e., excluding the non-inferiority tests) in the cases \(p=2\) and \(p=3\). The generation of simulated data is repeated 100,000 times.

Table 4 shows a power comparison between the proposed IUT and the superiority test alone for the case of \(p=2\). If all the differences in \(\pi _{ij}\) are greater than or equal to zero, as the sample size increases, regardless of the value of the correlation coefficient, the powers of the superiority test alone and the IUT remain similar. On the other hand, when the differences in \(\pi _{ij}\) between the two groups are partially within the non-inferiority margin, the power of IUT is low compared to the superiority test alone. When \(\pi _{ij}\) is close to zero, the powers of the superiority test alone and the IUT are more similar than when \(\pi _{ij}\) is close to 0.5.

Table 4 Evaluation of power reduction when adding the non-inferiority test for \(p=2\)

Table 5 shows a power comparison between the proposed IUT and the superiority test alone for the case of \(p=3\). As with the case of \(p=2\), if all the differences in \(\pi _{ij}\) are greater than or equal to zero, adding a non-inferiority test does not reduce the power much when the sample size is large. Conversely, the larger the number of endpoints for which the differences in \(\pi _{ij}\) fall within the non-inferiority margin, the greater the reduction in the power of the IUT.

Table 5 Evaluation of power reduction when adding the non-inferiority test for \(p=3\)

Numerical Example

We present the results of applying the proposed IUT to an actual trial that confirmed the efficacy of 4F-PCC [4]. The clinical trial was a multicentre, randomized, open-label, phase III trial of patients aged 18 years or older needing rapid vitamin K antagonist reversal before an urgent surgical or invasive procedure. As mentioned in Sect. 1, this study included two primary endpoints, namely (i) the percentage of patients with a hemostatic effect, defined as a binary variable based on predicted blood loss, and (ii) the percentage with a decrease in the INR. The analyses were intended to evaluate, in a hierarchical fashion, first non-inferiority for both endpoints, then superiority if non-inferiority was achieved. Based on the results of the study (for details, see Fig. 3 in [4]), we consider the case of \(\hat{\pi }_{11} = 0.9\) and \(\hat{\pi }_{21} = 0.55\) for the percentage with a hemostatic effect, and \(\hat{\pi }_{12} = 0.75\) and \(\hat{\pi }_{22} = 0.1\) for the percentage with a decrease in the INR. The non-inferiority margin is set at 0.1 for both endpoints, and the sample sizes of the groups are \(n_1 = 87\) and \(n_2 = 81\). Since the correlations between the two variables cannot be derived from the reported results, we consider the cases \(\hat{\gamma }_{(1)12}=\hat{\gamma }_{(2)12}=0, 0.2, 0.4, 0.6, 0.8\). Since only the statistic calculated under the sub-alternative hypothesis \(H^{(0)}_{1}\) can be obtained from the reported information, Table 6 provides the values of \(T^{(0)}_{1}\), \(T^{(1)}\) and \(T^{(2)}\) together with the result of the IUT (rejected or accepted) at significance level \(\alpha \) = 0.05.

Table 6 Numerical example

In the actual trial, demonstrating non-inferiority for both endpoints was the first step; however, even if the superiority of at least one endpoint had been tested simultaneously, the same conclusion would have been obtained for any assumed value of the correlation coefficient.

Concluding Remarks

In this article, we developed a testing procedure for studies with multiple binary endpoints and a latent distribution. This was performed within a framework in which the efficacy of a test treatment is recognized when at least one endpoint demonstrates superiority and the remaining endpoints demonstrate non-inferiority. We derived two types of test statistics using cut-off points estimated under the sub null hypothesis \(H^{(0)}_{0}\) and the sub-alternative hypothesis \(H^{(0)}_{1}\), and these procedures were compared in a numerical experiment using a Monte Carlo simulation.

The numerical experiment clearly demonstrated that the \(T^{(0)}_{1}\) test type was always more powerful than the \(T^{(0)}_{0}\) test type. However, \(\alpha \)-violation occurred in the \(T^{(0)}_{1}\) test type when the sample size was large and the correlation coefficient was zero in the case \(p=2\). As the number of endpoints increased, we also found that \(\alpha \)-violation was more likely to occur in the scenario where all differences of the endpoints were zero. On the other hand, \(\alpha \)-violation did not occur when any of the endpoints was inferior. We believe that this will not be a fatal problem in practice because the framework of this study covers situations in which any of the endpoints may be inferior. However, since \(\alpha \)-violation is a serious issue in confirmatory clinical trials, it is necessary to develop a method that does not cause \(\alpha \)-violation, or we should choose a non-inflationary test when the number of endpoints is large. Since there was not a large difference in power between the \(T^{(0)}_{0}\) and \(T^{(0)}_{1}\) test types, it may be reasonable to preferentially use the \(T^{(0)}_{0}\) test type, especially if the correlation coefficients between the endpoints have not been investigated. Furthermore, this study showed a marked decrease in power as the number of endpoints whose differences fell within the non-inferiority margin increased. Provided the number of superior endpoints was the same, power did not decrease when the number of endpoints was increased.

The power does not monotonically increase or decrease depending on the correlation coefficient between the endpoints. When determining the sample size, it is difficult to know whether the power will increase if the correlation is changed for a fixed difference in the endpoints. A further non-negligible result is that type I error rates tend to increase when the correlation is low and the sample size is large. We also found that, as the number of endpoints increases, the higher the correlation, the more likely the type I error rate is to be inflated. For a less problematic sample size determination, the correlations between endpoints should be accurately investigated before trial planning. However, in practice, it is difficult to estimate the correlation coefficients between endpoints during trial planning. Therefore, we recommend that power be simulated assuming several correlation coefficients under fixed proportions for each endpoint, and that the most conservative sample size among all scenarios be used in the trial. According to Offen et al. [18] and Sankoh et al. [19], the correlation coefficients between multiple endpoints in clinical trials are approximately equal to 0.4 and range from 0.2 to 0.8, which may help in setting up the simulation scenarios. For example, if we assume that the proportions of the two responses are 0.6 for the treatment group and 0.5 for the control group, the power will exceed 0.8 if the correlation coefficient is zero, but will fall below 0.8 if the correlation coefficient is 0.4 to 0.8. Therefore, assuming a correlation of 0.8 in this case would be a conservative design.

Incidentally, like the proposed testing procedure, a closed testing procedure can be used for multiple endpoints where the familywise error rate is kept below the nominal significance level. In the framework of this study, the proposed IUT was shown to be more powerful than the closed testing procedure regardless of the correlation coefficient between endpoints, the difference between endpoints, the number of noticeably different endpoints, the sample size, and whether or not the proportions are close to zero. Although the closed testing procedure has a significant advantage in that it does not require control of the type I error rate in individual tests when there are inclusion relationships between null hypotheses, it may be more reasonable to use the proposed IUT in the framework of this study, where the superiority of at least one endpoint and the non-inferiorities of the remaining endpoints are confirmed simultaneously.

We also demonstrated the power reduction that occurs when the non-inferiority test is added to the superiority test. Our simulations showed that there was only a minimal decrease in power when the proportions of responses to the test treatment were all equal to or higher than those to the control treatment. When the proportions were close to zero, the power was almost the same for the superiority test alone and the proposed IUT. By contrast, the power was reduced when the treatment effects were partially within the non-inferiority margin and the sample size was small. In particular, the power decreased remarkably with the number of differences in the primary endpoints between the two groups that fell within the non-inferiority margin. Note that in such a situation, a large sample size is needed to detect the difference. The development of more efficient methods with higher power in such cases is required in the future.

Furthermore, the smaller the correlation coefficient, the lower the power of the proposed method in comparison with a procedure that tests only superiority. Even so, when all endpoints are binary and have a continuous latent distribution, and the sample size for the primary analysis is determined under assumed differences in proportions and correlations between the endpoints, it is ideal in practice to use the proposed testing procedure to confirm not only the superiority of at least one endpoint but also the non-inferiority of all remaining endpoints.

In the fields of economics, business, and education, a paradigm shift toward methods that do not use p-values to establish statistical evidence continues to be proposed (Bhatti and Kim [20]). Currently, owing to drug approval regulations, it is unavoidable that the efficacy of a drug is determined by the p-value of the confirmatory trial. In the future, however, methods that do not rely on p-values will also need to be developed, even for drawing conclusions in the exploratory phase.

Abbreviations

ACR:

American College of Rheumatology criteria

ACR20:

20% improvement in American College of Rheumatology criteria

ALRT:

Approximate likelihood ratio test

DAS28:

Disease Activity Score modified to include the 28 diarthrodial joint count

ESR:

Erythrocyte sedimentation rate

4F-PCC:

Four-factor prothrombin complex concentrate

ICH:

International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use

INR:

International normalized ratio

IUT:

Intersection-union test

MLE:

Maximum likelihood estimator

References

1. International Conference on Harmonisation (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use: ICH Harmonised Tripartite Guideline E9: Statistical Principles for Clinical Trials (1998)

2. Smolen, J.S., Burmester, G.R., Combe, B., Curtis, J.R., Hall, S., Haraoui, B., van Vollenhoven, R., Cioffi, C., Ecoffet, C., Gervitz, L., Ionescu, L., Peterson, L., Fleischmann, R.: Head-to-head comparison of certolizumab pegol versus adalimumab in rheumatoid arthritis: 2-year efficacy and safety results from the randomised EXXELERATE study. Lancet 388(10061), 2763–2774 (2016). https://doi.org/10.1016/S0140-6736(16)31651-8

3. Reich, K., Langley, R.G., Papp, K.A., Ortonne, J.P., Unnebrink, K., Kaul, M., Valdes, J.M.: A 52-week trial comparing briakinumab with methotrexate in patients with psoriasis. N. Engl. J. Med. 365(17), 1586–1596 (2011). https://doi.org/10.1056/NEJMoa1010858

4. Goldstein, J.N., Refaai, M.A., Milling, T.J., Jr., Lewis, B., Goldberg-Alberts, R., Hug, B.A., Sarode, R.: Four-factor prothrombin complex concentrate versus plasma for rapid vitamin K antagonist reversal in patients needing urgent surgical or invasive interventions: a phase 3b, open-label, non-inferiority, randomised trial. Lancet 385(9982), 2077–2087 (2015). https://doi.org/10.1016/S0140-6736(14)61685-8

5. Marcus, R., Peritz, E., Gabriel, K.R.: On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976). https://doi.org/10.2307/2335748

6. Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in clinical trials with multiple co-primary binary endpoints. Stat. Med. 29(21), 2169–2179 (2010). https://doi.org/10.1002/sim.3972

7. Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in superiority clinical trials with multiple co-primary correlated endpoints. J. Biopharm. Stat. 21(4), 650–668 (2011). https://doi.org/10.1080/10543406.2011.551329

8. Sozu, T., Sugimoto, T., Hamasaki, T.: Sample size determination in clinical trials with multiple co-primary endpoints including mixed continuous and binary variables. Biom. J. 54(5), 716–729 (2012). https://doi.org/10.1002/bimj.201100221

9. Perlman, M.D., Wu, L.: A note on one-sided tests with multiple endpoints. Biometrics 60, 276–280 (2004). https://doi.org/10.1111/j.0006-341X.2004.00159.x

10. Nakazuru, Y., Sozu, T., Hamada, C., Yoshimura, I.: A new procedure of one-sided test in clinical trials with multiple endpoints. Jpn. J. Biom. 35, 17–35 (2014). https://doi.org/10.5691/jjb.35.17

11. Glimm, E., Srivastava, M., Lauter, J.: Multivariate tests of normal mean vectors with restricted alternatives. Commun. Stat. B: Simul. Comput. 31, 589–604 (2002)

12. Ishihara, T., Yamamoto, K.: A testing procedure in clinical trials with multiple binary endpoints. Commun. Stat. Theory Methods, advance online publication (2021)

13. Bahadur, R.R.: A representation of the joint distribution of responses to n dichotomous items. In: Solomon, H. (ed.) Studies in Item Analysis and Prediction. Stanford Mathematical Studies in the Social Sciences, pp. 158–168. Stanford University Press, Stanford (1961)

14. Berger, R.L.: Multiparameter hypothesis testing and acceptance sampling. Technometrics 24, 295–300 (1982). https://doi.org/10.1080/00401706.1982.10487790

15. Pearson, K.: Mathematical contributions to the theory of evolution.—VIII. On the inheritance of characters not capable of exact quantitative measurement.—Part I. Introductory. Part II. On the inheritance of coat-colour in horses. Part III. On the inheritance of eye-colour in man. Philos. Trans. R. Soc. A 195, 1–47 (1900). https://doi.org/10.1098/rsta.1900.0024

16. Farrington, C.P., Manning, G.: Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat. Med. 9, 1447–1454 (1990). https://doi.org/10.1002/sim.4780091208

17. Bonferroni, C.E.: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936)

18. Offen, W., Chuang-Stein, C., Dmitrienko, A., Littman, G., Maca, J., Meyerson, L., Muirhead, R., Stryszak, P., Boddy, A., Chen, K., Copley-Merriman, K., Dere, W., Givens, S., Hall, D., Henry, D., Jackson, J.D., Krishen, A., Liu, T., Ryder, S., Sankoh, A.J., Wang, J., Yeh, C.H.: Multiple co-primary endpoints: medical and statistical solutions. Drug Inf. J. 41, 31–46 (2007). https://doi.org/10.1177/009286150704100105

19. Sankoh, A.J., Huque, M.F., Russell, H.K., D’Agostino, R.B.: Global two-group multiple endpoint adjustment methods applied to clinical trials. Ther. Innov. Regul. Sci. 33, 119–140 (1999). https://doi.org/10.1177/009286159903300115

20. Bhatti, M.I., Kim, J.H.: Towards a new paradigm for statistical evidence in the use of p-value. Econometrics 9(1), 2 (2021). https://doi.org/10.3390/econometrics9010002


Acknowledgements

The authors would like to sincerely thank the editor-in-chief, associate editor and referees for their valuable comments about our paper.

Funding

The research was funded solely by the authors.

Author information

Correspondence to Takuma Ishihara.

Ethics declarations

Conflict of interest

The authors have declared no conflict of interest.

Appendix

We consider an ALRT for \(H^{(0)}_{0}\). Let \(\varvec{A}\) be the positive definite matrix such that \(\varvec{A}^{t}\varvec{A} = \hat{\varvec{\Sigma }}^{-1}\), where \(\hat{\varvec{\Sigma }}^{-1}\) is the inverse matrix of \(\hat{\varvec{\Sigma }}\). The statistic \(\varvec{u}_A = (u_{A1},\ldots ,u_{Ap})^{t} = \varvec{AX}\) is approximately distributed as a p-variate normal distribution with mean \(\varvec{A\Delta }\) and covariance matrix \(\varvec{I}\) (the identity matrix). Because \(\varvec{A}\) is not uniquely determined, for simplicity we represent \(\varvec{A}\) by the set of eigenvectors multiplied by the square roots of the corresponding eigenvalues. Furthermore, according to the procedure of Nakazuru et al. [10], \(\varvec{B}\) is defined as the matrix obtained by substituting the off-diagonal elements of \(\varvec{A}\) with their absolute values. Consider the two transformations such that

$$\begin{aligned} \varvec{u}_{A} \equiv (u_{A1},\ldots ,u_{Ap})^t\,= \,& {} {\varvec{AX}} \ \ \ \mathrm {and} \\ \varvec{u}_{B} \equiv (u_{B1},\ldots ,u_{Bp})^t= & {} \left( \frac{\mathrm {det}{\varvec{A}}}{\mathrm {det}{\varvec{B}}}\right) ^{2/p}{\varvec{BX}}. \end{aligned}$$

Under these assumptions, the ALRT rejects \(H^{(0)}_{0}\) if and only if

$$\begin{aligned} T^{(0)}&:&\mathrm {min}(\overline{u}_{A}^{2},\overline{u}_{B}^{2}) >c, \end{aligned}$$

where \(\overline{u}_{A}^{2}\) and \(\overline{u}_{B}^{2}\) are defined by

$$\begin{aligned} \overline{u}_{A}^{2}= & {} \sum _{j=1}^p \mathrm {max}(u_{Aj},0)^2, \\ \overline{u}_{B}^{2}= & {} \sum _{j=1}^p \mathrm {max}(u_{Bj},0)^2. \end{aligned}$$
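As an illustration of these definitions, the following sketch computes \(T^{(0)} = \min (\overline{u}_{A}^{2},\overline{u}_{B}^{2})\) from \(\varvec{X}\) and \(\hat{\varvec{\Sigma }}\). Because \(\varvec{A}\) is not unique, the symmetric square root of \(\hat{\varvec{\Sigma }}^{-1}\) is used here as one admissible choice; this is an assumption of the sketch rather than a prescription of the paper.

```python
import numpy as np

def t0_statistic(x, sigma_hat):
    """Sketch of T^(0) = min(ubar_A^2, ubar_B^2); the symmetric square root of
    Sigma^{-1} is taken as one admissible choice of A (A is not unique)."""
    w, v = np.linalg.eigh(np.linalg.inv(sigma_hat))
    a = v @ np.diag(np.sqrt(w)) @ v.T            # A with A^t A = Sigma^{-1}
    b = np.abs(a)                                # off-diagonals replaced by absolute values
    np.fill_diagonal(b, np.diag(a))              # diagonal of A kept as-is
    p = len(x)
    u_a = a @ x
    u_b = (np.linalg.det(a) / np.linalg.det(b)) ** (2.0 / p) * (b @ x)
    ubar_a2 = np.sum(np.maximum(u_a, 0.0) ** 2)
    ubar_b2 = np.sum(np.maximum(u_b, 0.0) ** 2)
    return min(ubar_a2, ubar_b2)
```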

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.



Cite this article

Ishihara, T., Yamamoto, K. A Test for Multiple Binary Endpoints with Continuous Latent Distribution in Clinical Trials. J Stat Theory Appl 20, 463–480 (2021). https://doi.org/10.1007/s44199-021-00003-3


Keywords

  • Latent variable
  • Non-inferiority
  • Superiority