Background

Treatment delivery for substance abuse has evolved from inpatient care to intensive outpatient care [1]. Although outpatient settings have increased the population of participants able to receive treatment, attrition is substantial in outpatient substance abuse treatment settings. Recent studies of substance abuse treatment clinical trials demonstrate considerable drop-out after the first dose of treatment [2–6].

The high percentage of study participant attrition documented in substance abuse research interferes with the effectiveness of treatment programs and calls into question the validity of study analyses. Furthermore, poor outcomes are associated with poor treatment retention [7]. Although missing data are rampant, they are often ignored in the presentation of clinical trials [4, 8]. Statistical methods of longitudinal data analysis often used in the substance abuse literature, such as data deletion or single imputation, may be biased or otherwise invalidated when missing data are substantial and/or not missing completely at random [8]. This is particularly true in substance abuse clinical trials, where missing data in outcomes at a particular point in time may be dependent upon previous outcomes. For example, a participant is likely to drop out of a substance abuse treatment clinical trial at the time of relapse.

The statistical literature details many methods of longitudinal data analysis that handle missing data; many have demonstrated robustness to assumptions of the missing data mechanism. These methods include, but are not limited to, multiple imputation [9, 10], pattern mixture models [11], selection models and stratified summary statistics (SSS) [12–16]. This article describes one of these methods, SSS, which may be used specifically for hypothesis testing of the treatment effect.

We first provide an overview of summary statistic and SSS methods. Next, we discuss modification and expansion of the SSS method using some of the methods often used in the statistical literature for data combination. Comparisons of these methods are made under different assumptions for the missing data – both mechanism and rate of attrition. Finally, we conclude by describing some of the strengths and limitations of SSS methods.

Summary Statistic Methods of Longitudinal Data Analysis and Missing Data

The summary statistic method of longitudinal data analysis is a technique by which each participant's multivariate outcome is reduced to a scalar summary measure. Comparisons of the scalar summary measures between treatments may then be analyzed using a variety of univariate statistical techniques [12, 15, 17–19]. For example, a summary statistic (e.g. mean, slope) is calculated for each individual over time. Then the average summary statistic response for each treatment group is calculated and compared using an independent t-test.
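As an illustration, the slope-based version of this approach can be sketched in a few lines of Python. The function names, visit times and outcome values below are hypothetical and chosen only for the example; the t statistic is the usual pooled-variance form.

```python
import math
from statistics import mean

def ols_slope(times, values):
    """Least-squares slope of values regressed on times (needs >= 2 points)."""
    t_bar, y_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    ss = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    df = n1 + n2 - 2
    t = math.sqrt(n1 * n2 / (n1 + n2)) * (m1 - m2) / math.sqrt(ss / df)
    return t, df

# Hypothetical outcomes at visits 1..4 for three participants per arm.
visits = [1, 2, 3, 4]
placebo = [[17, 16, 16, 15], [18, 18, 16, 15], [16, 15, 15, 13]]
treated = [[17, 15, 12, 10], [16, 13, 12, 8], [18, 15, 13, 11]]

# Step 1: reduce each participant's series to a scalar summary (the slope).
slopes_placebo = [ols_slope(visits, y) for y in placebo]
slopes_treated = [ols_slope(visits, y) for y in treated]

# Step 2: compare the two groups of slopes with an independent t-test.
t_stat, df = pooled_t(slopes_placebo, slopes_treated)
print(slopes_placebo, slopes_treated, round(t_stat, 2), df)
```

The t statistic would then be referred to a t-distribution with `df` degrees of freedom; that lookup is omitted here because it is not available in the Python standard library.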

As with any type of longitudinal data analysis, the summary statistic approach may need to be modified for losses to follow-up. Dawson and Han [14] studied the effect of the missing data mechanism on summary statistics. For example, when the slope is used as a summary statistic and the missing data mechanism is missing completely at random (MCAR), the variance of the slopes depends on the amount of outcome data available [14]. However, if the missing data mechanism is missing at random (MAR) or nonignorable (MNAR) and/or the trend is nonlinear, then the mean of the slopes may also depend on the amount of information available per individual.

If the missing data patterns differ between treatment arms, the summary test statistic approach may be invalid [20]. A method proposed by Dawson [12–16] may be applied to a variety of summary statistics whereby each participant's summary response is stratified according to their missing data pattern. This method is called Stratified Summary Statistic (SSS) as one stratifies the analysis according to missingness patterns. This 'stratification by missingness pattern' may be appropriate when the mean and/or variance of the summary statistic is dependent upon the amount or timing of the outcome [12–16].
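A minimal sketch of the stratification idea follows, assuming monotone dropout in which a missed visit is coded as `None` and every later visit is also missing. The record layout and the names `n_observed` and `stratify` are illustrative assumptions, not part of the published method.

```python
from collections import defaultdict

def n_observed(outcomes):
    """Number of visits before dropout; None marks a missing visit (monotone pattern)."""
    count = 0
    for y in outcomes:
        if y is None:
            break
        count += 1
    return count

def stratify(records):
    """Group summary statistics by (number of observed visits, treatment arm)."""
    strata = defaultdict(lambda: defaultdict(list))
    for arm, outcomes, summary in records:
        strata[n_observed(outcomes)][arm].append(summary)
    return strata

# Hypothetical records: (arm, outcome series, precomputed slope).
records = [
    ("placebo", [17, 16, None, None], -1.0),
    ("treatment", [17, 14, None, None], -3.0),
    ("placebo", [18, 17, 15, None], -1.5),
    ("treatment", [16, 13, 11, None], -2.5),
    ("treatment", [17, 15, 12, None], -2.5),
]
strata = stratify(records)
for g in sorted(strata):
    print(g, dict(strata[g]))
```

Each key of `strata` is one missingness pattern, so a separate test statistic can be computed within each key.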

The computation of SSS as described by Dawson [13] is detailed below.

Stratified Summary Statistic Calculations

  1. (1)

    Define an appropriate scalar measure (summary statistic) of the multivariate outcome (e.g. slope, mean) and compute the summary statistic for each individual over time. For example, when an outcome is expected to linearly increase or decrease over time, a slope may be a good selection of a summary statistic [13]. Statistically, the slopes, $S_{sj}$, may be calculated for each participant over time, $s = 1, \ldots, t$, in each treatment group, $j = 1, 2$.

  2. (2)

    Stratify participant slopes by the missing data pattern; slopes are stratified by the timing of each participant's dropout, $s = 1, \ldots, t$. For example, slopes for subjects with two observations over time are placed in one stratum, whereas slopes for subjects with three observations over time are placed in a separate stratum, and so on.

  3. (3)

    Compute stratum-specific test statistics, e.g., a t-test comparing average treatment differences. Suppose that the null hypothesis of interest is that the distribution functions of $S_{s1}$ and $S_{s2}$ are equal, $H_0: F_{s1}(s) = F_{s2}(s)$. Once slopes are calculated for each individual in each treatment arm, a stratum-specific t-test may be defined where independent observations are available and their sizes are $n_{s1}$ for $S_{s1}$ and $n_{s2}$ for $S_{s2}$. Assuming that $F_{s1}(s)$ and $F_{s2}(s)$ are normal distributions with equal variance, $\sigma^2$, the random variable

    $$t_s = \frac{\sqrt{\dfrac{n_{s1} n_{s2}}{n_{s1} + n_{s2}}}\,\left(\bar{S}_{s1} - \bar{S}_{s2}\right)}{\sqrt{\dfrac{\sum_{i=1}^{n_{s1}} \left(S_{is1} - \bar{S}_{s1}\right)^2 + \sum_{j=1}^{n_{s2}} \left(S_{js2} - \bar{S}_{s2}\right)^2}{n_{s1} + n_{s2} - 2}}}$$

    has a t-distribution with $n_{s1} + n_{s2} - 2$ degrees of freedom.

  4. (4)

    Weight each stratum-specific test statistic by the amount of data available. Dawson proposes a weight that increases with the number of participants within the stratum, $n_{s1}$ and $n_{s2}$, and with the number of observations per person in a given stratum, $g_s$ [13].

    $$w_s = \sqrt{\frac{g_s\, n_{s1}\, n_{s2}}{n_{s1} + n_{s2}}}$$

For example, Table 1 shows the number of subjects in each treatment arm for each stratum, where strata are defined by the number of visits each subject accumulates until dropout occurs (for this particular example, 22 subjects had 1 visit before drop-out, 17 subjects had 2 visits before drop-out, etc.). A weight for stratum 4 would be computed as

$$w_4 = \sqrt{\frac{g_4\, n_{41}\, n_{42}}{n_{41} + n_{42}}} = \sqrt{\frac{4 \cdot 15 \cdot 5}{15 + 5}} = 3.87,$$

whereas a weight for stratum 8 is computed as

$$w_8 = \sqrt{\frac{8 \cdot 38 \cdot 40}{38 + 40}} = 12.49.$$

The weight for stratum 8 is greater than that of stratum 4 because stratum 8 consists of a greater number of subjects (78 versus 20) as well as a larger number of longitudinal time points (8 versus 4) per subject until drop-out.

Table 1 Example of Subject Stratification for SSS. Rows Indicate the Treatment Arms, Columns Indicate Strata and Cell Values Indicate the Number of Participants

  5. (5)

    Combine weighted test statistics into an aggregate statistic (Dawson, 1994).

    $$Z = \frac{\sum_{s=1}^{t} w_s t_s}{\sqrt{\sum_{s=1}^{t} w_s^2}}, \quad s = 1, \ldots, t. \qquad \text{(a)}$$

The aggregate statistic in equation (a) is a weighted sum of the stratum-specific test statistics, where the weight, $w_s$, is defined in Step 4 and the test statistic, $t_s$, is defined in Step 3. This aggregate statistic is then compared to a standard normal distribution [12].
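Steps 4 and 5 can be sketched as follows. The stratum sizes are those quoted in the text for strata 4 and 8, while the stratum t statistics are made-up values used only to exercise the formula; the function and key names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def sss_weight(g, n1, n2):
    """Weight that grows with observations per person (g) and with arm sizes."""
    return math.sqrt(g * n1 * n2 / (n1 + n2))

def sss_aggregate(strata):
    """Aggregate Z of equation (a); strata is a list of dicts with keys g, n1, n2, t."""
    weights = [sss_weight(s["g"], s["n1"], s["n2"]) for s in strata]
    numerator = sum(w * s["t"] for w, s in zip(weights, strata))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator, weights

strata = [
    {"g": 4, "n1": 15, "n2": 5, "t": 1.10},   # sizes from the stratum 4 example; t is hypothetical
    {"g": 8, "n1": 38, "n2": 40, "t": 2.05},  # sizes from the stratum 8 example; t is hypothetical
]
z, weights = sss_aggregate(strata)

# Two-sided p-value from the standard normal reference distribution.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print([round(w, 2) for w in weights], round(z, 3), round(p_two_sided, 4))
```

Because the larger, longer-observed stratum receives the larger weight, its t statistic dominates the aggregate.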

Modified SSS

The SSS aggregate statistic as contributed by Dawson [13] may need to be slightly modified when a t-test rather than a z-test is chosen for the stratum-specific test. That is, the aggregate statistic may need to adjust for the degrees of freedom of each stratum-specific t-test. One modification of SSS is to divide the weighted sum of the t-test statistics by the standard deviation of that linear combination, using the fact that a t-distributed statistic with $v_s > 2$ degrees of freedom has variance $v_s/(v_s - 2)$.

Then the aggregate statistic is:

$$Z = \frac{\sum_{s=1}^{t} w_s t_s}{\sqrt{\sum_{s=1}^{t} w_s^2 \operatorname{Var}(t_s)}} = \frac{\sum_{s=1}^{t} w_s t_s}{\sqrt{\sum_{s=1}^{t} w_s^2\, \dfrac{v_s}{v_s - 2}}}, \quad s = 1, \ldots, t. \qquad \text{(b)}$$

The variable $v_s$ is the number of degrees of freedom associated with each stratum-specific t-test statistic.
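The effect of the degrees-of-freedom correction can be seen in a short sketch. The t statistics, weights and degrees of freedom below are hypothetical, and `modified_sss` is an illustrative name rather than an established routine.

```python
import math
from statistics import NormalDist

def modified_sss(t_stats, weights, dfs):
    """Weighted sum of t statistics scaled by its null standard deviation.

    Uses Var(t_s) = v_s / (v_s - 2), which requires v_s > 2 in every stratum.
    """
    if any(v <= 2 for v in dfs):
        raise ValueError("each stratum needs more than 2 degrees of freedom")
    numerator = sum(w * t for w, t in zip(weights, t_stats))
    variance = sum(w * w * v / (v - 2) for w, v in zip(weights, dfs))
    return numerator / math.sqrt(variance)

# Hypothetical stratum results.
t_stats = [1.10, 2.05]
weights = [2.0, 5.0]
dfs = [8, 30]

z_mod = modified_sss(t_stats, weights, dfs)
# Unmodified aggregate for comparison (treats each t statistic as unit variance).
z_raw = sum(w * t for w, t in zip(weights, t_stats)) / math.sqrt(sum(w * w for w in weights))
p_value = 2 * (1 - NormalDist().cdf(abs(z_mod)))
print(round(z_raw, 3), round(z_mod, 3), round(p_value, 4))
```

Since $v_s/(v_s - 2) > 1$, the modified statistic is always smaller in absolute value than the unmodified one, and the difference is largest when strata are small.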

Fisher's Combination of Probabilities from Independent Tests of Significance

The stratified summary statistic procedures described above are an example of combining independent test statistics. The statistical literature has supported many methods of combining independent data, including combining estimates, test statistics or p-values [12, 21–26]. A popular method for combining one-sided p-values was proposed by Fisher in 1950, which defines the following test statistic

$$T = -2 \sum_{s=1}^{t} \ln(p_s) \qquad \text{(c)}$$

where $p_s$ is the p-value for each stratum, $s = 1, \ldots, t$. The test statistic is then compared to a chi-square distribution with $2t$ degrees of freedom.

The sum of $t$ independent random variables, each with a chi-square distribution, is also a random variable that is distributed chi-square. The degrees of freedom for the summed random variable are calculated by summing the degrees of freedom of each of the independent random variables. Using equation (c), let $T = \sum_{s=1}^{t} T_s$, where $T_s = -2\ln(p_s)$ is distributed $\chi^2_2$ under the null hypothesis. Given that the $T_s$ are independent, $\sum_{s=1}^{t} T_s \sim \chi^2_{2t}$. In order to combine data using Fisher's method, p-values must be one-sided. Two-sided p-values from a symmetric test may be divided by two when the observed effect is in the hypothesized direction. Without loss of generality, Fisher's approach would always use $P(T > t^*)$ [22].
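Fisher's combination can be sketched without external libraries because the chi-square distribution with an even number of degrees of freedom has a closed-form upper tail. The p-values below are hypothetical, and the function names are assumptions of this sketch.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1, built term by term.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Fisher's statistic T = -2 * sum(ln p_s) and its p-value on 2t degrees of freedom."""
    t_stat = -2.0 * sum(math.log(p) for p in p_values)
    return t_stat, chi2_sf_even_df(t_stat, 2 * len(p_values))

one_sided = [0.04, 0.20, 0.10]  # hypothetical one-sided stratum p-values, each in (0, 1]
T, p_combined = fisher_combine(one_sided)
print(round(T, 3), round(p_combined, 4))
```

With a single stratum the combined p-value equals the input p-value, which is a convenient sanity check on the implementation.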

Fisher's statistic has the advantage over Dawson's SSS in that the combined p-values will follow a chi-square distribution. Combinations of test statistics will depend upon the distribution of the test statistics themselves. For example, when combining t-test statistics the test may need to be modified to account for the degrees of freedom associated with each test, as demonstrated above.

The Z Transformation Test and the Weighted Z Test

One disadvantage of the Fisher test is its asymmetrical transformation of p-values, which makes it more sensitive to data that reject the common null than to data that support the null [24]. The z-transform test does not have this sensitivity [24]. The test transforms (one to one) the one-sided p-values from independent tests ($s = 1, \ldots, t$) into z-values, $z_s$, from the standard normal distribution. The following statistic is then derived from the $t$ resulting z-values

$$Z = \frac{\sum_{s=1}^{t} z_s}{\sqrt{t}}.$$

Under the null hypothesis, the test statistic is then compared to a standard normal distribution.

Furthermore, the Z-transformation test may be weighted according to the power of each individual test [25]. This weighted Z method has the following test statistic

$$Z_w = \frac{\sum_{s=1}^{t} w_s z_s}{\sqrt{\sum_{s=1}^{t} w_s^2}}. \qquad \text{(d)}$$

If each test has equal power and is given an equal weight, then the weighted z-transform test reduces to the z-transform test. A proposed $w_s$ for the test includes weights that are proportional to the inverse of the error variance of each test [25]. If t-tests are used, then the proposed weights are the degrees of freedom for each t-test, i.e. $w_s = v_s$ [25].

The standard normal deviate, $z_s$, corresponds to each one-tailed p-value, $p_s$. Also, the $z_s$ will have the same sign if the effects are in the same direction but different signs if the effects are in opposite directions. That is, each $z_s$ should have the same sign as the corresponding t-value for each test [24, 25]. Once the normal deviates are computed and combined, the resulting p-value of the aggregate test may be converted to either one- or two-sided.
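Both the unweighted and the weighted z-transform tests can be sketched with the standard library. This sketch normalizes the weighted sum by $\sqrt{\sum w_s^2}$, the scaling under which the statistic has unit variance under the null; the p-values and weights are hypothetical, and the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def z_from_p(p_one_sided):
    """Standard normal deviate whose upper-tail area equals the one-sided p-value.

    The p-value must lie strictly between 0 and 1.
    """
    return _STD_NORMAL.inv_cdf(1.0 - p_one_sided)

def z_transform(p_values):
    """Unweighted z-transform test statistic."""
    zs = [z_from_p(p) for p in p_values]
    return sum(zs) / math.sqrt(len(zs))

def weighted_z(p_values, weights):
    """Weighted z-transform test statistic, scaled to unit variance under the null."""
    zs = [z_from_p(p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, zs))
    return numerator / math.sqrt(sum(w * w for w in weights))

one_sided = [0.04, 0.20, 0.10]  # hypothetical one-sided stratum p-values
dfs = [8, 18, 70]               # hypothetical degrees of freedom used as weights

z_unweighted = z_transform(one_sided)
z_weighted = weighted_z(one_sided, dfs)
p_one_sided = 1 - _STD_NORMAL.cdf(z_weighted)
print(round(z_unweighted, 3), round(z_weighted, 3), round(p_one_sided, 4))
```

With equal weights the two statistics coincide, which matches the reduction described in the text.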

Methods

A Monte Carlo study incorporating the general design of outpatient substance abuse clinical trials was used to assess the Type I error and power of hypothesis tests of the treatment effect. Assumptions of the simulated dataset were as follows: the outcome was assumed to follow a multivariate normal distribution, and within-unit (subject) variation was assumed to follow a compound symmetry structure. A common correlation coefficient of 0.6 was estimated from the complete cases of previous substance abuse clinical trials [27]. The outcome was assumed to follow a linear trend, with participants in both treatment groups having similar outcomes at the beginning of the study and then decreasing over time. For simulations of Type I error we let $F_{placebo}(y) = F_{treatment}(y)$. Data were simulated as multivariate normal with mean vector [17 16 15 14 13 12 11 10] and $\sigma(y_j) = 20$ for $j = 1, \ldots, 8$. For simulations of power we let $F_{placebo}(y) \neq F_{treatment}(y)$; the treatment effect was assumed to increase over time, i.e. the mean vector for the treatment arm was set at [17 15.05 13.1 11.15 9.2 7.25 5.3 3.35] such that the power for the SSS analysis was approximately 80%. Since this is a study of longitudinal data analysis, each participant was assumed to have at least two measurements. A total sample size of n = 100 was assessed.
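One way to generate such data is a shared-effect construction, sketched below. The equal split of 50 participants per arm, the seed and the function name are assumptions made for illustration; the means, standard deviation and correlation are those stated above.

```python
import math
import random

def simulate_subject(means, sd, rho, rng):
    """One outcome vector with common variance sd**2 and common correlation rho.

    A shared subject effect plus independent visit-level noise yields a
    compound-symmetry covariance for 0 <= rho <= 1.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        mu + sd * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
        for mu in means
    ]

PLACEBO_MEANS = [17, 16, 15, 14, 13, 12, 11, 10]
TREATMENT_MEANS = [17, 15.05, 13.1, 11.15, 9.2, 7.25, 5.3, 3.35]

rng = random.Random(2024)  # arbitrary seed for reproducibility
placebo = [simulate_subject(PLACEBO_MEANS, 20.0, 0.6, rng) for _ in range(50)]
treated = [simulate_subject(TREATMENT_MEANS, 20.0, 0.6, rng) for _ in range(50)]
print(len(placebo), len(treated), len(placebo[0]))
```

Each simulated value has variance $20^2(\rho + 1 - \rho) = 400$, and any two visits of the same participant share covariance $400\rho$, which is the compound-symmetry structure.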

Missing data patterns were assumed to be monotonic; i.e. each subject was observed and data were recorded until withdrawal from the study, and those who withdrew were not observed for the remainder of the study [28]. Missing data patterns in which subjects miss a visit and are lost thereafter are described as monotonic [27]. This complete 'loss to follow-up' gives rise to the possibility that the missing data mechanism is not random and may be dependent upon observed and unobserved values of the outcome.

Several missing data mechanisms, defined by their dependence on observed and unobserved values of the outcome, have been classified by Rubin [9]. The specific case of monotonic missing data mechanisms for multivariate/longitudinal data has been further described by Schafer and Graham [29]. Assume that the outcome variable, $Y_{ij}$, can be measured for each individual, $i = 1, \ldots, n$, at several points in time, $j = 1, \ldots, t$, as defined by the design of the longitudinal study. Missing data that are classified as missing completely at random (MCAR) are independent of any outcome variables and any covariates of interest. Missing at random (MAR) [30] means that missingness at the time of the missed visit, $j = m$, may depend on any of the outcomes observed until that time, i.e. on $Y_{i1}, \ldots, Y_{i(m-1)}$. Missing not at random (MNAR) means that missingness of $Y_{im}$ may depend on any outcome not observed due to missed visits. If $m$ is defined as the time at which a subject drops out of a study and does not return, then the missing data may be dependent on any of the unobserved outcomes, $Y_{im}, \ldots, Y_{it}$.

Missing data due to withdrawal were tested under three missing data assumptions; i.e., missing data may be considered MCAR, MAR or MNAR with respect to outcome. In order to simulate the missing data mechanism, a complete data set was first simulated. The probability of drop-out was assumed to follow a logistic regression model [31–33] and was used to simulate the missing data in the complete dataset.

For example, missing data that are MAR in a longitudinal dataset are dependent on outcomes observed prior to the dropout. If we let the function $h_k(y_1, \ldots, y_k)$, where $k = 1, \ldots, (t - 1)$, be a covariate in a logistic regression model for the probability of drop-out, we have the following logit model:

$$\operatorname{logit}(p_k) = \log\frac{p_k}{1 - p_k} = \alpha + \beta h_k,$$

where α is the intercept and β is the slope of the logit.

The function $h_k$ can be defined as the latest observed measurement, i.e. $h_k(y_1, \ldots, y_k) = y_k$. Using the latest observation in substance abuse trials may have validity since much of the drop-out observed may be due to relapse or no change in response. Therefore, observed positive tests or high levels of cocaine (benzoylecgonine) may be predictive of drop-out and the missing data mechanism can be classified as MAR. Using this function, the probability of dropout for each time point may be computed as

$$p_k = \frac{e^{\alpha + \beta h_k}}{1 + e^{\alpha + \beta h_k}}.$$

If β = 0, then the missing data mechanism is MCAR.

In order to simulate a 10% missing data percentage with a MAR missing data mechanism under the null, we set α13 = -106, α14 = -105, α15 = -104, α16 = -103, α17 = -102, α18 = -101 and β = 2. For a 40% missing data percentage parameters were set as follows: α13 = -70, α14 = -69, α15 = -68, α16 = -67, α17 = -65, α18 = -64 and β = 2.
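To see how such settings translate into dropout probabilities, the inverse logit can be evaluated directly. The sketch below uses an intercept of -106 and β = 2 from the 10% MAR setting; the function name and the grid of last observed values are illustrative assumptions.

```python
import math

def dropout_prob(alpha, beta, h):
    """Inverse logit of alpha + beta * h, written to avoid overflow for large |eta|."""
    eta = alpha + beta * h
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Dropout probability as a function of the last observed outcome.
for h in (45, 50, 53, 56):
    print(h, dropout_prob(-106, 2, h))

# With beta = 0 the probability no longer depends on the outcome (MCAR).
print(dropout_prob(-3, 0, 45) == dropout_prob(-3, 0, 56))
```

Under these values the probability crosses 0.5 when the last observed outcome equals $-\alpha/\beta = 53$, so only participants with high recent outcomes are likely to drop out.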

However, if the missing data mechanism is not ignorable, then the logit model for each time point may be defined as:

$$\operatorname{logit}(p_{jk}) = \log\frac{p_{jk}}{1 - p_{jk}} = \alpha + \beta h_k + \gamma y_j,$$

where time is defined as $j = 1, \ldots, t$ and time before the last observation is defined as $k = 1, \ldots, (t - 1)$ [31, 34]. If γ = 0 for each time point then the dropout model is MAR; whereas, if γ ≠ 0 for each time point then the missing data mechanism is MNAR. That is, the unobserved outcome may be predictive of drop-out and the missing data mechanism may be MNAR.

To simulate a 10% missing data percentage with a MNAR missing data mechanism under the null, we set α13 = -106, α14 = -105, α15 = -104, α16 = -103, α17 = -102, α18 = -101, β = 0 and γ = 2. For a 40% missing data percentage with a MNAR missing data mechanism, parameters were set as follows: α13 = -70, α14 = -69, α15 = -68, α16 = -67, α17 = -65, α18 = -64, β = 0 and γ = 2. In order to simulate 40% missing data with a combination of MAR and MNAR missing data mechanisms, we set α13 = -105, α14 = -104, α15 = -103, α16 = -102, α17 = -101, α18 = -100, β = 2 and γ = 2.
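A sketch of how monotone dropout could be imposed on one complete trajectory is given below. It reads the six intercepts as applying to visits 3 through 8, which is an interpretation of the notation rather than something the text states explicitly; the trajectory, seed and function name are hypothetical.

```python
import math
import random

def apply_dropout(y, alphas, beta, gamma, rng):
    """Return a copy of y with visits from the dropout time onward set to None.

    alphas maps a 1-based visit number j to its intercept; visits without an
    entry cannot be the dropout time. The last observed value y[j-2] plays the
    role of h_k and the current, possibly unobserved, value y[j-1] that of y_j.
    """
    for j in sorted(alphas):
        eta = alphas[j] + beta * y[j - 2] + gamma * y[j - 1]
        # Clamp the exponent so math.exp cannot overflow.
        p = 1.0 / (1.0 + math.exp(min(-eta, 700.0)))
        if rng.random() < p:
            return y[: j - 1] + [None] * (len(y) - j + 1)
    return list(y)

# Intercepts of the 10% MNAR setting, assumed here to index visits 3..8.
alphas = {3: -106, 4: -105, 5: -104, 6: -103, 7: -102, 8: -101}
complete = [17, 22, 30, 41, 55, 48, 39, 35]  # hypothetical complete trajectory

rng = random.Random(7)  # arbitrary seed
observed = apply_dropout(complete, alphas, beta=0, gamma=2, rng=rng)
print(observed)
```

Setting `gamma=0` with a nonzero `beta` gives a MAR mechanism, and setting both to zero gives MCAR, so the same function covers all the scenarios described.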

Two thousand simulations were performed for each method for missing data percentages of 10% and 40% and missing data mechanisms of MCAR, MAR, a combination of both MAR and MNAR, and MNAR. To meet the standards of computation-based analysis, the optimal number of simulations was calculated using the coverage probability of 95% around the estimated Type I error probability of .05 [35]. Using this method, the simulation sample size was approximately 2,000. This simulation size also results in both Type I error estimates and power estimates which had standard errors less than or equal to .01.
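The order of magnitude of this simulation size can be checked with the usual binomial approximation. The interval half-width of .01 used below is an assumption of this sketch, since the text does not state the width that was targeted, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def sims_needed(p, half_width, coverage=0.95):
    """Simulations needed so a normal-approximation interval for p has the given half-width."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)

def monte_carlo_se(p, n_sims):
    """Standard error of a proportion estimated from n_sims independent simulations."""
    return math.sqrt(p * (1.0 - p) / n_sims)

print(sims_needed(0.05, 0.01))
print(round(monte_carlo_se(0.05, 2000), 4))  # Type I error near the nominal level
print(round(monte_carlo_se(0.80, 2000), 4))  # power near the design value of 80%
```

Under that assumed half-width the calculation gives a little over 1,800 simulations, which is consistent with rounding up to 2,000.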

Several methods of combining independent data were used to analyze each data set. Specifically, participants were stratified into mutually exclusive missingness categories. Stratum-specific independent t-tests were computed using slope means for each treatment arm. Each t-statistic or p-value was weighted. Stratum-specific t-statistics or p-values were then combined into an aggregate statistic and compared to the standard normal distribution. Empirical size and power for each method of analysis were compared over 2000 simulations.

Choice of Weights

For a t-test, power should be maximized when $w_s$ is proportional to the noncentrality parameter of the distribution of each stratified test statistic, $Z_s$, for a given model [12, 16, 36]. A general weight that is proportional to the non-centrality parameter is $\sqrt{\dfrac{n_1 n_2}{n_1 + n_2}}$.

A variety of weights may be chosen to increase the power of the test. Estimates of population weights may also be utilized [11]. The population weights for each stratum can be defined as $w_s = \dfrac{n_s}{N}$, where $\sum_{s=1}^{t} w_s = 1$. The population weights will weight the t-tests produced from a larger proportion of the sample more heavily than those with a smaller sample size. Choice of weights will affect the power of the test; any weight that weights a more efficient estimate more heavily than a less efficient estimate will produce a more powerful test.
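The two weighting schemes just described can be computed as follows. The stratum sizes are hypothetical, and the function names are assumptions of this sketch.

```python
import math

def noncentrality_weight(n1, n2):
    """Weight proportional to the noncentrality parameter of a two-sample t-test."""
    return math.sqrt(n1 * n2 / (n1 + n2))

def population_weights(stratum_sizes):
    """Share of the total sample in each stratum; the weights sum to one."""
    total = sum(stratum_sizes)
    return [n / total for n in stratum_sizes]

# Hypothetical strata: (participants in arm 1, participants in arm 2).
arm_sizes = [(6, 4), (12, 8), (32, 38)]

nc_weights = [noncentrality_weight(n1, n2) for n1, n2 in arm_sizes]
pop_weights = population_weights([n1 + n2 for n1, n2 in arm_sizes])
print([round(w, 3) for w in nc_weights])
print([round(w, 3) for w in pop_weights])
```

Both schemes give the largest weight to the largest stratum, but the noncentrality weight also rewards balance between arms within a stratum.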

Another weight may incorporate the Sum of Squares for Time. Generally t-tests are uniformly most powerful tests; however, the t-tests do not incorporate the efficiency gained by measuring participants over a number of longitudinal time points. One way to improve efficiency may be to weight each t-test by the source of variation due to time. The Sum of Squares for Time may be calculated as

$$SS_{Time} = kn \sum_{j} \left(\bar{Y}_{\cdot j \cdot} - \bar{Y}_{\cdot\cdot\cdot}\right)^2$$

for each stratum and used to weight each t-test.

$$Z = \frac{\sum_{s=1}^{t} SS_{Time,s}\, t_s}{\sqrt{\sum_{s=1}^{t} SS_{Time,s}^2\, \dfrac{v_s}{v_s - 2}}}, \quad s = 1, \ldots, t.$$

Results

Overall the results demonstrate nominal Type I error probabilities for Fisher's Method, the Weighted Z-Transform Test and Modified SSS compared to SSS (using stratum specific t-tests) under a variety of assumptions. However, SSS produced larger Type I errors compared to the other methods. Further, the modified SSS which corrects for the degrees of freedom associated with the t-tests produced tests of nominal size. Type I error probabilities showed little variation for a 10% missing rate compared to a 40% missing rate.

Table 2 shows the Type I error probability under a variety of missing data percentages (10% and 40%) and mechanisms (MCAR, MAR, a combination of MAR and MNAR, and MNAR) for all methods. Simulations for this table assumed a small sample size of 100, a common correlation coefficient of 0.6 and 2000 simulations. For all conditions, the Type I error probabilities of SSS are larger than those of the other methods compared. The Fisher method produces the most conservative results in terms of Type I error; however, the differences are negligible. Finally, little variation is observed in the Type I error probabilities across the different missing data percentages and mechanisms.
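When judging whether differences between estimated Type I error probabilities are negligible, it can help to keep the Monte Carlo error of the simulation in mind. The short Python sketch below assumes a nominal level of 0.05 (an assumption; the level is not restated in this section) and treats each of the 2000 simulated datasets as an independent Bernoulli trial. The helper name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(p, n_sims):
    """Binomial standard error of a rejection rate estimated from n_sims runs."""
    return math.sqrt(p * (1 - p) / n_sims)


nominal = 0.05   # assumed nominal Type I error level
n_sims = 2000    # number of simulations reported for Tables 2 and 3

se = mc_standard_error(nominal, n_sims)
# Approximate 95% range for an estimated rejection rate when the true
# Type I error equals the nominal level (normal approximation).
lower = nominal - 1.96 * se
upper = nominal + 1.96 * se
```

Under these assumptions the standard error is about 0.005, so estimated rejection rates between roughly 0.04 and 0.06 are compatible with a nominal 0.05 level.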

Table 2 Type I Error Probabilities of Methods for Missing Data Percentages (10% and 40%) and Mechanisms

Power for each test differed depending on the method used as well as the missing data percentage and mechanism assumed. Table 3 shows the power under a variety of missing data percentages (10% and 40%) and mechanisms (MCAR, MAR, a combination of MAR and MNAR, and MNAR) for all methods. Simulations for this table assumed a small sample size of 100, a common correlation coefficient of 0.6 and 2000 simulations. Results in Table 3 demonstrate that power was generally greater for SSS than for all other methods; however, this may be due to the inflated Type I error probabilities previously discussed. Power was comparable across methods for the 10% missing data percentage. However, Fisher's method demonstrated a reduction in power for the 40% missing data percentage compared to modified SSS and the Weighted Z-Transform Test. Second only to SSS, the Weighted Z-Transform Test demonstrated robustness in power for all missing data percentages and mechanisms.

Table 3 Power of Methods for Missing Data Percentages (10% and 40%) and Mechanisms

For all methods, power decreases by at least 35% when the missing data percentage increases from 10% to 40%. The decrease is most dramatic for the Fisher method when the missing data percentage is 40% and the missing data mechanism is MAR or MNAR. In general, power fluctuates across the missing data mechanisms.

Conclusion

The statistical literature has an abundance of methods of analysis for longitudinal datasets with missing data. This paper focuses on missing data methods which can be used for hypothesis tests of the treatment effect when the missing data pattern is monotonic. Specifically, Dawson's stratified summary statistic and several other methods of combining data were assessed and developed for analysis with missing data due to their robustness to the missing data mechanism. That is, stratifying data by the missing data pattern, computing stratum-specific statistics and aggregating these statistics produces tests which have nominal Type I error and optimal power even in the presence of nonignorable missing data [12–16]. These hypothesis tests of the treatment effect, which are robust to the missing data mechanism, may be applicable to the analysis of substance abuse clinical trials because missing data in substance abuse trials are predominantly due to relapse, and therefore the missing data may be nonignorable or dependent upon previous outcomes.
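The three steps named above (stratify by missing data pattern, compute stratum-specific statistics, aggregate) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code behind the reported results: the summary statistic is taken to be each participant's mean over observed visits, the stratum test is a pooled-variance two-sample t-test, and the weight (the square root of the stratum degrees of freedom) is a hypothetical stand-in for a precision-based weight. The names `two_sample_t` and `stratified_z` are likewise hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def stratified_z(records):
    """records: iterable of (arm, outcomes) pairs, with arm 0 or 1 and
    outcomes the values observed before dropout (monotone missingness)."""
    # Step 1: stratify by dropout pattern (number of completed visits),
    # reducing each participant to a summary statistic (their mean).
    strata = defaultdict(lambda: ([], []))
    for arm, outcomes in records:
        strata[len(outcomes)][arm].append(mean(outcomes))

    numerator = 0.0
    var_sum = 0.0
    for arm0, arm1 in strata.values():
        # Skip strata too small for a t-test with finite variance.
        if len(arm0) < 2 or len(arm1) < 2 or len(arm0) + len(arm1) - 2 <= 2:
            continue
        # Step 2: stratum-specific test of the treatment effect.
        t, df = two_sample_t(arm0, arm1)
        # Step 3: aggregate; a t statistic with df degrees of freedom
        # has variance df / (df - 2).
        w = math.sqrt(df)  # hypothetical illustrative weight
        numerator += w * t
        var_sum += w ** 2 * df / (df - 2)
    if var_sum == 0.0:
        raise ValueError("no stratum is large enough to test")
    return numerator / math.sqrt(var_sum)


# Toy data: one stratum of completers with two visits each.
toy_records = [(0, [1, 1]), (0, [2, 2]), (0, [3, 3]),
               (1, [2, 2]), (1, [3, 3]), (1, [4, 4])]
toy_z = stratified_z(toy_records)
```

The resulting statistic would be referred to a standard normal distribution. Swapping in a different summary statistic (for example a slope) or a different weight only changes the marked lines.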

In this article, we have focused on two missing data percentages, a 10% rate and a 40% rate, with each treatment arm having similar amounts of missing data. In many clinical trials, the missing data percentage and/or mechanism may vary across treatment arms. Shih and Quan [21] demonstrate that Type I error may be inflated when the missing data percentage differs between treatment arms and the missing data mechanism is MAR. Further simulation studies may want to focus on these variations and their effects on the Type I error and power of hypothesis tests of the treatment effect.

This article demonstrates the impact that attrition can have on some of the statistical methods which are used for longitudinal data analysis. It should be noted that analysis should not be limited to these methods. These methods focus on testing hypotheses of the treatment effect. If the focus of a trial is on parameter estimation, a modeling approach of the missing data such as a pattern mixture or selection model may be more appropriate [11].

Furthermore, the stratified summary statistic methods possess a 'post hoc' quality. That is, we stratify on the pattern of missing data, which is not known until the data have been collected. Good statistical practice separates the design from the analysis; that is, both the study design and the analysis are specified in advance of data collection.

Although we will not know the exact pattern of missing data until all subject outcomes have been collected, it is well-known that substance abuse clinical trials are prone to high rates of attrition. Therefore, the use of missing data methods may be planned in advance of the study and may be specified in the study protocol. Furthermore, any reports of results from these analyses should be tempered with the knowledge that the analysis was dependent on the missing data pattern, which could not be fully discerned a priori.

The weighting schemes used in this paper are 'precision-based': they weight stratum statistics based on a larger number of participants and/or more time points more heavily than those based on fewer. These methods seem to suggest that 'treatment works for those who work for it'. That is, we are weighting those subjects who perform better in the clinical trial more than those who perform worse (those who tend to drop out due to relapse). However, these methods are preferred to 'complete case' analysis, which drops subjects with any missing data. Also, results from this simulation study and several other studies demonstrate that these methods are robust to the missing data mechanism in terms of hypothesis testing of the treatment effect [12–16].

Further studies should investigate the robustness in Type I error and power of stratified summary statistics as well as the bias and precision of the estimates of the treatment effect for these methods. Also, future studies may want to use other weighting schemes, including 'bias-based' weights [37]. However, use of bias-based weights would need to be justified a priori by determining the cause and direction of the bias incurred due to attrition in substance abuse clinical trials. Given the known history of attrition in substance abuse clinical trials, where much of the attrition may be attributed to relapse, bias-based weighting schemes may be justifiable in this setting.

The simulations for the comparisons of missing data methods could also be further generalized. For these particular simulations, missing data rates were set at 10% and 40%. We chose a missing data percentage of 40% because of the known high prevalence of missing data in longitudinal substance abuse clinical trials [2–6]. However, the simulations can be extended to intermediate missing data percentages in order to demonstrate changes in Type I error and power across a variety of missing data rates.

No matter how well-designed a clinical trial is, high attrition rates can bias its analysis. Validity in the presence of missing data is often dependent upon the method of analysis selected. Specifically, inappropriate methods may produce hypothesis tests of the treatment effect without appropriate size and/or power. Therefore, it is imperative that substance abuse clinical trials prepare for inevitable missing data due to attrition. That is, this paper demonstrates the need for policy development for evidence-based practice specific to the analysis of longitudinal substance abuse clinical trials in the presence of substantial drop-out. For example, given the wide variety of methods used for analysis of substance abuse clinical trials, we may want to specify that missing data methods be incorporated into the design and analysis given the unique properties of this research paradigm.