The Journal of Behavioral Health Services & Research, Volume 34, Issue 3, pp 272–289

Measuring Clinically Meaningful Change Following Mental Health Treatment

Authors

  • S. V. Eisen
    • Center for Health Quality, Outcomes and Economic Research (CHQOER), Edith Nourse Rogers Memorial Veterans Hospital
    • Department of Health Policy and Management, Boston University School of Public Health
  • Gayatri Ranganathan
    • MetaWorks Inc.
  • Pradipta Seal
    • Center for Health Quality, Outcomes and Economic Research (CHQOER), Edith Nourse Rogers Memorial Veterans Hospital
  • Avron Spiro III
    • Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System
    • Department of Epidemiology, Boston University School of Public Health
Regular Article

DOI: 10.1007/s11414-007-9066-2

Cite this article as:
Eisen, S.V., Ranganathan, G., Seal, P. et al. J Behav Health Serv Res (2007) 34: 272. doi:10.1007/s11414-007-9066-2

Abstract

Assessment of clinically meaningful change is useful for treatment planning, monitoring progress, and evaluating treatment response. Outcome studies often assess statistically significant change, which may not be clinically meaningful. Study objectives are to: (1) evaluate responsiveness of the BASIS-24© using three methods for determining clinically meaningful change: reliable change index (RCI), effect size (ES), and standard error of measurement (SEM); and (2) determine which method provides an estimate of clinically meaningful change most concordant with other change measures. BASIS-24© assessments were obtained at two time points for 1,397 inpatients and 850 outpatients. The proportion showing clinically meaningful change using each method was compared to the proportion showing change in global mental health, retrospectively reported change, and clinician-assessed change. BASIS-24© demonstrated responsiveness at both aggregate and individual levels. Regarding clinically meaningful improvement and decline, SEM was most concordant with all three outcome measures; regarding no change, RCI was most concordant with all three measures.

Keywords

clinically meaningful change; outcome assessment

Introduction

An important criterion for instruments designed to assess treatment effectiveness is that they be sensitive to change in the construct they were designed to measure.1,2 Sensitivity, or responsiveness to change, should be considered a distinct criterion in assessing the psychometric properties of health status instruments used for outcome evaluation.3 Studies of treatment effectiveness commonly use repeated measures designs in which outcomes are determined by comparing pre–post differences in symptomatology or functioning. Statistical significance of change is often assessed at the aggregate level using paired t tests between baseline (T1) and follow-up (T2), repeated measures analysis of variance or covariance, or generalized estimating equations.4,5 However, with reasonably large samples, it is possible for small differences that may not be clinically meaningful to reach statistical significance. In addition, these types of aggregate data analysis are not useful for determining individual changes, which can be very valuable in clinical practice for treatment planning, monitoring course of illness, and evaluating response to treatment.5–7

Liang3 differentiates sensitivity of an instrument, defined as an instrument’s capacity to measure any change, from responsiveness—an instrument’s capacity to measure clinically important or meaningful change. Clinically meaningful change can be defined as a noticeable, appreciable difference that is of value to the patient or the health professional, and that exceeds variation attributable to chance.3,8 Minimally important change has been defined as “the difference in score on a health-related ...instrument that corresponds to the smallest change in status that stakeholders (persons, patients, significant others, or clinicians) consider important”.9 Determining clinically meaningful or important change can facilitate interpretation of scores obtained on self-report measures of health status used to evaluate intervention and treatment effects at both the aggregate and individual level.5 Recent work by Hays et al.,7 and by Atkins et al.10 discussed and evaluated multiple methods for determining clinically meaningful change. However, no study to date has used multiple criteria to determine which method is most concordant with other widely used measures of change in mental health status.

The purposes of this article are: (1) to evaluate responsiveness of the 24-item revised Behavior and Symptom Identification Scale (BASIS-24©)11 using three distribution-based methods for identifying clinically meaningful change: the reliable change index (RCI), effect size (ES), and standard error of measurement (SEM); and (2) to determine which of the three methods provides an estimate of change that is most concordant with three other mental health outcome measures: change in self-reported global rating of mental health, a self-reported retrospective (transition) rating of change, and change based on a clinical rating of impairment [Global Assessment of Functioning (GAF)].8,11–13 Because individuals treated at different levels of care and in specialized versus generic programs might reasonably be expected to differ across a range of domains, including demographic and clinical characteristics, and amount of change they are likely to experience in different mental health domains, assessment of clinically meaningful change was done separately for inpatients and outpatients treated in mental health or substance abuse/dual diagnosis programs.

Methods for determining clinically meaningful change

The problems inherent in traditional methods of evaluating treatment effectiveness were noted almost 30 years ago, and since then, a number of methods have been developed to assess “clinically significant”, “meaningful”, or “minimally important change.”14–17 However, there is still no clear consensus regarding standards for determining clinically meaningful change.3,6,18,19

Effect size was first developed by Cohen to assess the magnitude of a treatment effect, originally using aggregate data, but later to assess change in individuals as well.20–22 Effect size is based on the ratio of the difference between baseline and follow-up scores to the standard deviation of the baseline score. Unlike significance tests, effect size is independent of sample size; consequently, it will not increase just by increasing sample size. Because they provide standardized measures of change, effect sizes can be used as benchmarks for understanding changes in health status.21 They can also be viewed as indicative of clinically meaningful change, based on research suggesting that a medium effect size corresponds to an amount of change that is noticeable to a careful observer.23 Based on Cohen’s d, effect sizes of 0.2 are considered small, 0.5–0.6 are considered medium, and ≥0.80 are considered large.20,23

Jacobson and Truax developed the reliable change index (RCI) as the first step of a two-step process for determining clinical significance of change.24 The RCI is calculated to determine whether the magnitude of change is statistically reliable. The posttreatment score is subtracted from the pretreatment score and divided by the standard error of the differences. If the absolute value of the resulting statistic (“t”) is greater than 1.96, then change is considered statistically reliable. If statistically reliable change is established, a second criterion is suggested for determining clinical significance: that the posttreatment score fall within the range of scores for a “normal” population. After development of this method, several refinements and enhancements were suggested.25–29 Although reliable change can be computed for any sample, identification of clinically meaningful change requires that norms be available for a “normal” population, which is sometimes not the case. Because there is no well-established standard for determining clinical significance, and because normative data from an untreated population are not available, the RCI analysis used in this study is limited to determination of statistically reliable change.

McHorney and Tarlov proposed the standard error of measurement (SEM) as a useful statistic for assessing individual change on health-related quality of life instruments, and its use has been described for evaluating meaningful change in a number of medical, cognitive, and behavioral conditions.6–8 The SEM is the standard deviation of an individual score, estimated by multiplying the standard deviation for a sample by the square root of one minus its reliability coefficient.30,31 Although varying statistical thresholds have been used to determine clinically meaningful change using the SEM, recent research has reported that one SEM consistently corresponded to a minimal clinically important intra-individual change.5,30,31 Consequently, in this paper, one SEM will be used as the criterion for clinically meaningful change in BASIS-24©.

Methods

Sample

The sample consisted of 2,248 English-speaking adults treated in one of 27 inpatient (n = 1,398) or outpatient (n = 850) mental health and/or substance abuse treatment programs who completed self-report mental health assessments at two time points: admission and within 24 h before discharge (for inpatients), or intake and 30–90 days later (for outpatients). This sample included all sites participating in a field test of the revised Behavior and Symptom Identification Scale. Details regarding site and sample selection have been previously reported.11 The majority of both the inpatient and outpatient samples were between 25 and 44 years old, had at least a high school education, and were unemployed or employed less than 10 h per week in the previous 30 days. Sample characteristics are presented in Table 1.
Table 1

Characteristics of the inpatient and outpatient samples

| Characteristic | Inpatient sample (N = 1,397): number of patients (N) | Inpatient sample: percentage (%) | Outpatient sample (N = 850): number of patients (N) | Outpatient sample: percentage (%) |
|---|---|---|---|---|
| Age | | | | |
| 18–24 | 214 | 15.3 | 160 | 18.8 |
| 25–34 | 324 | 23.2 | 253 | 29.8 |
| 35–44 | 439 | 31.4 | 252 | 29.7 |
| 45–54 | 300 | 21.5 | 129 | 15.2 |
| 55+ | 120 | 8.6 | 56 | 6.6 |
| Gender | | | | |
| Male | 773 | 55.3 | 380 | 44.7 |
| Female | 624 | 44.7 | 470 | 55.3 |
| Race/ethnicity | | | | |
| White | 860 | 61.8 | 654 | 77.5 |
| Black/African-American | 385 | 27.7 | 67 | 7.9 |
| Latino | 56 | 4.0 | 59 | 7.0 |
| Other/multiracial | 90 | 6.5 | 64 | 7.6 |
| Marital status | | | | |
| Never married | 660 | 47.2 | 366 | 43.5 |
| Married | 271 | 19.4 | 253 | 30.1 |
| Separated/divorced | 419 | 30.0 | 213 | 25.3 |
| Widowed | 46 | 3.3 | 9 | 1.1 |
| Education | | | | |
| 8th grade or less | 95 | 6.9 | 14 | 1.7 |
| Some high school | 264 | 19.1 | 99 | 11.8 |
| High school graduate/GED | 444 | 32.2 | 249 | 29.6 |
| Some college | 350 | 25.4 | 292 | 34.7 |
| 4-year college graduate | 226 | 16.4 | 187 | 22.2 |
| Employed in past 30 days | | | | |
| No | 913 | 65.5 | 385 | 45.8 |
| Yes, 1–10 h | 77 | 5.5 | 49 | 5.8 |
| Yes, 11–30 h | 112 | 8.0 | 104 | 12.4 |
| Yes, more than 30 h | 292 | 21.0 | 303 | 36.0 |
| Primary psychiatric diagnosis | | | | |
| Schizophrenia/schizoaffective disorder | 339 | 25.1 | 25 | 3.5 |
| Depressive disorder | 342 | 25.4 | 231 | 32.6 |
| Bipolar disorder | 189 | 14.0 | 79 | 11.1 |
| Alcohol/drug use disorder | 363 | 26.9 | 200 | 28.2 |
| Anxiety disorder | 34 | 2.5 | 57 | 8.0 |
| Adjustment disorder | 38 | 2.8 | 85 | 12.0 |
| Other disorder | 43 | 3.2 | 32 | 4.5 |
| Insurance | | | | |
| Self pay | 78 | 5.8 | 110 | 21.1 |
| Commercial | 382 | 28.3 | 233 | 44.6 |
| Medicare | 271 | 20.1 | 27 | 5.2 |
| Medicaid | 214 | 15.8 | 130 | 24.9 |
| Uninsured | 405 | 30.0 | 22 | 4.2 |
| Program type | | | | |
| Mental health | 1,164 | 83.3 | 593 | 69.8 |
| Substance abuse/dual diagnosis | 233 | 16.7 | 257 | 30.2 |

Instruments

The revised 24-item Behavior and Symptom Identification Scale (BASIS-24©) was used to assess change following treatment. BASIS-24© is a revised version of the BASIS-32® mental health outcome instrument consisting of 24 self-report items assessing six domains: depression/functioning, difficulty in interpersonal relationships, self-harm, emotional lability, psychotic symptoms, and substance abuse.11,32 In addition, an overall mental health score is computed. Items are rated on a 5-point scale, with higher numbers indicating greater symptom/problem frequency or severity.

Three other measures of change were included to provide alternative measures of mental health status from which change can be determined or inferred. First, a self-reported global rating of mental health during the past week, also assessed on a 5-point rating scale (0 = poor, 1 = fair, 2 = good, 3 = very good, 4 = excellent), was used. Previous research has reported a correlation of 0.65 for inpatients and 0.76 for outpatients between this rating and the BASIS-24© overall mental health score.11 Second, a 5-point rating of perceived change (0 = much worse, 1 = somewhat worse, 2 = about the same, 3 = somewhat better and 4 = much better) was used. This rating corresponds to transition ratings of change, which have also been used to determine clinically meaningful change in health status.12 Third, the clinician-rated GAF, extracted from medical records or administrative databases, provided an external measure of level of impairment. The GAF is a single-item, clinician rating scale assessing overall psychological symptoms, social, and occupational functioning.13 It is part of the multi-axial (Axis V), psychiatric diagnostic system.33 Ratings can range from 1 (worst functioning) to 100 (best functioning). It is the most widely used measure of psychiatric impairment, and previous research has reported good reliability and validity.34 Because not all study sites required GAF ratings at two time points, they were not available for all study participants. However, GAF ratings at two time points were available for 385 inpatients and 328 outpatients.

Demographic characteristics were collected by respondent self-report. Payer and psychiatric diagnosis, including GAF ratings, were obtained from medical records or administrative databases.

Procedure

BASIS-24© and the global mental health rating were administered twice, upon admission to an inpatient service or intake to an outpatient program (T1), and in the 24-h period before discharge (for inpatients) or 4–8 weeks after intake for outpatients (T2). The transition rating of change was obtained at T2 for inpatients and outpatients. GAF ratings at T1 and T2 were extracted from medical records or administrative databases. Data collection was undertaken by program staff within the context of continuous quality improvement activities. Verbal consent was obtained from all participants. This data collection process was approved by the Institutional Review Board of the grantee institution and by each participating site.

Data analysis

BASIS-24© subscale and overall scores at T1 and T2 were first computed,35 and statistical significance of the differences was assessed using paired t tests. Aggregate effect sizes were then computed as the difference between T1 and T2 group means, M1 − M2, divided by the standard deviation at T1, s1; hence, \( \text{ES} = \frac{M_{1} - M_{2}}{s_{1}} \).20 Individual change scores were then computed using the three methods described above (RCI, ES, and SEM).
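As a quick check on the aggregate formula, the short Python sketch below recomputes one effect size from the group means and Time 1 standard deviation reported in Table 2 (inpatient mental health programs, depression/functioning domain). The function name is illustrative only and is not taken from the study's analysis code.

```python
def aggregate_effect_size(mean_t1, mean_t2, sd_t1):
    """Aggregate effect size: (M1 - M2) / s1.

    BASIS-24 scores fall as symptoms ease, so a positive value
    indicates improvement between Time 1 and Time 2.
    """
    return (mean_t1 - mean_t2) / sd_t1


# Inpatient mental health, depression/functioning (values from Table 2)
es = aggregate_effect_size(mean_t1=2.19, mean_t2=1.15, sd_t1=1.17)
print(round(es, 2))  # prints 0.89
```

The result matches the effect size of 0.89 listed for that cell in Table 2.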

The RCI was computed using a modification suggested by Hageman and Arrindell and recently used by Jerrell to evaluate reliable change of the BASIS-32®.28,36 This modification produces an “improved” standardized Time 1 − Time 2 difference score by adjusting for regression to the mean. It is calculated as
$$ {\text{RC}}_{{{\text{ID}}}} = \frac{{{\left( {x_{1} - x_{2} } \right)}r_{{{\text{DD}}}} + {\left( {M_{1} - M_{2} } \right)}{\left( {1 - r_{{{\text{DD}}}} } \right)}}} {{{\sqrt {{\text{SEM}}_{1} ^{2} + {\text{SEM}}_{2} ^{2} } }}} $$
where RCID = the reliable change index using improved difference (ID) scores, x1 = Time 1 score, x2 = Time 2 score, M1 = mean of Time 1 scores, M2 = mean of Time 2 scores, SEM1 = standard error of measurement of Time 1 scores, SEM2 = standard error of measurement of Time 2 scores, and rDD = reliability of difference scores. Study participants were then placed into three groups based on their RCI scores: reliable decline (RCI < −1.96), no reliable change (RCI between −1.96 and +1.96), and reliable improvement (RCI > 1.96).
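The formula and cutoffs above can be sketched in Python as follows. This is a minimal illustration, not the study's code: the function names and all numeric inputs in the example are hypothetical, and the reliability of difference scores (r_dd) is treated as a known input because its estimation is not described in this section.

```python
import math


def rci_improved_difference(x1, x2, m1, m2, sem1, sem2, r_dd):
    """RC_ID: reliable change index using improved difference scores.

    x1, x2     : an individual's Time 1 and Time 2 scores
    m1, m2     : sample means at Time 1 and Time 2
    sem1, sem2 : standard errors of measurement at Time 1 and Time 2
    r_dd       : reliability of difference scores (assumed known here)
    """
    numerator = (x1 - x2) * r_dd + (m1 - m2) * (1 - r_dd)
    return numerator / math.sqrt(sem1 ** 2 + sem2 ** 2)


def classify_rci(rci, cutoff=1.96):
    """Three-way grouping using the +/-1.96 cutoffs described in the text."""
    if rci > cutoff:
        return "reliable improvement"
    if rci < -cutoff:
        return "reliable decline"
    return "no reliable change"


# Hypothetical individual and sample values, for illustration only
rci = rci_improved_difference(x1=2.5, x2=1.0, m1=1.9, m2=1.1,
                              sem1=0.28, sem2=0.21, r_dd=0.74)
print(round(rci, 2), classify_rci(rci))
```

Note that an individual whose score does not change (x1 = x2) still receives a nonzero index when the sample mean changes, which is how the adjustment for regression to the mean enters the calculation.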

Effect size for individuals was computed as the difference between each individual’s T1 and T2 scores, divided by the group standard deviation at T1, s1; hence, \( \text{ES} = \frac{x_{1} - x_{2}}{s_{1}} \).22 Participants were then grouped based on whether their individual ES suggested a medium to large decline (ES < −0.50), no effect or a small effect (−0.49 to 0.49), or a medium to large improvement (ES > 0.50).20 Use of ES > 0.50 was selected based on a review of 29 research studies in which it was found that the mean minimally important difference was almost exactly equal to an effect size of 0.50.23
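A minimal Python sketch of the individual effect size grouping follows. The function names and the two individual score pairs are hypothetical; the Time 1 standard deviation of 0.86 is the inpatient mental health overall value from Table 2. Treating values of exactly ±0.50 as "no or small effect" is an assumption of this sketch.

```python
def individual_effect_size(x1, x2, sd_t1):
    """Individual ES: (x1 - x2) / s1, using the group SD at Time 1."""
    return (x1 - x2) / sd_t1


def classify_es(es, cutoff=0.50):
    """Group an individual by the 0.50 effect size criterion."""
    if es > cutoff:
        return "improved"
    if es < -cutoff:
        return "declined"
    return "no or small effect"


sd_t1 = 0.86  # inpatient mental health, overall score (Table 2)
for x1, x2 in [(2.0, 1.4), (2.0, 1.8)]:  # hypothetical individuals
    es = individual_effect_size(x1, x2, sd_t1)
    print(round(es, 2), classify_es(es))
```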

The SEM was computed as \( s_{1}\sqrt{1 - r_{xx}} \), where s1 is the standard deviation at Time 1 and rxx is the internal consistency reliability coefficient at Time 1.30,31 As for the RCI and ES, participants were then categorized as declined if their T1 − T2 difference scores declined by at least one SEM, stable if their T1 − T2 difference scores changed by less than one SEM in either direction, and improved if their T1 − T2 difference scores improved by at least one SEM.
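The SEM criterion can be sketched the same way. The example below uses the Time 1 standard deviation (0.86, Table 2) and Cronbach's alpha (0.87, Table 4) reported for the overall score among inpatients in mental health programs; the function names and the individual scores are hypothetical.

```python
import math


def standard_error_of_measurement(sd_t1, r_xx):
    """SEM = s1 * sqrt(1 - r_xx), with r_xx the Time 1 reliability."""
    return sd_t1 * math.sqrt(1 - r_xx)


def classify_sem(x1, x2, sem):
    """Improved or declined if the T1 - T2 difference is at least one SEM."""
    diff = x1 - x2  # positive difference = lower symptom score at Time 2
    if diff >= sem:
        return "improved"
    if diff <= -sem:
        return "declined"
    return "stable"


sem = standard_error_of_measurement(sd_t1=0.86, r_xx=0.87)
print(round(sem, 2))                # about 0.31 scale points
print(classify_sem(2.0, 1.6, sem))  # hypothetical individual
```

With these inputs, a change of roughly 0.31 points on the 0–4 overall score is enough to count as meaningful under the one-SEM criterion.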

Change in global mental health was calculated by subtracting Time 1 from Time 2 global mental health ratings, yielding change scores ranging from −4 to +4. Because higher ratings indicate better mental health, all positive scores were categorized as improved, scores of 0 were categorized as unchanged, and all negative scores were categorized as declined. Transition ratings were categorized into three groups: worse (including ratings of “much worse” and “somewhat worse”), same, and improved (“somewhat better” and “much better”). Change in GAF ratings was computed by subtracting T1 from T2 ratings. Difference scores were considered improved if T2 ratings were 10 or more points higher than T1 scores, worse if T2 scores were 10 or more points lower than T1 scores, and unchanged if the T2 − T1 difference was less than 10 points. The 10-point difference criterion was used because the GAF uses 10-point ranges to define impairment severity levels, so that a 10-point change represents a change in level of impairment based on the clinician’s assessment.
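The three comparison measures can be categorized with a few small functions, sketched below in Python. Higher values indicate better status on both the global rating and the GAF, so this sketch takes change as Time 2 minus Time 1 so that positive values mean improvement; that sign convention, like the function names, is an assumption made for illustration.

```python
def classify_global_change(rating_t1, rating_t2):
    """Global mental health rating, 0 (poor) to 4 (excellent)."""
    delta = rating_t2 - rating_t1  # assumed convention: positive = better
    if delta > 0:
        return "improved"
    if delta < 0:
        return "declined"
    return "unchanged"


def classify_transition(rating):
    """Transition rating, 0 (much worse) to 4 (much better)."""
    if rating in (0, 1):
        return "worse"
    if rating == 2:
        return "same"
    if rating in (3, 4):
        return "improved"
    raise ValueError("transition rating must be an integer from 0 to 4")


def classify_gaf_change(gaf_t1, gaf_t2, min_points=10):
    """GAF rating, 1 (worst) to 100 (best); 10-point change criterion."""
    delta = gaf_t2 - gaf_t1
    if delta >= min_points:
        return "improved"
    if delta <= -min_points:
        return "worse"
    return "unchanged"


print(classify_global_change(1, 3))  # fair -> very good
print(classify_transition(2))        # about the same
print(classify_gaf_change(45, 52))   # 7-point gain, below the criterion
```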

The weighted kappa statistic was used to determine agreement in the proportion of participants categorized as declined, stable, or improved based on each of the three methods for evaluating individual change.37 To determine which of the three methods is most concordant with clinically meaningful change on the BASIS overall mental health score, the proportion of individuals identified as meaningfully improved, unchanged, or declined, who were also categorized as improved, unchanged, or declined based on the global mental health rating was calculated using RCI, ES, and SEM, respectively. This analysis was then replicated for the two other change measures (transition rating of improvement and change in GAF rating) to determine concordance of change on these measures with change in the BASIS overall score.
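For readers unfamiliar with the weighted kappa, the Python sketch below computes it for two sets of ordered three-category classifications. The text does not state which weighting scheme was used, so linear disagreement weights are assumed here; the function name and the four-person example are hypothetical.

```python
def weighted_kappa(labels_a, labels_b, categories):
    """Linearly weighted kappa for two ordinal classifications.

    categories lists the ordered levels, e.g. declined < stable < improved.
    Disagreement weight for cells (i, j) is |i - j| / (k - 1).
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(labels_a)

    # Observed joint proportions
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        observed[index[a]][index[b]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    return 1 - weighted_observed / weighted_expected


levels = ["declined", "stable", "improved"]
by_es = ["declined", "stable", "improved", "improved"]
by_sem = ["declined", "stable", "stable", "improved"]
print(round(weighted_kappa(by_es, by_sem, levels), 2))  # prints 0.71
```

Under this weighting, a disagreement between adjacent categories (stable versus improved) is penalized half as much as a disagreement between declined and improved.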

Results

Aggregate change over time

Table 2 presents mean Time 1 and Time 2 scores and aggregate effect sizes for study participants. Although all but one of the Time 1 − Time 2 differences were statistically significant, effect sizes varied by symptom/problem domain and by program type. Among inpatients, large effect sizes (>0.80) were found for the depression/functioning domain and for the overall score among individuals treated in both mental health and substance abuse programs. Among inpatients in substance abuse programs, a large effect was also found in the alcohol/drug use domain. Effect sizes in the other domains were moderate for inpatients. Among outpatients, effect sizes were low to moderate for those in mental health programs (0.21–0.45). In all domains except for alcohol/drug use, effect sizes were smaller for those treated in substance abuse programs than for those treated in mental health programs.
Table 2

Mean symptom/problem scores (SD) at two time points for inpatients and outpatients treated in mental health and substance abuse programs*

MH = mental health programs; SA = substance abuse programs. Inpatient MH, n = 1,164; inpatient SA, n = 233; outpatient MH, n = 593; outpatient SA, n = 257.

| BASIS-24© subscale | Inpatient MH: Time 1 | Inpatient MH: Time 2 | Inpatient MH: Effect size | Inpatient SA: Time 1 | Inpatient SA: Time 2 | Inpatient SA: Effect size | Outpatient MH: Time 1 | Outpatient MH: Time 2 | Outpatient MH: Effect size | Outpatient SA: Time 1 | Outpatient SA: Time 2 | Outpatient SA: Effect size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depression/functioning | 2.19 (1.17) | 1.15 (0.89) | 0.89 | 2.35 (0.94) | 1.31 (0.78) | 1.11 | 2.05 (1.04) | 1.61 (1.00) | 0.42 | 1.30 (1.10) | 0.95 (0.85) | 0.32 |
| Interpersonal relationships | 1.79 (1.07) | 1.31 (1.06) | 0.45 | 1.63 (0.99) | 1.15 (0.91) | 0.49 | 1.51 (0.97) | 1.27 (0.92) | 0.25 | 1.15 (1.08) | 0.97 (0.87) | 0.17 |
| Self-harm | 1.23 (1.28) | 0.43 (0.76) | 0.63 | 0.70 (1.01) | 0.26 (0.61) | 0.44 | 0.62 (0.95) | 0.41 (0.77) | 0.22 | 0.23 (0.59) | 0.14 (0.43) | 0.15 |
| Emotional lability | 1.95 (1.15) | 1.23 (0.96) | 0.63 | 2.01 (0.99) | 1.43 (0.85) | 0.59 | 2.04 (1.04) | 1.69 (1.01) | 0.34 | 1.43 (1.17) | 1.19 (0.95) | 0.21 |
| Psychotic symptoms | 1.20 (1.17) | 0.70 (0.90) | 0.43 | 0.63 (0.86) | 0.35 (0.63) | 0.33 | 0.66 (0.83) | 0.49 (0.76) | 0.20 | 0.49 (0.86) | 0.33 (0.60) | 0.20 |
| Alcohol/drug use | 0.94 (1.15) | 0.61 (0.87) | 0.29 | 2.67 (0.93) | 1.65 (0.91) | 1.1 | 0.53 (0.82) | 0.36 (0.61) | 0.21 | 1.12 (1.11) | 0.70 (0.79) | 0.38 |
| Overall mean | 1.84 (0.86) | 1.05 (0.67) | 0.92 | 1.89 (0.71) | 1.14 (0.57) | 1.1 | 1.64 (0.78) | 1.30 (0.74) | 0.44 | 1.13 (0.86) | 0.85 (0.64) | 0.33 |

*For inpatients, all differences between Time 1 and Time 2 are statistically significant (p < 0.001).

For outpatients, all differences between Time 1 and Time 2 are statistically significant (p < 0.001) except self-harm for individuals treated in a substance abuse program.

Individual change over time

Table 3 presents the proportion of individuals who showed clinically meaningful improvement on each BASIS-24© subscale and on the overall score, using each of the three methods. Consistent with the aggregate results, all three methods showed a higher proportion of inpatients than outpatients with clinically meaningful improvement. The proportion of inpatients showing clinically meaningful improvement on each BASIS-24© subscale and on the overall score ranged from 8–54% based on RCI, 29–73% based on ES, and 33–75% based on SEM. Corresponding rates of clinically meaningful improvement for outpatients ranged from <1–24% for RCI, 13–43% for ES, and 13–53% for SEM. For both inpatients and outpatients, the RCI method identified the fewest individuals as meaningfully improved. SEM identified the most individuals as meaningfully improved, with a few exceptions. For three program types, ES identified the same proportion of individuals as meaningfully improved as SEM (psychotic symptoms among inpatients treated in dual diagnosis/substance abuse programs, emotional lability among inpatients treated in mental health programs, and self-harm among outpatients treated in dual diagnosis/substance abuse programs). For three domains within three program types, ES identified somewhat more individuals as meaningfully improved than SEM (emotional lability and alcohol/drug use among inpatients in dual diagnosis/substance abuse programs and psychotic symptoms among outpatients in mental health programs).
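Table 3

Percent of individuals showing clinically meaningful improvement based on effect size (ES) >0.50, reliable change index (RCI), and one standard error of measurement (SEM)

MH = mental health programs; SA/dual Dx = substance abuse/dual diagnosis programs. Inpatient MH, n = 1,164; inpatient SA/dual Dx, n = 233; outpatient MH, n = 593; outpatient SA/dual Dx, n = 257.

| BASIS-24© subscale | Inpatient MH: RCI | Inpatient MH: Effect size | Inpatient MH: SEM | Inpatient SA/dual Dx: RCI | Inpatient SA/dual Dx: Effect size | Inpatient SA/dual Dx: SEM | Outpatient MH: RCI | Outpatient MH: Effect size | Outpatient MH: SEM | Outpatient SA/dual Dx: RCI | Outpatient SA/dual Dx: Effect size | Outpatient SA/dual Dx: SEM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depression/functioning | 48.3 | 62.9 | 68.0 | 53.2 | 73.4 | 77.2 | 24.3 | 43.3 | 52.3 | 19.8 | 31.9 | 44.8 |
| Interpersonal relationships | 23.0 | 45.1 | 48.2 | 22.3 | 43.8 | 47.2 | 8.4 | 34.1 | 36.3 | 10.5 | 25.7 | 30.7 |
| Self-harm | 32.5 | 43.6 | 48.8 | 25.7 | 30.0 | 36.1 | 11.8 | 23.6 | 27.8 | 8.9 | 13.2 | 13.2 |
| Emotional lability | 14.6 | 52.3 | 54.0 | 13.7 | 55.8 | 55.8 | 0.5 | 41.0 | 41.0 | 3.9 | 32.3 | 34.6 |
| Psychotic symptoms | 8.1 | 37.8 | 38.8 | 9.0 | 33.9 | 33.9 | 0.5 | 24.6 | 24.5 | 5.8 | 21.0 | 21.8 |
| Alcohol/drug use | 8.8 | 29.5 | 32.8 | 20.6 | 70.0 | 68.2 | 4.9 | 21.4 | 21.8 | 13.2 | 31.9 | 34.6 |
| Overall mean | 47.6 | 63.5 | 70.3 | 55.4 | 70.8 | 76.8 | 23.6 | 42.3 | 52.6 | 20.6 | 31.5 | 48.3 |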

Across program types, of those who did not show clinically meaningful improvement, most showed no meaningful change (46–99% based on RCI, 23–79% based on ES, and 19–79% based on SEM; data not shown). Relatively few individuals showed clinically meaningful decline (0–5% based on RCI, 3–18% based on ES, and 3–22% based on SEM).

Results varied by both symptom/problem domain and program type. Although the highest proportion of individuals improved in the depression/functioning domain and in overall mental health regardless of program type, more individuals treated in substance abuse or dual diagnosis programs improved on the alcohol/drug use subscale than individuals in mental health programs, a finding that is consistent with the problems of individuals treated in substance abuse programs.

Agreement among the methods

Agreement among the three methods for assessing clinically meaningful change varied by domain and program type. With respect to the depression/functioning domain, agreement between ES and SEM was fairly high, with weighted kappas ranging from 0.73 to 0.93 (data not shown). However, agreement between the RCI and SEM for the same domain was low to moderate (weighted kappas ranged from 0.38 to 0.55), and agreement between ES and RCI was moderate (0.54 to 0.65). Regarding the BASIS-24© overall score, all three methods identified 49% of inpatients as meaningfully improved, 23% as unchanged, and <1% as meaningfully declined. Across all change groups (declined, stable, and improved), the three methods were in full agreement for 72% of inpatients. For outpatients, all three methods identified 23% as meaningfully improved, 34% as unchanged, and 2% as meaningfully worse. Agreement across all groups was 59%.

Relationship among the measures

This level of agreement can be explained by the common elements shared by each method for computing meaningful change. For the SEM, the criterion for clinically meaningful change is at least 1 standard error of measurement; for medium or larger effect size, it is \( \left( 0.5 / \sqrt{1 - r_{xx}} \right) \text{SEM}_{1} \); and for RCI, it is \( \frac{1.96\sqrt{\text{SEM}_{1}^{2} + \text{SEM}_{2}^{2}} - \left( M_{1} - M_{2} \right)\left( 1 - r_{\text{DD}} \right)}{r_{\text{DD}}} \). It can be shown mathematically that the minimum change required for clinically meaningful improvement is proportional to the standard error of measurement (Seal P, Glickman M, Eisen SV. The Relationship between SEM, effect size and RCI when classifying clinically meaningful change. Manuscript in preparation; 2006).
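The three criteria above can be placed side by side as minimum T1 − T2 differences. The Python sketch below does this under stated assumptions: the function names are illustrative, the SEM and mean values passed to the RCI function are hypothetical, and r_dd is supplied as a known quantity.

```python
import math


def min_change_sem(sem1):
    """One-SEM criterion: the threshold is SEM1 itself."""
    return sem1


def min_change_es(sem1, r_xx, cutoff=0.5):
    """ES criterion: 0.5 * s1, rewritten as (0.5 / sqrt(1 - r_xx)) * SEM1."""
    return cutoff / math.sqrt(1 - r_xx) * sem1


def min_change_rci(sem1, sem2, m1, m2, r_dd, z=1.96):
    """RCI criterion, obtained by solving RC_ID > 1.96 for x1 - x2."""
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
    return (z * se_diff - (m1 - m2) * (1 - r_dd)) / r_dd


# With SEM1 = 1, thresholds are expressed in SEM units
print(min_change_sem(1.0))                  # 1.0
print(min_change_es(1.0, r_xx=0.75))        # 1.0, same as the SEM criterion
print(round(min_change_es(1.0, r_xx=0.84), 2))  # larger than one SEM

# Hypothetical SEMs, means, and r_dd for the RCI threshold
print(round(min_change_rci(0.28, 0.21, 1.9, 1.1, 0.74), 2))
```

Because 0.5/√(1 − 0.75) = 1, the ES and SEM thresholds coincide when Time 1 reliability is 0.75; higher reliability makes the ES criterion the stricter of the two.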

Comparing SEM with ES, if T1 reliability (rxx) is greater than 0.75, larger T1 − T2 differences are needed to show clinically meaningful change based on ES than SEM, and all individuals who improve based on ES will also improve based on one SEM.8 The reverse phenomenon will happen when T1 rxx is less than 0.75, in which case larger T1 − T2 differences are required to show clinically meaningful change based on SEM than ES. When T1 rxx = 0.75, the same T1 − T2 difference will identify individuals as meaningfully improved for both ES and SEM. As reported in Table 3, SEM identified more individuals as meaningfully improved than ES in all except six of the 84 cells, and in five of these six cells T1 reliability was 0.75 or less (Table 4). The only exception occurs in the self-harm domain for outpatients in substance abuse programs, in which both ES and SEM identified 13.2% of individuals as meaningfully improved, and T1 reliability (0.81) exceeded 0.75. In this case, the minimum T1 − T2 difference required for meaningful improvement based on ES is greater than 1.11 times the SEM at T1. With respect to outpatients in substance abuse programs, all those who showed meaningful improvement in the self-harm domain based on SEM had T1 − T2 difference scores greater than 1.11 times the T1 SEM. Consequently, the rates of improvement based on ES and SEM were the same.
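Table 4

Cronbach’s alpha (α) reliability coefficients for BASIS-24© subscales by program type

MH = mental health programs; SA/dual Dx = substance abuse/dual diagnosis programs. Inpatient MH, n = 1,164; inpatient SA/dual Dx, n = 233; outpatient MH, n = 593; outpatient SA/dual Dx, n = 257.

| BASIS-24© subscale | Inpatient MH: T1 α | Inpatient MH: T2 α | Inpatient SA/dual Dx: T1 α | Inpatient SA/dual Dx: T2 α | Outpatient MH: T1 α | Outpatient MH: T2 α | Outpatient SA/dual Dx: T1 α | Outpatient SA/dual Dx: T2 α |
|---|---|---|---|---|---|---|---|---|
| Depression/functioning | 0.88 | 0.86 | 0.86 | 0.84 | 0.90 | 0.91 | 0.92 | 0.91 |
| Interpersonal relationships | 0.82 | 0.87 | 0.84 | 0.82 | 0.81 | 0.83 | 0.89 | 0.85 |
| Self-harm | 0.88 | 0.82 | 0.89 | 0.86 | 0.87 | 0.88 | 0.81 | 0.78 |
| Emotional lability | 0.77 | 0.76 | 0.78 | 0.72 | 0.75 | 0.76 | 0.83 | 0.76 |
| Psychotic symptoms | 0.77 | 0.76 | 0.75 | 0.76 | 0.73 | 0.73 | 0.86 | 0.72 |
| Alcohol/drug use | 0.86 | 0.80 | 0.68 | 0.63 | 0.80 | 0.75 | 0.82 | 0.73 |
| Overall mean | 0.87 | 0.87 | 0.87 | 0.85 | 0.90 | 0.90 | 0.94 | 0.91 |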

Comparing ES to RCI, if rxx is less than 0.94 and the ratio of sample mean improvement to SEM at T1 is greater than 1.96, then ES will identify more people as meaningfully improved if rDD satisfies Eq. 1 below:
$$ r_{{{\text{DD}}}} \geqslant \frac{{{\text{ Ratio of sample mean improvement to SEM}}_{{\text{1}}} {\text{ at T}}_{{\text{1}}} - 1.96{\sqrt {1 + \frac{{{\text{SEM}}_{2} ^{2} }} {{{\text{SEM}}_{1} ^{2} }}} }}} {{{\text{Ratio of sample mean improvement to SEM}}_{{\text{1}}} {\text{ at T}}_{{\text{1}}} - \frac{{0.5}} {{{\sqrt {1 - r_{{xx}} } }}}}} $$
(1)
RCI could identify more people as meaningfully improved than ES if the ratio of sample mean improvement to SEM at T1 is greater than 1.96 and rDD is low. Using an example from the current dataset, the ratio of the mean difference to SEM1 on the BASIS-24© overall mental health score for inpatients treated in substance abuse programs was 2.73. rDD for this group was 0.74, which is greater than the quantity (0.20) derived from Eq. 1, as shown in Eq. 2 below:
$$ \frac{{2.73 - 1.96{\sqrt {1 + 0.56} }}} {{2.73 - \frac{{0.5}} {{{\sqrt {1 - 0.85} }}}}} = 0.20 $$
(2)

If rDD was lower than the quantity in Eq. 1, then RCI could identify more individuals as meaningfully improved than ES. However, this never occurred in the dataset, and it is quite unlikely to occur because an instrument with an rDD that low would not be considered reliable enough to measure change in health status.

Comparing SEM to RCI, when the T1 SEM is greater than the sample mean difference (M1 − M2), or if M1 − M2 is less than 1.96 times the T1 SEM, then the minimum difference required for meaningful improvement based on SEM will always be lower than the minimum difference required for meaningful improvement based on RCI, and SEM will identify more individuals as meaningfully improved than RCI. When the ratio of mean difference to SEM1 > 1.96, SEM will still identify more individuals as improved than RCI if
$$r_{{{\text{DD}}}} \geqslant \frac{{{\text{ Ratio of sample mean improvement to SEM}}_{{\text{1}}} {\text{ at T}}_{{\text{1}}} - 1.96{\sqrt {1 + \frac{{{\text{SEM}}^{{\text{2}}}_{{\text{2}}} }}{{{\text{SEM}}^{{\text{2}}}_{{\text{1}}} }}} }}}{{{\text{Ratio of sample mean improvement to SEM}}_{{\text{1}}} {\text{ at T}}_{{\text{1}}} - 1}}$$
(3)

Using the previous example and Eq. 3, rDD would need to be greater than 0.16 (a lower rDD requirement than for ES) for SEM to identify more individuals as improved than RCI, which, again, was always the case in this dataset. It is possible for RCI to identify more individuals as meaningfully improved than SEM only when rDD is less than the quantity on the right-hand side of Eq. 3. However, as with ES, this is highly unlikely because an instrument with an rDD that low is not reliable enough to measure change.
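The two lower bounds on rDD quoted in this section can be reproduced numerically. The Python sketch below evaluates the right-hand sides of Eq. 1 and Eq. 3 with the values given in the worked example (ratio of mean improvement to SEM1 of 2.73, SEM2²/SEM1² of 0.56, and a reliability of 0.85 in the ES term); the function and argument names are illustrative.

```python
import math


def rdd_lower_bound(ratio, sem_var_ratio, rival_threshold, z=1.96):
    """Right-hand side shared by Eq. 1 and Eq. 3.

    ratio           : sample mean improvement divided by SEM1
    sem_var_ratio   : SEM2^2 / SEM1^2
    rival_threshold : threshold of the method compared against RCI, in
                      SEM1 units (0.5 / sqrt(1 - r_xx) for ES, 1 for SEM)
    """
    numerator = ratio - z * math.sqrt(1 + sem_var_ratio)
    return numerator / (ratio - rival_threshold)


ratio, sem_var_ratio = 2.73, 0.56  # values from the worked example

es_bound = rdd_lower_bound(ratio, sem_var_ratio, 0.5 / math.sqrt(1 - 0.85))
sem_bound = rdd_lower_bound(ratio, sem_var_ratio, 1.0)

print(round(es_bound, 2))   # prints 0.2  (Eq. 1, evaluated as in Eq. 2)
print(round(sem_bound, 2))  # prints 0.16 (Eq. 3)
```

Both bounds sit well below the reported rDD of 0.74 for this group, which is why ES and SEM each identify more individuals as improved than RCI in this example.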

Which method is most concordant with change assessed by other outcome measures?

Across mental health and substance abuse programs, 61% of inpatients and 41% of outpatients were improved based on change in their rating of global mental health. Twenty-nine percent of inpatients and 42% of outpatients were unchanged, and 10% of inpatients and 17% of outpatients were worse (Table 5). The proportions of these “globally improved” individuals who were also identified as clinically meaningfully improved, same, or worse on BASIS-24© overall scores are presented in Table 5 for each of the three methods. The SEM method identified the greatest number of globally improved individuals and showed the highest concordance with both clinically meaningful improvement and decline in the BASIS overall mental health score. For all four treatment groups, agreement between SEM-based improvement and global mental health improvement ranged from 67 to 85%. The RCI identified the smallest number of “globally improved” individuals and had the lowest levels of concordance with clinically meaningful improvement and decline in overall mental health (32 to 58%); but RCI had the highest concordance for individuals showing no change.
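Table 5

Percent of individuals reporting improved, same, or worse global mental health and clinically meaningful improvement, no change or decline in BASIS-24 overall score based on RCI, ES and SEM

| Level of care | Program type | Global mental health rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 708 | 64 | 79 | 84 |
| Inpatient | Mental health | Same | 338 | 77 | 49 | 36 |
| Inpatient | Mental health | Worse | 119 | 3 | 21 | 24 |
| Inpatient | Substance abuse/dual Dx | Improved | 144 | 66 | 82 | 85 |
| Inpatient | Substance abuse/dual Dx | Same | 64 | 55 | 38 | 27 |
| Inpatient | Substance abuse/dual Dx | Worse | 25 | 4 | 20 | 20 |
| Outpatient | Mental health | Improved | 261 | 51 | 67 | 78 |
| Outpatient | Mental health | Same | 241 | 85 | 64 | 46 |
| Outpatient | Mental health | Worse | 91 | 18 | 32 | 38 |
| Outpatient | Substance abuse/dual Dx | Improved | 85 | 44 | 56 | 74 |
| Outpatient | Substance abuse/dual Dx | Same | 119 | 83 | 72 | 45 |
| Outpatient | Substance abuse/dual Dx | Worse | 53 | 21 | 28 | 42 |

n in each group is the number of cases improved, same, or worse based on the global rating of mental health.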
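
The concordance percentages in Table 5 are within-category agreement rates: among individuals placed in a given category by the anchor measure, the share placed in the same category by a given method. A minimal sketch of that calculation follows. The function name and the labels are hypothetical and are not study data.

```python
from collections import Counter

def concordance(anchor, method):
    """Percent of cases in each anchor category that the method
    assigns to the same category."""
    totals = Counter(anchor)
    agree = Counter(a for a, m in zip(anchor, method) if a == m)
    return {cat: 100.0 * agree[cat] / totals[cat] for cat in totals}

# Hypothetical classifications for eight individuals (illustration only).
anchor = ["improved", "improved", "improved", "improved",
          "same", "same", "worse", "worse"]
method = ["improved", "improved", "improved", "same",
          "same", "improved", "worse", "same"]

result = concordance(anchor, method)
print(result)  # {'improved': 75.0, 'same': 50.0, 'worse': 50.0}
```

Because each percentage is computed within an anchor category, the three values for a method need not sum to 100.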

Replications of these analyses using the transition rating and change in GAF scores, presented in Tables 6 and 7, respectively, were generally consistent with the analysis of change based on the global mental health rating. For both of these measures, the SEM method for determining clinically meaningful change was most concordant with both the transition measure of change and with change based on the clinician-rated GAF. The RCI criterion for no meaningful change was most concordant with no change based on the transition measure and the GAF. With respect to clinically meaningful decline, there was little or no difference in concordance among the three methods of determining clinically meaningful change. However, the number of cases in the “declined groups” was very small (n < 16), except for those in inpatient or outpatient mental health programs. In these programs, the SEM method for assessing clinically meaningful decline was more concordant than either RCI or ES.
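Table 6

Percent of individuals reporting retrospective improvement, no change, or worse mental health and clinically meaningful improvement, no change, or decline in BASIS-24 overall score based on RCI, ES, and SEM

| Level of care | Program type | Retrospective rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 904 | 53 | 69 | 75 |
| Inpatient | Mental health | Same | 212 | 68 | 45 | 33 |
| Inpatient | Mental health | Worse | 49 | 2 | 24 | 27 |
| Inpatient | Substance abuse/dual Dx | Improved | 179 | 60 | 74 | 79 |
| Inpatient | Substance abuse/dual Dx | Same | 43 | 56 | 30 | 21 |
| Inpatient | Substance abuse/dual Dx | Worse | 11 | 9 | 9 | 9 |
| Outpatient | Mental health | Improved | 178 | 38 | 55 | 66 |
| Outpatient | Mental health | Same | 331 | 80 | 50 | 37 |
| Outpatient | Mental health | Worse | 84 | 7 | 18 | 23 |
| Outpatient | Substance abuse/dual Dx | Improved | 111 | 29 | 41 | 60 |
| Outpatient | Substance abuse/dual Dx | Same | 131 | 82 | 64 | 40 |
| Outpatient | Substance abuse/dual Dx | Worse | 15 | 13 | 20 | 20 |

n in each group is the number of cases improved, same, or worse based on retrospective rating of improvement.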

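Table 7

Percent of individuals with improved, same, or worse clinician rating of impairment (GAF) and clinically meaningful improvement, no change, or decline in BASIS-24 overall score based on RCI, ES, and SEM

| Level of care | Program type | GAF rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 284 | 60 | 77 | 82 |
| Inpatient | Mental health | Same | 45 | 56 | 27 | 11 |
| Inpatient | Mental health | Worse | 16 | 0 | 6 | 6 |
| Inpatient | Substance abuse/dual Dx | Improved | 30 | 60 | 77 | 80 |
| Inpatient | Substance abuse/dual Dx | Same | 10 | 70 | 50 | 30 |
| Inpatient | Substance abuse/dual Dx | Worse | 0 | 0 | 0 | 0 |
| Outpatient | Mental health | Improved | 49 | 41 | 57 | 67 |
| Outpatient | Mental health | Same | 167 | 72 | 47 | 32 |
| Outpatient | Mental health | Worse | 7 | 14 | 14 | 14 |
| Outpatient | Substance abuse/dual Dx | Improved | 32 | 28 | 41 | 63 |
| Outpatient | Substance abuse/dual Dx | Same | 71 | 85 | 63 | 42 |
| Outpatient | Substance abuse/dual Dx | Worse | 2 | 0 | 50 | 50 |

n in each group is the number of cases improved, same, or worse based on GAF.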

The criterion for improvement on the global mental health rating was any change in a positive direction. To determine whether a higher threshold for improvement in global mental health would yield different levels of concordance for SEM, ES, and RCI, the threshold for improvement was raised by considering only those who had improved by at least two rating scale points. This threshold resulted in a much lower proportion of individuals classified as improved (30% of inpatients and 12% of outpatients). However, SEM still yielded the highest rates of concordance with global improvement, ranging from 88 to 96% depending on program type and level of care (data not shown). Concordance rates for RCI remained lower than those for SEM and ES but were substantially higher than they were with the lower improvement threshold, ranging from 72 to 74% agreement.
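
The effect of raising the anchor threshold can be sketched as follows. This is an illustration under stated assumptions, not the study's analysis code: the function name and data are hypothetical, and anchor change is assumed to be coded so that positive values indicate improvement.

```python
def concordance_at_threshold(anchor_change, method_improved, min_points=1):
    """Among cases whose anchor rating improved by at least `min_points`,
    return the percent that the method also flags as improved."""
    selected = [flag for change, flag in zip(anchor_change, method_improved)
                if change >= min_points]
    return 100.0 * sum(selected) / len(selected)

# Hypothetical data: change in global rating (positive = better, assumed)
# and whether a method classified the same person as meaningfully improved.
anchor_change = [2, 1, 0, -1, 1, 3, 0, 1]
method_improved = [True, False, False, False, True, True, True, False]

any_gain = concordance_at_threshold(anchor_change, method_improved, min_points=1)
two_plus = concordance_at_threshold(anchor_change, method_improved, min_points=2)
print(any_gain, two_plus)  # 60.0 100.0
```

In this toy example, restricting the anchor to larger gains shrinks the "improved" group and raises agreement, the same direction of effect described above.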

Discussion

This study used three methods of evaluating responsiveness of the BASIS-24© mental health instrument and then evaluated the concordance of each method in identifying clinically meaningful change by using three other measures of improvement. Several major points emerge from the analysis. First, the BASIS-24© instrument showed responsiveness to change following treatment of both inpatients and outpatients in mental health and substance abuse/dual diagnosis programs. At both the aggregate and individual level, change was greater among inpatients than outpatients, although aggregate results mask the finding that a large proportion of individuals show no meaningful change, and some decline. Consistent with previous findings reported in the literature, both the SEM and ES methods identified a higher proportion of individuals as meaningfully improved than did the RCI method.7,10,17

Second, change varied as a function of both program type and symptom/problem domain. In general, more individuals showed clinically meaningful improvement in overall mental health and in the depression/functioning domain than in other domains. However, those treated in substance abuse programs were more likely to show clinically meaningful change in the substance abuse domain than those treated in mental health programs, highlighting the importance of examining outcomes in the domains most relevant to the focus of particular programs or diagnosis groups.

Third, analyses reporting concordance between each of the three methods and three other measures of change showed that the SEM identified a higher proportion of individuals as improved or declined than either ES or RCI. These results support Wyrwich and Wolinsky’s suggestion that the SEM may be a better method for determining meaningful change than ES because ES uses the standard deviation, which is sample dependent.5 In contrast, the SEM does not vary from sample to sample, providing a more stable method for determining clinically meaningful change.5,18 These results are also consistent with the suggestion that the RCI may be too stringent a criterion for determining clinically meaningful change, particularly for individuals with severe and persistent mental illness.25,38–40

Comparison of the results of this study with those reported for the original BASIS-32® in a sample of individuals with severe and persistent mental illness treated as outpatients in South Carolina, using the same RCI formula used here, shows almost identical proportions of cases with improvement on the overall mean score: 24.5% for the South Carolina sample and 23.6% for the national sample reported here, despite a longer follow-up period (3–6 months) in South Carolina than in the current study (1–3 months).36

This study has several limitations. First, although the sample is relatively large and includes individuals receiving mental health treatment at many different sites and geographic regions of the U.S., it is not necessarily representative of all mental health consumers. Second, because normative data for an untreated population are not available for BASIS-24©, it was not possible to determine the proportion of individuals whose mental health status moved into the range of a “normal” population.24,25,40 Third, although three other mental health outcome measures were used to determine which method of measuring meaningful change showed the highest concordance with change in BASIS-24©, the other measures used had limitations. Two of them were self-report measures; consequently, they were not external “anchor” measures, although these types of measures (global evaluations and transition ratings) have been used in previous research.5,16,41 The third measure (GAF) was an external measure because it is determined by the clinician. However, GAF ratings were available for only 32% of the sample. Furthermore, training of the clinicians who made the GAF ratings could not be determined, nor could the reliability or validity of the GAF ratings extracted from the medical records or administrative databases be assessed. However, a number of recent studies have reported satisfactory reliability and validity of GAF ratings obtained from these sources.42–44 In the absence of a “gold standard” measure of mental health status and functioning (which does not currently exist45,46), further research using multiple anchor-based methods would provide additional evidence regarding which method yields the best estimate of clinically meaningful change, particularly with regard to mental health outcomes for individuals with severe and persistent mental illness.
However, results of a simulation study exploring the relationship between an anchor-based approach and effect size found a near-linear relationship, suggesting that the proportion of individuals showing meaningful improvement can be directly estimated from ES, a distribution-based method.47

Another concern in evaluating clinical meaningfulness of individual change is the reliability of values obtained for the instrument. Reliability levels of 0.90 have been recommended as minimum standards for interpretation of individual-level results.48 However, this level of reliability is rarely achieved even for medical vital signs such as blood pressure.7 Recent work has suggested that this reliability level may be too stringent and that although caution should be exercised when interpreting results of assessments with less than optimal reliability, use of individual assessment data can still be valuable.7

Implications for Behavioral Health

Many mental health assessment instruments, including BASIS-24©, were developed to monitor treatment outcomes at the aggregate level for quality improvement and accountability purposes. However, routine outcomes assessment can be both costly and of questionable value to clinical treatment providers.49,50 Timely knowledge about progress and treatment outcomes for individuals can provide opportunities for clinicians to improve care for individuals, utilize individual outcome data to guide future treatment, and engage consumers in the treatment process.

Several mechanisms can be used to obtain immediate feedback of outcome data including hand-scoring templates, which plot individual progress against a benchmark, or automated scoring software/services, which provide immediate scoring and graphical printouts of mental health status and outcome scores for use by clinicians. These methods, which are widely available for psychological tests sold by commercial testing companies, are increasingly available for mental health outcome measures as well (including for BASIS-24©). However, to be used by clinicians, the reports must be perceived as valid, culturally sensitive, interpretable and user-friendly.50 Assessing clinical meaningfulness of change is one step toward providing interpretable outcome data at the individual level. This study used several methods for assessing clinically meaningful change, and results indicated that the SEM method consistently identified a somewhat higher proportion of people as clinically meaningfully improved than ES, and a substantially higher proportion than RCI. In addition, with regard to showing change, the SEM method was more concordant with three other measures of change, irrespective of the threshold established for clinically meaningful change. On the other hand, RCI estimates of no meaningful change were more concordant with other measures indicating no change; and among outpatients, more individuals showed no change than improvement on all three outcome measures. This finding was counterbalanced by the small percentage of people showing declines in mental health status, resulting in statistically significant improvement in aggregate scores, despite the fact that more outpatients showed no change than showed improvement. 
Use of the RCI may be more appropriate for populations on whom the measure was first developed and reported; that is, outpatient psychotherapy clients, who are likely to be less impaired than individuals receiving more intensive and long-term services such as those treated in public mental health systems. In contrast, the SEM criterion is widely used in assessing clinically meaningful improvement among individuals with chronic medical conditions,7,30 populations which may be more similar in some ways to those with long-standing behavioral health conditions.

Acknowledgment

The authors thank Colleen McHorney, Ph.D., for suggesting use of the SEM to assess clinically meaningful change, and Joel Reisman for his helpful comments on an earlier version of this manuscript.

This research was supported by grant R01 MH58240 from the National Institute of Mental Health and by the Veterans Affairs Health Services Research & Development program.

Copyright information

© National Council for Community Behavioral Healthcare 2007