Introduction

Low back pain remains one of medicine’s most enigmatic problems, particularly in its chronic form. The majority of people will experience back pain at some point in their lives [1], but only a minority will receive a positive diagnosis [2, 3]. Most will receive what are often perceived as vague diagnoses, such as “mechanical back pain” or “non-specific low back pain”. These apparently benign labels, however, hide a health problem that costs society more than cancer, coronary artery disease and AIDS combined [4]. The cost of this 'epidemic' [5] has spurred much debate amongst researchers, therapists and clinicians alike, with a common aim of defining effective treatment regimes. Bio-medical approaches have tended to be replaced by more ‘holistic’ bio-psychosocial approaches [6] and, indeed, some have argued that it is best not to view chronic back pain as a ‘medical’ problem at all [7]. There is certainly no one therapeutic ‘silver bullet’; consequently, back pain is increasingly being treated as a multi-disciplinary problem.

Rehabilitation programmes tend to be built around the proven effectiveness of cognitive behavioural approaches [8–10] and exercise [11, 12], and they are typically delivered by multi-disciplinary teams. A review by Guzman et al. [13] showed the effectiveness of multi-disciplinary programmes, but a Cochrane Review [14] cautioned that many studies had methodological shortcomings. Some studies have also questioned the cost-effectiveness of such programmes [15]. Whilst the content of multi-disciplinary programmes can vary [16], the evidence base is sufficient for them to be considered the preferred choice for chronic non-specific back pain within recent clinical guidelines [17, 18].

One such multi-disciplinary programme is the ‘Nottingham Back Team’, which is a rehabilitation programme utilising a team of staff with nursing, physiotherapy, occupational therapy and clinical psychology skills. The team takes a cognitive behavioural approach based on a programme of education, goal setting, support and exercise. This paper describes an analysis of SF-36 data gathered within the programme immediately after its inception. The primary aim of this analysis was to evaluate the impact of the programme on participants' health ‘profile’ and the degree to which improvements were maintained at the 6-month follow-up. An additional objective was to identify any health differences between those completing the programme and those who either failed to attend or dropped out after 2–3 weeks (which was relatively common). Following the approach described by Ferguson et al. [19], the Reliable Change Index (RCI) was used to gauge ‘clinical significance’ and to enable an analysis of individual scores in addition to group comparisons.

Materials and methods

Patients and methods

Patients referred for an assessment to the Back Team programme completed a questionnaire battery (including the SF-36) prior to a physical examination and face-to-face consultation. The questionnaire battery was also administered at programme completion (at the end of the final session) and at the 6-month follow-up. Consent for use of the questionnaire data was obtained at the assessment visit.

The Nottingham Back Team was set up in 2000 in response to the Clinical Standards Advisory Group Report [20] and lengthy local waiting times for consultation at the spinal unit [21]. The programme is of fixed duration, consisting of seven 3-h sessions (morning or afternoon) on consecutive weeks. The sessions are held at local leisure centres within the community rather than being hospital-based. Each session has four key components:

  • group education, covering a different topic each week (e.g. types of medication)

  • group exercise—this varies from week to week, but with a core set of stretches/exercises on which participants track their progress (individuals are also provided with a tailored exercise programme to perform at home)

  • relaxation training—after the exercise session, participants are taught a relaxation technique each week

  • key worker—each participant has a designated key worker for the duration of the programme. On an individual basis, participants discuss their progress during the previous week and define goals for the following week.

In addition to the group sessions, participants have access (if required) to specialist support in relation to medication issues, cognitive behavioural therapy, vocational guidance and psychological assessment. Whilst the programme is of fixed duration, patients are offered an ‘open appointment’ for 1 year should they need additional support or advice following the programme.

The Short Form (SF)-36 Health Survey [22] is one of the most widely used health and quality of life measures. Its 36 items measure eight multi-item variables: physical functioning (PF); role limitation due to physical problems (RP); bodily pain (BP); general health (GH); vitality (VT); social functioning (SF); role limitation due to emotional problems (RE); mental health (MH). Version 2 of the SF-36 has been designed to facilitate norm-based scoring for all scales [23]. Raw scores are first transformed into 0–100 scores before being further transformed into T-scores where the mean is fixed as 50 (the population norm) and the standard deviation is 10.
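As a minimal sketch of the norm-based transformation described above, the following Python fragment converts a 0–100 scale score into a T-score with a mean of 50 and a standard deviation of 10. The function name and the norm mean and standard deviation used in the example are illustrative assumptions; they are not values taken from the SF-36 manual or from the UK norm data.

```python
def to_t_score(score_0_100, norm_mean, norm_sd):
    """Convert a 0-100 scale score to a norm-based T-score (mean 50, SD 10)."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    # Standardise against the population norm, then rescale.
    z = (score_0_100 - norm_mean) / norm_sd
    return 50.0 + 10.0 * z


# Hypothetical norm values, for illustration only.
example = to_t_score(score_0_100=60.0, norm_mean=80.0, norm_sd=20.0)
print(example)  # 40.0, i.e. one standard deviation below the norm
```

On this scale, a score of 40 lies one standard deviation below the population norm, whichever SF-36 scale is being considered.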

Statistical analysis

The SF-36 data were analysed (1) in terms of statistical significance (change within and between groups); (2) with reference to the SF-36 scoring manual [23], which provides tables of sample sizes needed to detect particular differences over time; (3) in relation to UK norms (T-scores); and (4) using RCI scores to assess ‘clinical significance’ (meaningful individual change).

The majority of reported studies using SF-36 describe analyses with parametric statistics [24]. It has been argued that non-parametric methods should be used [25] as the distributions tend to be skewed, but in reality, the use of parametric tests is unlikely to produce misleading results provided the samples are not small [24, 26]. When distributions are skewed, however, non-parametric methods can provide greater statistical power than parametric equivalents [27–29]. For this reason, group differences were analysed using Wilcoxon signed ranks (changes over time) and Mann–Whitney U (between-group differences) tests.
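To illustrate what the two rank-based tests measure, the sketch below computes the Wilcoxon signed-rank sums for paired scores and the Mann–Whitney U statistics for two independent groups using only the Python standard library. It returns test statistics only; P values would normally come from a statistics package (for example, `scipy.stats.wilcoxon` and `scipy.stats.mannwhitneyu`). The function names and all scores are hypothetical, and this is not the software used in the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus) for paired scores; zero differences are dropped."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent groups."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    return u_a, n_a * n_b - u_a


# Hypothetical T-scores, for illustration only.
paired = wilcoxon_signed_rank(before=[30, 35, 40, 28], after=[38, 34, 47, 33])
groups = mann_whitney_u([42, 45, 50], [35, 38, 44])
print(paired)  # (9.0, 1.0)
print(groups)  # (8.0, 1.0)
```

In both tests, a large imbalance between the two statistics indicates that scores tend to be higher at one time point, or in one group, than in the other.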

A number of methods are available to assess statistically significant individual change [30] and clinical significance/meaningful change [31]. Ferguson et al. [19] describe the use of RCI with SF-36 data, and the approach they used has been adopted within this study, but using relevant UK norm data [32, 33]. Clinical significance is taken as a change exceeding the RCI value that also results in the final score falling within the ‘normal range’ [19]. The cut-off defining ‘normal range’ can be taken as one, one and a half or two standard deviations from the norm [19]. The choice of cut-off will depend on the situation in question, but much of the literature on the RCI uses two standard deviations for studies in mental health [34].
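The two-part criterion described above can be sketched as follows, using the reliable change formulation of Jacobson and Truax, in which the standard error of a difference score is derived from the scale's standard deviation and reliability. The reliability coefficient, the example scores and the function names are assumptions made for illustration; the values actually used in this study were calculated from the UK norm data [32, 33], and the exact procedure of Ferguson et al. [19] may differ in detail. Because higher SF-36 scores indicate better health, the ‘normal range’ is treated here as having a lower bound only.

```python
import math


def rci_threshold(norm_sd, reliability, z=1.96):
    """Smallest score change treated as reliable (Jacobson-Truax form)."""
    sem = norm_sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    s_diff = math.sqrt(2.0) * sem                 # standard error of a difference score
    return z * s_diff


def is_clinically_significant(before, after, threshold,
                              norm_mean=50.0, norm_sd=10.0, cutoff_sds=1.0):
    """True if the improvement exceeds the threshold and ends in the 'normal range'."""
    reliable = (after - before) > threshold
    in_normal_range = after >= norm_mean - cutoff_sds * norm_sd
    return reliable and in_normal_range


# Hypothetical reliability of 0.90 on a T-score scale (SD 10).
threshold = rci_threshold(norm_sd=10.0, reliability=0.90)
print(round(threshold, 2))                               # 8.77
print(is_clinically_significant(32.0, 45.0, threshold))  # True
print(is_clinically_significant(25.0, 36.0, threshold))  # False: still below the cut-off
```

The `cutoff_sds` parameter corresponds to the choice between one, one and a half or two standard deviations from the norm.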

Results

A total of 261 patients provided usable SF-36 questionnaires. Of these patients, 104 went on to complete the programme, whereas 112 failed to attend at all or failed to complete the programme. The remainder of the patients were unsuitable for the programme; this unsuitability may have been due to the nature of their condition (requiring alternative treatments), their level of disability (precluding participation in exercise elements) or ‘logistical’ problems, such as an inability to organise child care. Of the 104 patients completing the programme, 73 provided follow-up data at 6 months.

The ‘health profile’ of those considered suitable for the programme is illustrated in Fig. 1. The relationship to UK norms is illustrated as this figure uses data transformed to T-scores with a mean of 50 (equivalent to UK norm) and a standard deviation of 10.

Fig. 1

The Short Form (SF)-36 health profile at the assessment of all patients suitable for the programme

The male:female ratio at assessment was 1:1. The age distribution of the patient sample was similar between genders, with female participants having a mean age of 43.8 years (SD 13.83) and male participants having a mean age of 44.5 years (SD 13.92). The even gender split was maintained at programme completion, but not at follow-up. At follow-up, the sample was 57% female.

Comparison of patients completing the programme with those who declined/dropped out

When comparing completers and non-completers, significant differences were found in each of the eight SF-36 scales (Table 1).

Table 1 Comparison of completers and non-completers (T-scores)

Changes in those completing the programme

For those completing the programme, significant improvements were evident on each scale (all P < 0.001). The SF-36 manual provides tables of sample sizes required to identify given changes in scores. Based on these tables, the sample sizes in our study were clearly large enough, given the size of changes observed. The changes in Physical Functioning and Role Physical scores exceeded RCI values. The changes are summarised in Table 2.

Table 2 Improvement in SF-36 scores from assessment to programme completion (T-scores)

Figure 2 illustrates the change in health profile as a result of the programme; again, T-scores are used to facilitate comparison with UK norms.

Fig. 2

Improvement in the health profile of the patients from assessment (black bars) to programme completion (grey bars) (T-scores)

For those attending the 6-month follow-up, no significant group differences were evident, for any scale, between scores at programme completion and scores at follow-up.

Gender differences were evident at assessment, programme completion and follow-up. At assessment, there was a difference in Role Physical (P < 0.05); at programme completion, there were differences in Bodily Pain (P < 0.05) and Social Functioning (P < 0.05); at follow-up, there were differences in Bodily Pain (P < 0.01), Role Physical (P < 0.05), Social Functioning (P < 0.05) and Role Emotional (P < 0.05). In each case, the female scores exceeded the male scores.

Individual level

Table 3 shows the proportion of participants who completed the programme who showed ‘clinically significant’ improvements (i.e. exceeding RCI values) in each of the SF-36 scales.

Table 3 Proportion of participants showing clinically significant improvement from assessment to programme completion, by gender

Table 4 provides an overview of clinically significant changes for those participants attending the 6-month follow-up appointment in terms of three time periods—assessment to programme completion, programme completion to the 6-month follow-up and assessment to the 6-month follow-up. For the period programme completion to 6-month follow-up, there were clinically significant declines as well as improvements. There were no clinically significant declines for the other two time periods.

Table 4 Proportion of participants showing clinically significant changes

Table 5 describes the extent of clinically significant, concurrent improvement in the SF-36 scales. It describes the proportion of those participants showing a clinically significant improvement between assessment and the 6-month follow-up in each of the SF-36 scales who also showed a clinically significant improvement in each of the other seven scales over the same period.

Table 5 Proportion of those patients showing a clinically significant (CS) improvement in each scale who also showed a concurrent improvement in other scales

Table 6 describes the mean scores at assessment for each of the eight SF-36 scales for the whole sample and for three important sub-groups. These sub-groups comprise those patients who, by the 6-month follow-up, were found to have shown:

  • a significant reduction in disability (physical functioning)

  • a significant improvement in perceived functional limitation (role physical)

  • a significant reduction in pain and related limitations (bodily pain).

Table 6 Comparison of mean scores at assessment (T-scores)

This table illustrates differences between the whole sample (all those considered suitable for the programme) and those who showed clinically significant improvement between assessment and follow-up in physical functioning (CS-PF), role physical (CS-RP) and bodily pain (CS-BP). None of the differences was statistically significant, however.

In terms of those who attended follow-up, only one participant showed no clinically significant change on any scale over any of the time periods involved in the study.

Of the follow-up group, 93% of participants showed clinically significant improvement between assessment and follow-up on at least one scale.

Discussion

This study had three aims: to evaluate the impact of a rehabilitation programme on the health profile of patients who had enrolled in it, to assess the degree to which any improvements were maintained over time, and to explore differences between those patients who completed the course and those who chose not to. The study utilised both individual and group scores in its analysis of an SF-36 data set. Whilst this is relatively unusual, both approaches are required in interpreting clinical significance in quality-of-life data [35]. Just as taking different statistical approaches to data analysis can alter the emphasis of results [36], using individual assessment alongside more familiar group comparisons allows a different perspective to be taken on the data. Individual assessment criteria also allow for the definition of sub-groups, which may prove useful in clinical practice by facilitating more tailored treatment regimes [37].

A number of methods are available for evaluating significant change at an individual level. Hays et al. [30] describe the use of the standard error of measurement and the standard error of prediction alongside the RCI. These researchers emphasise the point that whichever measure is used, changes required to be significant at an individual level are much greater than those required at a group level. In fact, if samples are large enough, quite small group-level changes may be statistically significant, but essentially meaningless in clinical terms. Different techniques will provide different results [31]; however, there does appear to be a reasonable comparability between approaches [38, 39] that is supportive of the construct as a whole. Although no single approach is likely to be uniformly accepted [40] and ‘blindly’ following a single threshold value could be misleading [41], the emerging evidence is supportive of the view that ‘clinical significance’ is an important element in the evaluation of treatment effectiveness.

Using Reliable Change as an evaluation criterion was first suggested over 20 years ago [42], and the technique was described in detail by Jacobson and Truax in 1991 [43]. Since then it has been increasingly used in fields such as psychotherapy, but it is much less common in medical outcomes research [19]. The concept of defining a minimum criterion for ‘clinically significant’ change is not new in disability or rehabilitation research, however. For example, it is widely accepted that a change of less than 2.5 points on the Roland–Morris Disability Questionnaire should not be seen as ‘clinically significant’ [44, 45].

Ferguson et al. [19] describe the application of RCI to SF-36 outcomes, and the approach they defined has been taken in this study, but with RCI criteria calculated using relevant UK data from the Health Services Research Unit at the University of Oxford [32, 33]. Whilst any particular defined change value can be said to be somewhat arbitrary, the RCI is a conservative measure [35]. The SF-36 is itself a robust tool [22], and its scales are associated with low back pain outcome measures [46]. Hence, the combination of the two should prove effective even if it may tend to underestimate the impact of treatment.

Both the assessment and post-programme groups showed an even gender split. Previous studies have shown differences in pain perception and associated behaviours between genders [47, 48] which have been explained in terms of psychological, physiological and socio-cultural factors. Similar differences have been described for treatment outcomes [49, 50]. In this study, gender differences were identified in four SF-36 scales—Role Physical, Bodily Pain, Social Functioning and Role Emotional. In each case, male participants scored lower than their female counterparts. This is the opposite of most previous research where females tend to report higher pain levels [51]. The reasons for this anomaly are unclear, but it appears that female patients respond well to this programme, with proportionately more females showing clinically significant improvement in physical functioning and bodily pain.

Predictably, given the health profile of those assessed and deemed suitable for the Back Team programme, the lowest scores were found for Physical Functioning, Role Physical and Bodily Pain, with each of these scales falling more than one and a half standard deviations from the norm. The only other scale falling further than one standard deviation from the norm was Social Functioning. The scale falling closest to the norm was Mental Health, which was within half a standard deviation, despite the fact that low back pain is often associated with psychological factors [52–54]. Indeed, the mean mental health score at programme completion was very slightly above the UK norm. An ongoing research project will be looking at these issues.

When we compared the assessment scores of participants who subsequently completed the programme with those of participants who failed to complete it, we found statistically significant differences on all scales. The differences were relatively small, however, compared to the size of change in response to the programme. The difference on the Role Physical scale was found to exceed the RCI value. It would appear that failure to engage with the programme may be due to a belief amongst potential participants that they are too disabled by their condition to undertake a programme of this type. This could reflect strong ‘organic’ pain beliefs [55], where the patient strongly believes ‘hurt = harm’ (a ‘biomedical’ perspective). Patients with strong organic pain beliefs tend to have an external locus of control and are more likely to expect the medical practitioner to ‘cure’ them rather than take responsibility for themselves [55]. The nature of pain beliefs within this population is being examined within further research with the programme.

Equally, failure to engage may reflect a lack of understanding of what the programme will entail, and some similar programmes have successfully utilised a pre-programme session to deal with this. Unfortunately, resource constraints have prevented the team from gathering data on the reasons for failure to engage with the programme. More recently, the team has changed its policy such that places are not routinely allocated to those assessed as suitable for the programme; instead, patients are asked to explicitly ‘opt in’ to a particular group. This new approach has led to fuller groups, improved completion rates and reduced waiting times, but it has not addressed the underlying issue of why some individuals fail to engage.

For those completing the programme, statistically significant improvements were found on all scales. Based on the SF-36 scoring manual, the sample size exceeded the ‘numbers needed’ for the magnitude of change. Only two scales exceeded RCI criteria—Physical Functioning and Role Physical. At a group level, this programme can be said to provide clinically significant improvements in disability for those completing the programme.

An analysis of the data from those attending the 6-month follow-up provided an understanding of change across three important time frames:

  • the duration of the programme: the difference between the assessment and programme completion scores

  • change subsequent to the programme: the difference between the programme completion score and the 6-month follow-up score

  • overall, long-term change: the difference between the initial assessment scores and those at follow-up.

For six of the eight scales, more participants showed a decline between programme completion and follow-up than an improvement, the exceptions being for Vitality and Role Emotional. The largest proportional decline between programme completion and follow-up occurred in the scales which had shown the largest numbers improving from assessment to programme completion—Physical Functioning and Role Physical. These figures indicate that change can take place in a number of ways:

  • initial improvement, maintained at follow-up (‘early takers’)

  • initial improvement, lost at follow-up (‘need support’)

  • no initial improvement, but improvement by follow-up (‘slow burners’)

  • no improvement at any stage (‘non-takers’).
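The four patterns listed above can be expressed as a simple classification on two flags per participant and scale. The sketch below assumes that the first flag records a clinically significant improvement from assessment to programme completion and the second records one from assessment to the 6-month follow-up; this mapping, the function name and the example data are illustrative assumptions rather than the procedure used in the study.

```python
from collections import Counter


def classify_trajectory(improved_at_completion, improved_at_follow_up):
    """Label a participant's pattern of change on one SF-36 scale."""
    if improved_at_completion and improved_at_follow_up:
        return "early taker"   # initial improvement, maintained at follow-up
    if improved_at_completion:
        return "need support"  # initial improvement, lost at follow-up
    if improved_at_follow_up:
        return "slow burner"   # improvement only evident by follow-up
    return "non-taker"         # no improvement at any stage


# Hypothetical flags for five participants: (completion, follow-up).
flags = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(classify_trajectory(c, f) for c, f in flags)
print(counts["early taker"], counts["need support"])  # 2 1
```

Tallying the labels in this way for each scale gives the size of each sub-group, which is the information needed when considering how follow-up support might be targeted.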

These results have important implications when attempting to assess the true impact of a programme as those showing improvement at programme completion may not maintain that improvement once the support of the programme is removed. Equally, for the ‘slow burners’, a rehabilitation programme may facilitate a major change in their life (the beginnings of a recovery) without showing a significant improvement during the course of the programme itself. When the cost benefits of this type of programme are being considered, consideration must be given to the degree to which gains during the programme may be lost following its completion and to the optimal nature of the follow-up and support, which may need to be better tailored to the needs of particular patient groups (particularly the ‘need support’ group).

Analysing individual change also illustrates that improvements are not seen uniformly across scales. In terms of ‘long-term’ change (assessment to follow-up), the scale on which the greatest number of participants improved (70%) was Role Physical. This could be said to reflect the cognitive behavioural aspect of the course whereby participants learn that they can manage their condition and that it need not control or restrict their lives. Almost a quarter of those showing an improvement in Role Physical did not show a corresponding improvement in Physical Functioning, however, suggesting that improvement in ‘perceived’ disability can be made without a real change in the limitation imposed by their condition. This result supports the view that disability is driven by psychological as well as physical factors [56, 57].

Whilst the figure of 70% is the highest score for an individual scale, this could be said to under-estimate the impact of the programme as 93% of those attending the 6-month follow-up showed an improvement in at least one scale (assessment to follow-up). For the overwhelming majority of the participants continuing the programme through to follow-up, the programme can be said to have had a genuine positive effect. Whilst some observed improvements at programme completion were subsequently lost by follow-up, only one participant failed to show a clinically significant change on any scale over any time period. Although some of the improvement may be due more to the social nature of the group programme than to its content, the benefits in quality of life are nevertheless important and ‘real’. Previous work by the lead author [58] has shown the impact of social isolation amongst chronic back pain sufferers and the benefits perceived within a group setting.

Over a third of those showing clinically significant change in the scales Physical Functioning and Role Physical do not show a corresponding improvement in Bodily Pain scores. This suggests that disability can improve without a reduction in pain levels and supports previous research showing the lack of a uniform relationship between pain levels and disability [59].

The individual scores can also be used to allow sub-groups to be defined. By splitting our patient sample into groups based on whether they had shown a clinically significant change on a particular variable, we sought clues for some predisposition to improvement. When we considered the mean scores at assessment for each SF-36 scale for the whole sample and for those who had shown clinically significant improvement in Physical Functioning, Role Physical and Bodily Pain, we found little difference, with the greatest differences being on the General Health scale. Even for this scale, however, there were no statistically significant differences between those who subsequently showed clinically significant improvement and those who did not. Beliefs in relation to ‘general health’ and ‘perceived disability’ may be significant in programme completion, however, and this area warrants further research.

Conclusions

This study illustrates the benefits to be gained by analysing individual change in addition to group changes when evaluating the impact of a rehabilitation programme. From both perspectives, the programme assessed in this paper had its greatest impact on Physical Functioning and Role Physical, which are the two SF-36 scales most closely related to disability. The improvement was not a simple linear change, however, as some participants showed a loss of improvement at follow-up whilst others showed an improvement at follow-up without one being evident over the duration of the programme itself.

Equally, the results of this study illustrate that an improvement in one scale is not necessarily associated with improvements on other scales: we found that an improvement in pain was associated with an improvement in disability for the vast majority of cases, but the reverse was not true. This study therefore supports the view that improving an individual’s pain score is not a pre-requisite to addressing disability levels.

The programme had widespread impact, and the vast majority of participants at follow-up had shown clinically significant improvement in at least one scale between assessment and follow-up. Whilst this is not a controlled experimental study, the data suggest that the Back Team Programme represents an effective intervention for those with chronic low back pain.

Very few participants have been unaffected by the programme, so efforts to improve its effectiveness should concentrate on addressing decline following the programme and in ensuring improvement over a wider spectrum of health factors. Of greatest concern, however, is the proportion of those suitable for the programme who decline the opportunity. Further research should explore the factors affecting this decision and other differences between completers and non-completers with a goal of improving programme uptake.