Background

One of the main challenges for drug evaluation in rare diseases is the heterogeneous course of these diseases. When a disease course differs from patient to patient, traditional outcome measures may not be applicable for all patients of a certain disease. Trial designs are often limited to patients for whom the outcome measure is relevant, whereas the underlying disease mechanism may be similar in a larger group. This increases the problem of small numbers that already challenges rare disease research.

For example, in Duchenne muscular dystrophy (DMD), new drug trials until recently often used the 6-minute Walk Test (6MWT) as an outcome measure. The 6MWT has been validated as a reliable and feasible outcome measure, and has been recommended as the primary outcome measure in ambulatory DMD patients [1, 2]. However, although the 6MWT may be a relevant outcome measure for boys who are not (yet) dependent on a wheelchair, it is obviously irrelevant for the, usually somewhat older, boys who are. This problem in DMD research has been picked up by patient representatives and researchers from all over the world [3].

As the DMD example shows, existing measurement instruments use an outcome that is not relevant for all patients, or may not be responsive enough to measure the effect of an intervention in a rare disease. However, the development of disease-specific and patient-relevant outcome measures is hampered by the small number and heterogeneity of patients with a particular rare disease. In their handbook “Measurement in Medicine” De Vet et al. [4] recommend a minimum number of 50 patients for validation studies.

A measurement instrument that can evaluate the effect of an intervention on an individual basis may help overcome the problem of small, heterogeneous populations. The importance of patient reported outcome measures is widely recognized by pharmaceutical companies and clinical researchers as well as regulators and government agencies such as FDA and NIH [5].

Goal Attainment Scaling (GAS) is a measurement instrument that is intended for individual evaluation of an intervention. It allows patients to set individual goals, together with their treating professional. The number of goals and the content of these goals may differ per patient, but the attainment of the goals is measured in a standardized way. This makes a standardized evaluation of an intervention possible, even when the patients are all in a different stage of their disease.

Goal Attainment Scaling was first introduced in 1968, by Kiresuk and Sherman [6], originally for the evaluation of mental health services. It contains a variable number of self-defined goals and very explicit descriptions of five possible levels of goal attainment that are formulated before the intervention, usually in consultation between the patient and the clinician. In the original definition, the levels are quantified on a 5-point scale that ranges from −2 to +2, where −2 = the most unfavorable treatment outcome thought likely, −1 = less than expected level of treatment success, 0 = expected level of treatment success, +1 = more than expected success with treatment, and +2 = best conceivable success with treatment. For each goal, the expected level of treatment success and at least two other levels need to be described in such a specific way that an independent observer can assess the outcome.

There is no maximum number of goals that can be set. Each goal can be assigned a weight, according to its importance to patient and/or clinician. From the scores reached after the intervention, a composite goal attainment score is computed using the following formula:

$$ T = 50 + \frac{10 \sum w_i x_i}{\sqrt{\left(1-\rho\right) \sum w_i^2 + \rho \left(\sum w_i\right)^2}} $$

where T is the composite score, w_i is the weight assigned to goal i, x_i is the original score for goal i ranging from −2 to +2, and ρ is the estimated correlation between goal scores. According to Kiresuk and Sherman, it is safe to assume that the correlation between the goal scores is constant, and can be set at 0.3. The T-score has a mean of 50 and a standard deviation of 10, under the assumptions as proposed by Kiresuk and Sherman [6].
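As an illustration of the formula above, the composite score can be computed as in the following minimal sketch. The function name `gas_t_score`, the default of equal weights when none are supplied, and the input checks are assumptions made for this example only; they are not part of the original definition.

```python
import math


def gas_t_score(scores, weights=None, rho=0.3):
    """Composite GAS T-score from goal scores x_i and weights w_i.

    scores  -- attainment level per goal, each between -2 and +2
    weights -- weight per goal (assumption: all 1.0 when omitted)
    rho     -- assumed constant correlation between goal scores
    """
    if weights is None:
        weights = [1.0] * len(scores)
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    if any(x < -2 or x > 2 for x in scores):
        raise ValueError("attainment levels must lie between -2 and +2")

    weighted_sum = sum(w * x for w, x in zip(weights, scores))
    sum_of_squared_weights = sum(w * w for w in weights)
    squared_sum_of_weights = sum(weights) ** 2
    denominator = math.sqrt((1 - rho) * sum_of_squared_weights
                            + rho * squared_sum_of_weights)
    return 50 + 10 * weighted_sum / denominator


# Hypothetical patient with three equally weighted goals.
print(gas_t_score([0, 0, 0]))   # every goal at the expected level
print(gas_t_score([1, 1, 1]))   # every goal one level above expectation
```

When every goal is attained at the expected level (all scores 0) the numerator vanishes and T equals 50, whatever the number of goals or their weights.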

Besides mental health and non-medical fields such as education and social service applications [7], GAS is reportedly used in a few specific medical research areas, such as rehabilitation [8–12] and geriatrics [13–15]. However, the validity of GAS as a measurement instrument in drug studies has never been systematically reviewed. To evaluate the usefulness of GAS in drug studies, we formulated the following three research questions:

  1. Has Goal Attainment Scaling been used as a measurement instrument in drug studies?

  2. What (drug) interventions were evaluated by studies using GAS?

  3. What is known of the validity, responsiveness and inter- and intra-rater reliability of Goal Attainment Scaling in general, and in particular in drug trials?

In this study, we follow the COSMIN guidelines, which are the generally used and accepted standards for the evaluation of measurement properties [16]. The COSMIN checklist contains standards for evaluating the methodological quality of studies on the measurement properties of health measurement instruments. According to the COSMIN guidelines, a health status measurement instrument can be used when its validity, reliability and responsiveness have been tested and considered adequate. We considered GAS useful when the validity, reliability and responsiveness have been described, tested and found acceptable according to these guidelines.

Methods

We conducted a systematic review, according to the PRISMA guidelines [17].

We set up a sensitive search in Medline, PsycINFO and Embase. We searched for literature from 1968, the year when GAS was introduced by Kiresuk and Sherman [6], to May 1st, 2015. For the full search strategy, see Additional file 1. Reference lists of relevant review articles were screened for additional papers.

Papers were included in which:

  1. Goal Attainment Scaling met the following criteria:

    • One or more individual goals were established by the patient or by one or more researchers or practitioners, either with or without input of the patient, prior to the intervention. The goals did not have to be devised by the patient/researcher, as long as the goals were individually chosen per patient.

    • The scale had to consist of at least three points (i.e. more than just goal attained – goal not attained). At least 2 points on the scale were described precisely and objectively, so that an independent observer would be able to determine whether the patient performs above or below that point.

  2. The study was either a trial in which drugs are evaluated, or a study of any design in which psychometric properties of GAS were evaluated.

  3. The outcome measure was the attainment of goals that had been established before the onset of the intervention.

  4. The goals had been set up individually, i.e. per patient.

Excluded were:

  1. Trials using an outcome measure called Goal Attainment Scaling, when the outcome measure did not meet our definition of GAS.

  2. Studies in which goal setting was used as an intervention rather than outcome measurement.

  3. Reviews or narratives.

  4. Conference abstracts.

  5. Papers published in languages other than English, French, Dutch, German or Spanish.

  6. Papers published before 1968.

The selection of articles and the data-extraction were performed by pairs of independent reviewers. Disagreements were discussed until consensus was reached; if necessary a third reviewer acted as a referee. A standardized data-extraction form was used (see Additional file 2). We divided the included studies into two categories, i.e. drug studies, and non-drug studies in which the measurement properties of GAS were investigated.

We extracted information about the following measurement properties, defined according to the COSMIN guidelines [18]: Inter-rater reliability, intra-rater reliability, face validity, content validity, construct validity, and responsiveness. For the full definitions of the measurement properties, see Table 1. We used the quality criteria as proposed by Terwee et al. [19] to evaluate the measurement properties, as also displayed in Table 1. We chose to limit the evaluation of the quality of the measurement properties to the criteria as proposed by Terwee et al., instead of using the full COSMIN guidelines, because the COSMIN guidelines are very detailed, and many details are not relevant as these aspects cannot be evaluated for GAS, e.g. internal consistency, measurement error, criterion validity.

Table 1 COSMIN definitions [49] of the evaluated measurement properties, and their quality criteria [19]

Results

The search yielded 3007, 1413, and 1039 abstracts from Medline, Embase and PsycINFO, respectively. After eliminating duplicates, a total of 3818 abstracts remained for screening. In the screening phase, we excluded 3511 articles based on title and abstract, and 249 articles based on the full text. Data-extraction was executed for the remaining 58 articles (see Fig. 1). Of these 58 articles, 38 articles described drug studies in which GAS was used as an outcome measure, and 20 articles described measurement properties of GAS in other settings (Fig. 2).
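The selection flow described above can be checked with simple arithmetic. The short sketch below uses only the counts reported in this paragraph; the total number of hits and the number of duplicates removed are not stated in the text and are derived here from those counts. All variable names are chosen for this illustration.

```python
# Counts as reported in the text.
hits = {"Medline": 3007, "Embase": 1413, "PsycINFO": 1039}
after_deduplication = 3818
excluded_on_title_abstract = 3511
excluded_on_full_text = 249
drug_studies = 38
non_drug_studies = 20

# Derived quantities (not stated explicitly in the text).
total_hits = sum(hits.values())
duplicates_removed = total_hits - after_deduplication
included = (after_deduplication
            - excluded_on_title_abstract
            - excluded_on_full_text)

print(f"total hits: {total_hits}")
print(f"duplicates removed: {duplicates_removed}")
print(f"included articles: {included}")

# The two categories should add up to the number of included articles.
assert included == drug_studies + non_drug_studies
```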

Fig. 1 The number of articles included and excluded in the systematic review

Fig. 2 Venn-diagram depicting the number of studies in the categories drug-studies and methodology studies, and the number of studies in both categories

In Table 2 the characteristics of the articles are presented. Most studies are trials in patients with cerebral palsy or patients with spasticity due to other causes, such as acquired brain trauma or stroke (28 studies). Also, many studies focussed on the geriatric population (15 studies). There were also some studies on autism (three studies), or neurological disorders such as MS (two studies). The remaining studies covered research areas such as family problems, goal setting in adolescent students or behaviour and psychiatric problems.

Table 2 Reported Patients, Interventions, Comparisons and Outcomes in the included studies

Most drug studies evaluated an intervention with botulinum toxin (25 studies), mainly in patients with cerebral palsy and spasticity. Baclofen was also evaluated in children with spasticity (three studies). Other drugs that were evaluated were galantamine (three studies), donepezil for Alzheimer’s Disease (two studies), fluvoxamine, trihexyphenidyl, memantine, a phenol nerve block, and linopirdine (one study each).

An overview of the reported measurement properties of GAS in the 38 drug studies and the 20 non-drug studies is presented in Tables 3 and 4, respectively.

Table 3 Reported measurement properties of GAS in included drug studies
Table 4 Reported measurement properties of GAS in included validity studies

Face validity

As is shown in Tables 3 and 4, face validity is reported in one article [20]. This is a drug study that evaluated the use of fluvoxamine in patients who met the criteria for panic disorder with moderate to severe agoraphobia. GAS was used as a primary outcome measure. Both therapists and independent raters, who assessed the level of goal attainment after the intervention, were asked to rate the relevance of the chosen goals on a scale of 1 to 5 (with one meaning irrelevant and five meaning very relevant). Therapists only rated the GAS scores of patients whom they had not treated themselves. The mean score of the therapists was 4.68 (SD = .51), and the mean score of the independent raters was 4.66 (SD = .52). The researchers concluded that these numbers show that ‘the goal areas were suitably chosen’. The target population of GAS (the patients) was not involved in this evaluation, although such involvement is one of the requirements of the quality criteria that we use. However, it is inherent in the measurement instrument that the patient is involved in the choice of the items. Therefore, we score the quality of the face validity evaluation as ‘good quality’.

Content validity

Content validity was reported in five studies, of which one was a drug study. Content validity was measured in several ways, as shown in Table 5: by rating the usefulness or importance of the goals [21, 22], by comparing the goal areas with essential components as recommended by position papers in the specific field [23], and by checking whether the goals were formulated according to the criteria ‘Specific, Measurable, Assignable, Realistic, and Time-related’ (SMART) [24, 25]. In one study, the content validity was reportedly tested by grouping the goals into major categories and analyzing the content of these categories, but the results of this categorization were not reported [26]. The quality of the content validity evaluation was ‘good quality’ in two studies, ‘intermediate quality’ in two studies and ‘poor quality’ in one study. Authors reported a ‘good overall usefulness’ of the goals [22], stated that all recommended areas were represented in the goals [23], assessed whether goals were set according to the SMART principle [24, 25], or reported that more than 70 % of the responders rated GAS as a 4 or 5 on a 5-point scale for clinical relevance and importance [21]. In the SMART assessment, it was concluded that, even after a refinement process of the goal statements, the quality of the goal statements still differed between the sites [24, 25].

Table 5 Reported content validity of GAS in included studies

Construct validity

Construct validity was reported in 18 studies, of which six were drug studies (Table 6). In all 18 studies construct validity was assessed by correlations with other instruments measuring a construct similar to the goals that were expected to be set by the patients in each specific research area. Also, t-tests between the placebo and intervention condition [27], or t-tests between the lowest and highest T-score differences [28], were used to verify construct validity. None of the studies formulated a hypothesis on the expected construct validity outcomes. Therefore, the quality of the construct validity is difficult to evaluate. Of the 18 studies, 14 reported significant correlations with other measurement instruments that were relevant for the research area. The measurement instruments used to establish the construct validity varied considerably, since GAS is used in different research areas. Three studies reported that no significant correlations with other measurement instruments were found [21, 29, 30]. In one study, correlations between change scores were measured, but the results were not clearly reported [31].

Table 6 Reported construct validity of GAS in included studies

Intra- and inter-rater reliability

As can be seen in Tables 3 and 4, intra-rater reliability was not assessed in any of the included studies. Inter-rater reliability was reported in 12 studies, of which two were drug studies. Different methods were used to measure the inter-rater reliability (Table 7). In four studies we rated the quality of the inter-rater reliability as poor, whereas eight studies were rated with ‘good quality’. Eight out of the 12 studies reported an ICC score. Five of those studies reported that the ICC values were all 0.9 and higher [31–35]. Two studies reported ICC values between 0.8 and 0.95 [26, 36]. In one study, the reported ICC was lower than 0.5 [37]. The specific calculation for the ICC was reported in one study [37]. Confidence intervals for the ICC values were also reported in one study [35]. Inter-rater reliability was also reported with kappa-values [21, 38], where the values ranged from substantial to almost perfect agreement. Another method that was used was calculating a correlation, which had a value of 0.84 [28]. One study reported ‘agreement’ between objective goal setters and the therapists who performed the interventions, and ‘agreement’ between objective goal setters and people who did the intake of the patients before the patients were randomized. The results were an agreement of 43 and 57 %, respectively. However, the method used to calculate this agreement was not reported in the article [20].
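To make the kappa-based approach concrete, the sketch below computes an unweighted Cohen's kappa for two raters who each score the same goals on the −2 to +2 scale. This is an illustration only: the included studies do not state which kappa variant was used, so the choice of the unweighted statistic is an assumption, and the function name and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Proportion of items on which the two raters give the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[level] * freq_b[level] for level in freq_a) / (n * n)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical attainment levels assigned by two raters to ten goals.
rater_1 = [-2, -1, 0, 0, 1, 1, 2, 0, -1, 1]
rater_2 = [-2, -1, 0, 1, 1, 1, 2, 0, -1, 0]
print(round(cohen_kappa(rater_1, rater_2), 2))
```

Kappa corrects the raw proportion of agreement for the agreement expected by chance. The labels ‘substantial’ and ‘almost perfect’ are conventionally attached to values above roughly 0.6 and 0.8, respectively.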

Table 7 Reported inter-rater reliability of GAS in included studies

Responsiveness

Responsiveness was reported in 14 studies, of which two were drug studies (Table 8). None of the studies used measurement properties as advised by Terwee et al. [19]. Therefore, it is difficult to evaluate the quality of the responsiveness. In nine of those 14 studies, an effect size of the measured differences was reported [26, 29–31, 33, 39–42]. Of those nine studies, the reported effect size was below 1 in only one study [29]. In five studies, a Relative Efficiency was reported [26, 30, 31, 33, 41]. The relative efficiency of two procedures or measurement instruments is the ratio of their efficiencies. For instance, a comparison can be made between GAS and a regularly used measurement instrument. The Relative Efficiency varied between 3 and 57, but was substantial in most studies, meaning that GAS is more efficient, or needs fewer observations, than other measurement instruments. A Standardized Response Mean was reported in six studies [22, 23, 26, 40–42]. A standardized response mean (SRM) is an effect size index used to measure the responsiveness of scales to clinical change. The SRM is computed by dividing the mean change score by the standard deviation of the change. The SRMs that were reported varied between 1.2 and 3.54. Two studies measured responsiveness with a paired t-test comparing response before and after the intervention, with a significant difference in GAS T-scores in both studies [22, 39]. In one study, the sensitivity, specificity and positive and negative predictive value were calculated based on a group of responders and non-responders [43]. The results were 52, 85, 81 and 60 %, respectively. In another study, responsiveness was reported as the number of patients who showed a change in T-scores of different goal areas [44]. The proportion of patients showing changes on GAS was larger than on other measurement instruments. The number of patients showing change was nine out of 23 patients on the physical goals, 18 out of 23 patients on occupational goals and 12 out of 18 patients on speech goals, whereas only one patient showed change on the Gross Motor Function Measure (GMFM-66).
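The definition of the SRM given above translates directly into a few lines of code. In the minimal sketch below the T-scores are hypothetical and the function name is chosen for this illustration; the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the included studies generally did not present their formulae.

```python
import statistics


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change score divided by the standard deviation of the change."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two patients")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(changes) / statistics.stdev(changes)


# Hypothetical GAS T-scores of five patients before and after an intervention.
t_before = [38.0, 36.3, 40.1, 35.0, 37.7]
t_after = [52.0, 48.5, 55.3, 44.9, 50.0]
print(round(standardized_response_mean(t_before, t_after), 2))
```

Because the denominator is the spread of the individual change scores, the SRM becomes large when patients improve by similar amounts, even if the mean change itself is moderate.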

Table 8 Reported responsiveness of GAS in included studies

Discussion

In this systematic review, we found 58 articles, of which 38 were drug studies in which GAS was used as an outcome measure. Therefore, we may conclude that GAS has indeed been used in drug studies. Most drug studies that report any information on the validity of GAS used botulinum toxin as an intervention for spasticity, usually in combination with physical or occupational therapy. The generalizability of the results of these validation studies is limited. The validity, responsiveness and reliability of GAS in drug studies have scarcely been studied. In only seven of the 38 drug studies that we found, some validation has been performed. The methods used to validate the measurement instruments often differ from the methods as proposed by COSMIN. The quality of the methods to assess measurement properties varies, and results are often difficult to interpret. We found 20 articles concerning non-drug studies reporting on the validity, responsiveness and inter-rater reliability of GAS. However, also in studies in which GAS was used to evaluate a non-drug intervention, the quality of the validity reports leaves much room for improvement.

In most articles, either drug or non-drug studies, no definition was given of the measurement properties that were assessed, the formulae used for calculation of parameters were not presented, and in some papers the results of the validity check were not reported [26, 31]. Also, none of the included articles describe hypotheses to test construct validity, which makes evaluating the reported results virtually impossible. Therefore, we conclude that the validity and reliability of GAS have not been researched extensively, neither in studies where a drug intervention was evaluated, nor in other studies.

Of all clinimetric characteristics that were investigated, the responsiveness of GAS was investigated most thoroughly. The responsiveness was consistently reported to be very good compared to other measurement instruments, such as the Gross Motor Function Measure (GMFM-66) in the evaluation of children with cerebral palsy, or the Standardized Mini Mental State Examination (SMMSE) for geriatric assessment. However, none of the studies evaluated the responsiveness according to the guidelines as proposed by Terwee et al. [19]. Therefore, it is difficult to be conclusive on the responsiveness of GAS, although the reported results suggest we may tentatively be optimistic.

The search of this systematic review was very sensitive, to make sure that no studies on GAS were missed. However, our definition of GAS is rather specific, which excludes studies with an approach that is similar, but not exactly the same. Also, we may have missed studies that did not use similar terminology, but did use an approach similar to GAS.

Our findings are consistent with previous systematic reviews on the measurement properties of GAS. For instance, Steenbeek et al. [10] concluded that, in the setting of pediatric rehabilitation, GAS is a very responsive method for treatment evaluation and individual goal setting, but sufficient knowledge is lacking about its reliability and validity. Also, in the field of psychogeriatrics, GAS may be considered useful from a theoretical point of view. Geriatric patients are heterogeneous, and GAS may be a useful tool to evaluate geriatric interventions. However, the measurement properties of GAS in geriatrics show mixed results. The evidence is not yet strong enough to state that GAS is an applicable outcome measure in this particular field [14]. In a systematic review on the feasibility of measurement instruments related to goal setting, GAS is considered a helpful tool for setting goals, although it is time-consuming and may be difficult for patients with cognitive impairments. However, the patient-centered nature of GAS makes it easier to focus on meaningful patient-directed treatment goals. Also, according to the results, the scaling of GAS makes it possible to detect very small progress that may be of great significance to the patient, underlining its potential in responsiveness [45].

A problem in the evaluation of the validity of GAS may be that GAS does not measure one clear construct, since the content of the goals generally differs from patient to patient. One of the possibilities to overcome this inherent problem may be to make an item bank of possible goals that patients would be able to choose from, to make sure that the methodological properties of the goals are known [46]. However, this would be practically very difficult to achieve, since we suspect that for many orphan diseases the patient numbers are smaller, and goals could be more diverse than those of non-orphan disease patients. Another way of approaching the construct validity is to see GAS as a measurement instrument that measures the construct of the attainment of goals. Then, the construct validity could be evaluated by comparing GAS with another measurement instrument that evaluates the attainment of goals, such as the COPM. To our knowledge, this approach has not been considered so far.

The importance and difficulty of goals are often taken into account by assigning weights to the goals (more important goals are assigned a larger weight than less important goals). However, terms such as importance and difficulty are by nature subjective. What is important for one patient may be less important for another. For example, a Duchenne patient may perceive being able to brush his teeth as very important, whereas someone else may perceive it as trivial. Can this difference in importance be measured objectively? In a study on the reliability of GAS weights, Marson, Wei and Wasserman [47] conclude that assigning weights to the goals of GAS according to the severity of the problem has an acceptable inter-rater reliability when scored by different objective students trained in the use of GAS. This indicates that although importance and difficulty are difficult to measure objectively, objective raters may still score goals similarly. However, more research should be carried out on this topic to answer the question more definitively.

GAS is a measurement instrument with a high potential, especially in rare diseases, but in order to use it in drug studies, more research on its validity is essential. One way of achieving this would be to use GAS as an additional measurement instrument in an ongoing drug trial, to further explore its validity. For GAS to be useful, the effect of the evaluated drug should be objectively measurable in terms of behavior, and GAS should measure something that is valuable and noticeable for a patient, and cannot be measured otherwise. Also, the drug that is evaluated should have an effect that is clinically relevant. Again, Duchenne Muscular Dystrophy may serve as an example. A potential drug should do more than just improve, for instance, the dystrophin values in muscle biopsies. It should be able to improve something that is valuable for the patient, which can be measured by activities that patients perceive as important, such as brushing teeth or using a computer. GAS may be a useful outcome measure, since it can evaluate a potential drug on a patient level, and is therefore intrinsically clinically relevant.

According to guidelines on Patient Reported Outcomes (PROs) and Health Related Quality of Life by the FDA and EMA, and open comments on these guidelines by experts [48], the following qualities are essential: a PRO should be based on a clearly defined framework; patients should be involved in the development of the measurement instrument; PRO claims should be based on and supported by improvement in all domains of a specific disease; an appropriate recall period is necessary when the effects of an intervention are tested; and the test-retest reliability should be assessed, as well as the ability to detect change and the interpretability of the measurement instrument. Finally, an effect found by a PRO measurement instrument can only be valid when found in an RCT.

In general these requirements also apply to GAS, e.g. patient involvement. However, not all of them are applicable to this instrument, such as test-retest reliability. Before GAS can be used in drug trials, more validity research is needed. GAS has not yet been sufficiently validated to be supported by the regulatory agencies, but it may have potential in specific drug trials, especially in rare diseases where there is a lack of validated and responsive outcome measurement instruments.

Conclusion

We conclude that currently there is insufficient information to assess the validity of GAS, due to the poor quality of the validity studies. However, the overall reported good responsiveness of GAS suggests that it may be a valuable measurement instrument. GAS is an outcome measure that is inherently relevant for patients, making it a valuable tool for research in heterogeneous and small samples. Therefore, we think that GAS needs further validation in drug studies, especially since GAS can be a potential solution when only a small heterogeneous patient group is available to test a promising new drug.

Abbreviations

ADAS-cog, Alzheimer’s disease assessment scale – cognitive subscale; AHA, assisting hand assessment; AMPS, assessment of motor and process scales; AQoL, assessment of quality of life; ARAT, action research arm test; AUC, Area under the receiver operating characteristics curve; BAD-scale, Barry-Albright Dystonia scale; CBS, Caregiving Burden Scale; CDS, Cardiac depression scale; CES-D, Center for epidemiological studies depression scale; CGI, clinical global impression; CHQ, child health questionnaire; CIBIC-plus, Clinician’s interview based impression of change-plus; COPM, Canadian occupational performance measure; DAD, disability assessment for dementia; DCD Pinch, dynamic computerized dynamometry; FAC, functional ambulation category; FAQ, functional activities questionnaire; FIM, functional independence measure; GAS, goal attainment scaling; GHQ, general health questionnaire; GMFM, gross motor function measure; HADS, hospital anxiety and depression scale; IADL, instrumental activities of daily living; ICC, intraclass correlation coefficient; LASIS, Leeds adult spasticity impact scale; LoA, limits of agreement; MAS, modified Ashworth scale; MAUULF, Melbourne assessment of unilateral upper limb function; MHOQ, Michigan hand outcomes questionnaire; MIC, minimal important change; MMSE, mini-mental state examination; MPQ, McGill pain questionnaire; MTS, Modified Tardieu Scale; NHP, Nottingham health profile; NRS, pain intensity numerical rating scale; OARS IADL, Older Americans resource scale for instrumental activities of daily living; ODQ, Oswestry low back pain disability questionnaire; PAIRS, pain and impairment relationship scale; PDMS-FM, Peabody developmental motor scale – fine motor; PEDI, pediatric evaluation of disability inventory; PET-GAS, psychometrically equivalence tested goal attainment scaling; PSMS, physical self-maintenance scale; QoL, quality of life; QUEST, quality of upper extremity skills test; RR, responsiveness ratio; SDC, smallest detectable change; TSA, Tardieu Spasticity Angle