2.1 Introduction

For clinicians, and especially for therapists who measure client performance, our role in delivering measurements effectively and consistently has never been more critical. This is in part due to the success of clinical trials in rare diseases which affect motor performance, and which will therefore often involve measurements of motor performance as a primary outcome. Ultimately the success of the trial – correctly identifying the therapy as effective or not – rests on the ability of this endpoint to detect an actual improvement correctly and not record a negative trial due to the scale’s inadequacies as a measurement tool. This has required us as clinicians to raise our game in terms of conducting performance outcome assessments consistently across a large number of centers, often in different countries [14]. It has also, of course, shone a spotlight on the validity of the chosen outcome instruments to measure what they say they will.

These requirements should not distract us from our additional role of measuring performance and function in order to evaluate the effect of a given non-medicinal intervention or therapy such as exercise or orthotics. A physiotherapist evaluating a child using a motor scale in a clinic may use the information gained to evaluate disease progression as well as guide the family in the provision of equipment and predicting the need for additional support in the future. This role is not separate from the physiotherapist’s role in a clinical trial where the motor scale is a primary outcome and may also be essential to ensure that the necessary standard of care is provided to the participant. A good example is in Duchenne muscular dystrophy (DMD) where the North Star Ambulatory Assessment (NSAA) can identify the need for a change in steroid management or identify a loss of range of movement in the ankle joints requiring a program of stretches and orthotics as a recommended approach in the published standard of care guidelines [2].

As clinicians we have an additional responsibility. Person-centred outcomes require a human touch if they are to be delivered appropriately. Asking patients to repeat the same measure time and again over a long period requires sensitivity as well as consistent delivery. Care cannot be just about measurement; it is about appreciating the difficulties experienced by patients who find themselves getting worse and finding tasks harder at each visit. We must remember that our role is not just to use data to drive a natural history study or a clinical trial, but also to attend to how the information can be applied to management, and to how our words and actions, as we ask patients to do yet more things, can make the difference between them feeling merely measured and feeling cared for. This is of course another good reason to always involve patients in the evaluation and development of outcome measures, and it is particularly important in patient reported outcome measures. Their early and direct involvement will ensure the necessary sensitivity and clarity.

With this in mind, this chapter aims first to illuminate the specific benefits of Rasch analysis in the evaluation of performance, using practical examples to explain why these methods enhance more traditional evaluations of reliability and validity. Second, I will discuss the importance of the role clinicians play in the interpretation and decision making process when using Rasch analysis. Third, I explore the role of the clinician alongside other key stakeholders to ensure that wherever we assess a patient (in clinic, at home, or via video link) the value and clinical meaning of measurements can be understood and potentially equated with other similar scales conducted in an alternative forum. This speaks especially to the role of measurement in telemedicine.

2.2 Rasch Analysis Explained and Why Clinician Involvement Is Key

In this section we are going to use various real-world scales to illustrate measures of scale robustness utilized in modern psychometric methods and importantly highlight the complementary role of the clinician in interpretation of the analysis. The roles we as clinicians play in making ongoing decisions about a scale’s development—decisions that enhance the analysis and benefit the scale’s evolution—are stories often left untold within our medical publications. These accounts convey key components of the development of patient centred outcome measures requiring multi-disciplinary collaborations with patients and families [15].

2.2.1 Guttman Principle

Rasch methodology compares our real-world data (ordinal in nature) with the perfect mathematical scale (interval level data), where each item in the scale contributes to the measurement of one thing. This single thing is often termed a construct or a concept. It is also essential that the scoring options for each item incrementally improve or increase in the same way as the total score increases. Additionally, in this cumulative scale, the difficulty of each item is always ranked in the same order: there is a hierarchy of difficulty. This perfect scale is best summed up as the Guttman distribution, and clinicians need to understand this model in order to understand their own data. Figure 2.1 illustrates the perfect scale where each item is scored as a 0 (unable) or 1 (able). What makes this such a powerful scale is that, because the order of item difficulty does not vary, knowing the total score means you know exactly how an individual scored each item. For example, if the total score was 3 out of a top score of 4, you know the individual was able (score 1) to do items 1, 2 and 3 but not able to do item 4.

Fig. 2.1
A tabular format of the Guttman principle. The columns are items 1 through 4 and the total score, showing increasing item difficulty. The rows, person A through person E, show increasing patient ability.

Illustration of the Guttman principle
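The property described above, that a total score on a perfect Guttman scale fully determines the item responses, can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and items are assumed to be listed from easiest to hardest.

```python
def guttman_pattern(total_score, n_items):
    """Response pattern implied by a total score on a perfect Guttman scale.

    Assumes dichotomous items (0 = unable, 1 = able) ordered easiest to hardest.
    """
    if not 0 <= total_score <= n_items:
        raise ValueError("total score must lie between 0 and the number of items")
    return [1] * total_score + [0] * (n_items - total_score)


def is_guttman_consistent(responses):
    """True if the responses match the pattern implied by their own total."""
    responses = list(responses)
    return responses == guttman_pattern(sum(responses), len(responses))


print(guttman_pattern(3, 4))                # [1, 1, 1, 0]
print(is_guttman_consistent([1, 1, 1, 0]))  # True
print(is_guttman_consistent([1, 0, 1, 1]))  # False: a harder item passed after an easier one failed
```

Real clinical data rarely match this ideal exactly; the point of the comparison is to see how far, and where, the observed responses depart from it.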

So, let’s use an example and see how the North Star Ambulatory Assessment (NSAA) which is used to measure motor performance in boys and young men who can walk and who have a diagnosis of DMD [18] compares to the perfect mathematical scale.

2.2.2 The North Star Ambulatory Assessment

The NSAA is a multiple item rating scale (17 items) with three ordered response categories (2, 1, or 0) which are summed to give a total score. Items are scored either 2 (‘normal’ with no obvious modification of activity), 1 (modified method but achieves goal independent of physical assistance from another), or 0 (unable to achieve independently). A total ‘ambulatory function’ score is generated by summing the ratings across the items. A higher score indicates better motor performance. The analyses presented here have been published previously [9, 11], but we will add clinical subtext, especially around the role of the physiotherapist in making decisions about the scale’s content.

The first test we conducted, which was designed to reassure clinicians, compared how the RUMM2030 program [1], a software package used for measurement scaling analyses, ranked the 17 items in order of difficulty with the expert opinion of five neuromuscular physiotherapists who regularly used the NSAA. This enabled a comparison to be made between the Rasch item hierarchy and clinical expectation. Consistency was examined qualitatively and statistically (Spearman’s rho). Both the clinicians and the Rasch analysis ranked stand and walk as easiest and jump and hop as hardest, and the Spearman rho was correspondingly high, which supports the clinical utility of the total score.
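As a rough illustration of the statistical comparison described above, the sketch below computes Spearman's rho between two difficulty rankings. The six items and both sets of ranks are hypothetical and are not the published NSAA results; the formula used assumes there are no tied ranks.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items (no ties assumed)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks (1 = easiest, 6 = hardest); illustrative values only.
items = ["stand", "walk", "stand up", "rise from floor", "jump", "hop"]
rasch_rank = [1, 2, 3, 4, 5, 6]
clinician_rank = [1, 2, 4, 3, 5, 6]

rho = spearman_rho(rasch_rank, clinician_rank)
print(round(rho, 3))  # 0.943
```

A value close to 1 indicates that the two orderings largely agree; the qualitative check (which items sit at each end of the hierarchy) remains just as important as the coefficient.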

Next, a key test of a scale is the ordering of the response categories. Each NSAA item has multiple response categories which reflect an ordered continuum of better ambulatory function (2, 1, and 0). The point at which performance on an item moves from one score to another is called a threshold. Although this ordering may appear clinically sensible at the item level, it must also work when the items are combined to form a set. Rasch analysis tests this statistically and graphically. When the response options are working as expected, this provides some important evidence for the validity of the scale.

This matters because each item’s scoring options must rank numerically higher as a higher total score is achieved. If they do not, the total score can be questioned. This lack of incremental improvement or decline is termed disordered thresholds and can occur for many reasons. Perhaps the wording or interpretation of an item’s scoring options is unclear or too complicated (too many response options). Perhaps the scoring options are different strategies rather than hierarchical changes. Or perhaps the patient sample does not include people performing at that level.
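To make the idea of ordered and disordered thresholds concrete, here is a minimal sketch using the partial credit form of the Rasch model for a single item scored 0, 1 or 2. The threshold values are hypothetical, and this is not the RUMM2030 procedure; it only shows that when thresholds are disordered, the middle score is never the most likely response at any level of ability.

```python
import math


def category_probabilities(theta, thresholds):
    """Partial credit model probabilities for scores 0..len(thresholds)."""
    # Cumulative sums of (theta - threshold) give the log-numerator of each score.
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (theta - tau))
    largest = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - largest) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds):
    """True if each threshold is located above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=241):
    """Scores that are the single most probable response somewhere on a grid of abilities."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal


ordered = [-1.0, 1.5]     # hypothetical thresholds in logits
disordered = [1.5, -1.0]  # the same values in the wrong order

print(thresholds_ordered(ordered), sorted(modal_categories(ordered)))        # True [0, 1, 2]
print(thresholds_ordered(disordered), sorted(modal_categories(disordered)))  # False [0, 2]
```

In the disordered case the score of 1 never "takes its turn" as the most probable response, which is the statistical signal that prompts a clinical review of how that scoring option is worded or used.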

For example, in the NSAA, item 3 describes the ability to “stand up” from a chair. A score of 2 means the patient is strong and can stand up without using their arms or moving their feet. A score of 1 is defined as using an adaptation to the start position in order to stand up. This could be achieved by widening the feet or using their arms. However, if the scale attempted to define “uses arms” as always meaning that a person doing so was stronger than one who moved their legs, the thresholds may be disordered, because some may use one strategy and others another, even though their weaknesses may be similar.

In Fig. 2.2 we can look at the map of the response categories and see how we can interpret them using our clinical knowledge. If this map did not make sense, this would suggest we need to revisit our scale and take into consideration some of the points raised above in the example of “stand up.” For the NSAA, “stand up” has ordered thresholds, suggesting that the three scoring options were working as planned and that the grouping of strategies into a score of 1 was clinically sensible and unambiguous. It is this clinical knowledge of how a condition progresses, and how boys and young men present, that means we as clinicians lie at the heart of making changes to a scale. A scaling analyst may suggest we need to review the scoring options, but it would be clinicians who made changes using expert knowledge of a disease and its progression.

Fig. 2.2
A horizontal bar graph with multiple data. Values are estimated. Hop R, 0, negative 5 to 1.8, 1, 1.8 to 3.3, 2, 3.3 to 4. Hop L, 0, negative 5 to 1.7, 1, 1.7 to 2.8, 2: 2.8 to 4.

Threshold Map for NSAA Items in ranked order of difficulty according to Rasch analysis. (“Gowers” = adapted method of getting up from floor when muscles are weak)

Next, we wanted to examine the match between the range of ambulatory function measured by the NSAA and the range of ambulation measured in the sample of children. Figure 2.3 shows the adequate targeting between the distribution of person measurements (upper histogram) and the distribution of item locations (lower histogram). This analysis informs us as to how suitable the sample is for evaluating the NSAA and how suitable it is for measuring the sample. This is often called targeting and better targeting means the scale is working well independent of the sample, as well as working for the sample tested.

Fig. 2.3
A set of two bar graphs titled person item threshold distribution plots frequency versus location. Above, the bar is upright high around 18 persons and below, inversely high at items above 5.

Person-item location distribution

Traditionally, we understand this as a floor and ceiling effect; Rasch analysis allows us to also examine the spread of items. This allows us to ask questions such as “Are there some children we are not measuring well?” (gaps in the scale) or “Are there some levels we are over-assessing?” (bunching of items). For the NSAA there is a small ceiling and, as the disease is progressive, there will be a floor as boys and young men lose ambulation. Clinicians will immediately recognize this as appropriate for a sample and can use their clinical knowledge to suggest additional items to bridge gaps in the current scale, to reduce any ceiling or floor effects, or to suggest items to remove because they measure the same level of ability. For instance, the cluster of items at about 1 logit on the horizontal scale relates to the box step items, some of which may justifiably be removed from a scale that aims to measure level of ability only.

Next, we examined the items of the NSAA to see if they worked together (fit) cohesively. If they are measuring more than one thing, some items would be inappropriate to the overall interpretation of the ratings and to considering the total score as a basis for the measurement of ambulatory motor function. When items do not work together (misfit) in this way, the validity of a scale is questioned. The methods for examining this using Rasch methods are four-fold and it is best to interpret all four tests together in the context of your clinical experience, including:

  1. Fit residuals (log residuals), which should lie within ±2.5, depending on sample size and test length;

  2. χ² values (item–trait interaction), which need to be numerically similar to each other;

  3. t-test for unidimensionality, a standard test of whether the scale measures one thing or not; and

  4. item characteristic curves, which tell you whether the way an item works in real life is as the model predicts.
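The intuition behind residual-based fit can be sketched with a simplified stand-in. The example below uses the dichotomous Rasch model and an unweighted mean-square of standardized residuals (often called outfit); it is not the log-transformed fit residual that RUMM2030 reports, and the person locations and response patterns are hypothetical.

```python
import math


def rasch_probability(theta, delta):
    """Probability of success for ability theta on an item of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def standardized_residual(x, theta, delta):
    """Observed minus expected response, scaled by its model standard deviation."""
    p = rasch_probability(theta, delta)
    return (x - p) / math.sqrt(p * (1.0 - p))


def outfit_mean_square(responses, abilities, delta):
    """Mean of squared standardized residuals for one item across persons."""
    squared = [standardized_residual(x, t, delta) ** 2
               for x, t in zip(responses, abilities)]
    return sum(squared) / len(squared)


abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical person locations (logits)
delta = 0.0                              # hypothetical item difficulty

as_expected = [0, 0, 1, 1, 1]  # weaker persons fail, stronger persons pass
surprising = [1, 1, 0, 0, 0]   # the reverse pattern

print(round(outfit_mean_square(as_expected, abilities, delta), 2))
print(round(outfit_mean_square(surprising, abilities, delta), 2))
```

The first pattern gives a mean-square well below 1 and the second a value well above 1, because responses that contradict the expected hierarchy produce large residuals. Whether a given degree of misfit matters is then a clinical as well as a statistical question.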

For this measure of fit let us use the NSAA again as an example of how a clinician’s experience can assist interpretation and the steps taken. For fit residuals, three items misfit, but two of these only slightly (climb and second box step right leg first). One item, however, showed significant misfit: lifts head in supine position. This inconsistency makes clinical sense. It is a motor task impacted by DMD, and it was included in the original scale because performance often improved when steroids were started. That is, it was originally included even though it does not directly relate to ambulatory function. The inconsistency of the ratings for the “lifts head” item makes clinical sense, and the result was reflected in other measures of fit, such as the item characteristic curve (the item did not change in the same way as other items did as the disease progressed) and the χ² values (high compared to other values). It was therefore decided that in the linearized scale (where the ordinal level data is converted to interval level measurement) “lifts head” would be removed from the total score prior to transformation. This decision was also approved by clinicians as clinically sensible.

Another analytical tool is the Person Separation Index (PSI), a reliability statistic comparable to Cronbach’s alpha. Higher values indicate greater reliability. The NSAA had a high reliability which confirmed findings from a study that examined reliability using traditional methods (intra-class correlations) [13].
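One common way of expressing a person separation statistic is the proportion of observed variance in person estimates that is not attributable to measurement error. The sketch below follows that general form; the exact computation in any given software package may differ, and the person estimates and standard errors are hypothetical.

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_var = variance(estimates)
    error_var = mean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var


# Hypothetical person locations (logits) and their standard errors.
estimates = [-2.0, -1.0, 0.0, 1.0, 2.0]
standard_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

psi = person_separation_index(estimates, standard_errors)
print(round(psi, 2))  # 0.9
```

Read this way, a higher value means the scale spreads people out by more than the uncertainty attached to each person's measurement, which is why it is interpreted in a similar way to Cronbach's alpha.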

When dependency was examined for the NSAA, several pairs of items were found to be dependent. Dependency occurs when the score on one item directly influences the score on another item. If this happens, measurement estimates can be biased and reliability (PSI) is artificially elevated. This was true for the NSAA items measuring right and left sides (stand on one leg, hop, climb and descend box step). The reason the scale includes both sides is that, though DMD is not predominantly an asymmetrical disorder, differences can influence functioning at home and in the community and may require individual management, such as a right ankle splint. Given the clinical importance of measuring both sides, the decision was made to remove the scores from one side of the body to see if this influenced the reliability. As their removal did not change the PSI value significantly, clinicians decided to keep the bilateral measures given their clinical utility in assessing for standards of care.

Next, we assessed stability of the scale (differential item functioning) to understand if different subgroups performed items in a similar way regardless of other differences. For instance, for one group we assessed different treatment regimes of steroids, but for other cohorts we wanted to see if gender or age made a difference to scoring stability. For the NSAA, the type of regime did not influence the way boys scored the items, which provides reassurance that it can be used in all ambulant individuals with DMD.

Finally, we examined how closely the summed NSAA scores, which are by definition ordinal, correspond to the interval-level measurements. Basically, the question is, how close is the real-world data to looking like a ruler? Rasch analysis estimates linear measurements from raw scores, a relationship that can be illustrated in a graphical plot known as a logistic ogive (Fig. 2.4). This figure shows that the change in interval measurements associated with a one-point change in the NSAA total score is nonlinear: it varies across the range of the scale, just as ordinal scores always do.

Fig. 2.4
A line graph plots score versus location. Values are estimated. Legend 1 starts from (6, 0) and ends at (5, 35). Point (3.8, 32) is where legend 1 pass. The curve is sigmoidal in shape.

Ordinal score to interval measure transformation graph
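The shape of this transformation can be reproduced with a small sketch. Under the partial credit form of the Rasch model, the expected total score rises monotonically with ability, so each non-extreme raw score can be mapped back to a logit location by a simple search. The scale below is hypothetical (four items scored 0 to 2, all sharing the same thresholds for simplicity) and the bisection search is a stand-in for the estimation a dedicated package would perform.

```python
import math


def expected_item_score(theta, thresholds):
    """Expected score on one polytomous item under the partial credit model."""
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (theta - tau))
    largest = max(log_num)
    weights = [math.exp(v - largest) for v in log_num]
    return sum(score * w for score, w in enumerate(weights)) / sum(weights)


def expected_total(theta, items):
    """Expected raw total across all items at ability theta."""
    return sum(expected_item_score(theta, thresholds) for thresholds in items)


def score_to_logit(raw_score, items, lo=-10.0, hi=10.0, tol=1e-6):
    """Ability at which the expected total equals the raw score (bisection)."""
    max_score = sum(len(thresholds) for thresholds in items)
    if not 0 < raw_score < max_score:
        raise ValueError("extreme scores have no finite logit estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_total(mid, items) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical scale: four items scored 0-2 with thresholds at -1 and +1 logits.
items = [[-1.0, 1.0]] * 4
logits = {s: score_to_logit(s, items) for s in range(1, 8)}
gaps = {s: logits[s + 1] - logits[s] for s in range(1, 7)}

for s, gap in gaps.items():
    print(f"raw {s} -> {s + 1}: {gap:.2f} logits")
```

For this toy scale a one-point change near either end of the raw score range corresponds to a larger change in logits than a one-point change in the middle, which is the nonlinearity the ogive displays.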

Tests of ordered scoring options, fit, stability, dependency and reliability support the scaling of the NSAA as an interval level measuring instrument. This means that change is measured and defined consistently across the scale. Therefore, regardless of how strong or weak an individual was, the change measurement means the same thing.

This NSAA “ruler” was then used to examine the differences in response to two steroid treatment regimes, which, in the context of having identified differences that make a difference, could then be used to estimate a minimally important difference. Key to our clinical interpretation was describing minimally important differences (MIDs) in terms of meaningful change to the individuals and their families. The proposed MIDs could be equated to significant ‘milestones’ of loss. In more able males, a fall in interval level measurements from 90 to 80 (raw score 31–29) means they can no longer hop, and a fall from 50 to 40 (raw score 16–11) fits with an inability to rise independently from the floor. In weaker males, a fall from 21 to 11 (raw score 3–1) means they lose the ability to stand still. This also fits with our clinical understanding of the hierarchy of difficulty of items which was reported earlier.

Subsequent publications using this linearized scale have gone on to report change and average rates of decline [17], and have used the total score to identify different clusters within the wider cohort showing different patterns of change over time and the degree to which these patterns may be associated with age [16].

Finally, Rasch analysis allows comparisons of one scale with another that purports to measure the same construct. A good published example of this kind of equating is the development of the Revised Upper Limb Module (RULM) [12]. Here an existing scale designed for weak young patients with spinal muscular atrophy (SMA) was adapted (by clinicians and physiotherapists assisted by individuals with SMA) to measure stronger individuals. The RULM has proven useful in this population, both in the clinic and in clinical trials [4, 21].

Work is underway to further test the RULM in other populations, and to use it to advance our understanding of the relationship between PROMs and functional scales [10]. Another example in this vein concerns a study conducted in the clinically important context of dysferlinopathy—a type of limb girdle muscular dystrophy—that involved the equating of two scales (the NSAA and the Motor Function Measure 20) brought together to produce a novel scale suitable for ambulant and non-ambulant individuals. The resulting North Star Assessment for Limb Girdle Type Muscular Dystrophies (NSAD) [7] is now being validated in a study incorporating a larger group of muscular dystrophies.

Figure 2.5 illustrates how the RULM measured more able individuals and created distinctions among those who were clustered at the ceiling of the original ULM. Advanced measurement modelling methods provide valuable insights into the suitability of any scale’s ability to measure change effectively. Clinicians can benefit from this and should partner with measurement colleagues to build better scales.

Fig. 2.5
A line graph of score versus location. R U L M starts from (negative 6, 1), and ends at (8, 35). U L M starts from (negative 5.5, 0), and ends at (3.8, 17). At the top right, R U L M and legend are encircled.

Equating of RULM and ULM. The vertical axis shows that an ordinal score on the ULM of 15 is equivalent to a 26 on the RULM. The entry item was excluded from the total score on the RULM
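The logic of equating two scales that share a common logit metric can be sketched as follows: convert a raw score on the first scale to a location on the shared ruler, then read off the expected score on the second scale at that location. The example uses dichotomous items and invented difficulties for brevity; it does not reproduce the ULM or RULM item calibrations.

```python
import math


def expected_score(theta, difficulties):
    """Expected raw total on a set of dichotomous Rasch items."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties)


def logit_for_score(score, difficulties, lo=-12.0, hi=12.0, tol=1e-6):
    """Ability at which the expected total equals the given raw score."""
    if not 0 < score < len(difficulties):
        raise ValueError("extreme scores have no finite logit estimate")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def equate(score_a, scale_a, scale_b):
    """Expected score on scale B for a person with score_a on scale A."""
    theta = logit_for_score(score_a, scale_a)
    return expected_score(theta, scale_b)


# Hypothetical item difficulties in logits on a shared metric.
original = [-3.0, -2.0, -1.0, 0.0]
revised = original + [1.0, 2.0, 3.0]  # same items plus harder ones

for score in (1, 2, 3):
    print(score, "->", round(equate(score, original, revised), 2))
```

Because the revised scale in this sketch adds harder items, a near-ceiling score on the original scale maps to a mid-range score on the revised one, leaving room to distinguish among more able individuals.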

2.3 Top Tips When Choosing Which Scale of Performance to Use

It is important that any scale that you use as a measurement tool within your clinical setting suits the patient group in question. You will want a scale sensitive to clinically significant change, perhaps in more than one direction. Because we are busy, clinicians often think that using an existing scale and applying it to a new population is the best and quickest way forward, and that any problematic issues can be sorted out at a later stage. Then, after we have used a convenient scale for some time, we may be lulled into accepting it despite various shortcomings. Later, after accruing quite a lot of longitudinal data, even though it is not a great scale, we don’t want to change because we would then lose comparable continuity with our old data!

So where can we go from here? Historical data offers powerful opportunities for evaluating the internal workings of a scale. This can then guide the experts to adapt it to improve its measurement quality or replace it with an alternative scale which others have developed. It may be that these options are not relevant to your situation, and you need to start from scratch. In that case, we suggest you make good use of the literature. The rather lengthy iterative process of patient involvement, scale development, testing, changes and re-evaluation [3, 5, 15], although rewarding, comprises “a large set of wheels that not everyone has time to turn”.

Beware of scales designed for conditions and populations different from those of concern to you. Ask tough questions of the scales you intend to use or have been asked to use. Does it measure what it says it measures? How appropriate is an infant scale for use in adults? How does a scale designed for upper motor neuron problems work in a disease where the main issue is fatigue? These questions can only be answered if you know your disease and how it progresses, and if you work with a team of experts who are critically engaged; these experts must include patients and their families.

Don’t be afraid of research literature that uses a lot of statistical tests. Learn with others, ask those who know more about measurement for advice, and always hold up a scale to your clinical sensibility for careful examination. A measure of performance that is reported to be highly reliable may not be valid. Conversely, a scale with valid content may not be precise enough to make distinctions at the degree of clinical significance needed to support your decision process. You may be acutely aware that a scale of motor performance does not tell you what you need to know about your weakest patients (so find an alternative). Or, from a practical point of view, you may know that the activities involved are not safe in a particular group. Essentially, ask questions!

2.4 The Future: Telemedicine?

Circumstances of global reach have meant that, as clinicians, we have been seeing our patients and families at more than arm’s length. Our current face-to-face scales have not been possible to administer over the phone, or at the very least have been difficult to perform by video conference. We have an opportunity here to compare our face-to-face assessments with those done remotely by video, or to adapt our current measures to be more suitable for use in the home. This work is already underway, with some success in demonstrating the value of these remote measures [8]. We can also see how our current scales relate to an alternative model, such as a patient reported outcome that measures the same construct. The issues and comparability of these different measures of performance can be evaluated using some of the techniques touched upon in this chapter (equating rating scales), which are described in more detail elsewhere [6]. This method of comparing scales that purport to measure the same construct is supported by regulatory authorities, is often described as triangulation, and can be seen in action using advanced measurement modelling in diseases such as dementia [19]. We may also find that this new world engages us more with digital health technologies. It will be key that these novel methods summarize sensor data into meaningful outcome measures [20], and we have a role in ensuring that any novel measure has purpose and meaning.

As clinicians we are highly privileged to be directly involved with patients. The future may mean we need to take better account of the large part of their lives that they conduct when they are not in clinic with us. Our ability to listen, interpret and support individuals will play a significant role. We must ensure we are part of future scale development, and we must not be afraid to embrace numbers, because numbers only matter when we attach meaning to them that matters to individuals.