Background

One commonly used outcome measurement in patients treated surgically for soft tissue or bone sarcoma is the Musculoskeletal Tumor Society Score (MSTS) [1]. The MSTS was developed to evaluate postoperative function aiming to permit comparisons of end-results from different surgical treatments [1, 2]. To be useable, an outcome measurement should demonstrate content validity, i.e. include items important to the patient group and items relevant for the construct to be measured [3]. No study has asked patients with bone sarcoma what functions and activities in daily life they consider important and compared that to the MSTS. Additionally, the construct of the MSTS has not been compared to established frameworks. Therefore, it is unknown whether the items of the MSTS measure functions and activities that are important to the patient group or whether they reflect the construct of functioning. Despite the lack of evidence for content validity, the MSTS for the lower extremity (MSTS-LE) has shown consistent results for construct validity, with moderate to high correlations to other measurements of functioning, for example to the Toronto Extremity Salvage Score (TESS) and the Short Form 36 physical function [4,5,6,7]. Conversely, the results for internal structure of the MSTS are inconsistent. Some studies have found ceiling effects for both MSTS single items and sum score [4, 6, 7] while others did not find ceiling effects [5, 8]. Three studies have tested the MSTS for structural validity, with the conclusion of a one-factor solution, i.e., unidimensionality, for the MSTS [5, 6, 9]. However, the studies showed eigenvalues close to cut-off for a two-factor solution and all three studies showed moderate to low factor loadings for the items Pain and Emotional acceptance [5, 6, 9]. Based on results of low factor loadings for Pain and Emotional acceptance, one could question their reflection of functioning. 
If the MSTS is to be used in future research and clinical practice for the evaluation of functioning, further evidence of its ability to reflect functioning is needed.

The aim of this study was therefore to evaluate the validity of the MSTS-LE, more specifically its content validity, data quality, internal consistency, structural and construct validity.

Methods

Design, inclusion, and patient characteristics

Data for this project was extracted from three cohorts including patients with bone sarcoma or giant cell tumour of bone in the lower extremity going through bone tumour resection and reconstruction with a tumour prosthesis in the hip or knee (Table 1). Assessments were completed once for each patient, i.e., the study design was cross-sectional. All three cohorts (n = 87) were used for analyses of internal structure. Cohort one (n = 30) was in addition tested for content and construct validity.

Cohort one included 30 patients enrolled from a complete cohort of 72 patients [10]. The patients had undergone surgery between 2006 and 2016 at the Musculoskeletal Tumor Section, Department of Orthopedic Surgery, University Hospital Rigshospitalet, Copenhagen. The included patients (n = 30) were interviewed using the Patient Specific Functional Scale (PSFS) and assessed using the MSTS and concurrent outcome measurements at mean 7 (range, 2–12) years after surgery.

Cohort two included 24 patients enrolled from a complete cohort of 50 patients [11]. The patients had undergone surgery between 1985 and 2005 at the Musculoskeletal Tumor Section, Rigshospitalet, Copenhagen. The included patients (n = 24) were assessed using the MSTS at mean 15 (range, 4–29) years after surgery.

Cohort three included 33 patients enrolled from a national cohort using the Global Modular Replacement System (GMRS) only as tumour prosthesis for reconstruction of bone [12]. The patients had undergone surgery between 2005 and 2013 at the Musculoskeletal Tumor Sections at Rigshospitalet, Copenhagen, and Aarhus University Hospital, Aarhus. The included patients (n = 33) were assessed using the MSTS at mean 5 (range, 1–11) years after surgery.

Table 1 Patient characteristics of the three included cohorts (n = 87)

Musculoskeletal Tumour Society Score (MSTS)

The Danish MSTS-LE was used [1, 7]. It comprises six items (Pain, Function, Emotional acceptance, Supports, Walking ability and Gait) and is administered by a clinician. Each item is scored on a 6-point Likert scale, ranging from 0 (worst possible score) to 5 (best possible score) [1]. The items have unique response options for 0 through 5 (Table 2). A sum score for the six items is calculated (maximum 30 points) and normalised to a 0–100 score.
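The normalisation of the sum score can be illustrated with a minimal sketch. It assumes the 0–100 score is simply the raw sum expressed as a percentage of the 30-point maximum; the function name `normalise_msts` and the example patient are hypothetical and not taken from the study.

```python
def normalise_msts(item_scores):
    """Convert six MSTS-LE item scores (each 0-5) into a 0-100 score.

    Assumption: the normalised score is the raw sum as a percentage
    of the maximum of 30 points.
    """
    if len(item_scores) != 6:
        raise ValueError("the MSTS-LE has six items")
    if any(not 0 <= score <= 5 for score in item_scores):
        raise ValueError("each item is scored from 0 to 5")
    raw_sum = sum(item_scores)          # 0-30 points
    return raw_sum * 100 / 30           # percentage of the maximum


# Hypothetical patient, items in the order: Pain, Function,
# Emotional acceptance, Supports, Walking ability, Gait
example_score = normalise_msts([5, 3, 4, 3, 3, 3])
print(example_score)
```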

Table 2 Items and response options of the Musculoskeletal Tumour Society Score for lower extremity

Semi-structured interview used for the evaluation of content validity

The Patient Specific Functional Scale (PSFS) is designed to identify patient-important functions and activities. It is valid for use in numerous diseases and conditions and can be administered as a semi-structured interview or as a patient reported outcome (PRO) [13, 14]. We chose the semi-structured interview modality, carried out by a physiotherapist (LF). The patients were asked to identify up to five important functions or activities they were unable to perform or had difficulties with because of the condition. Once identified, the activities were categorised and listed. Activities that included the same type of movement were categorised into a meaningful concept [15]. For example, “walking on uneven surfaces”, “walking fast” or “walking long distances” were categorised into Walking. Sport represented any sports-related activity, for example playing golf, water polo, or swimming. Running was a separate category, as it could be either sports related or a means of moving quickly from one place to another, e.g., running to catch a bus. After identifying important activities, the patients were asked to score them for level of difficulty on an 11-point scale (0 = unable to perform the activity, 10 = able to perform activity at the same level as before surgery) [13, 16]. For each category, the number of patients identifying the activity and the median level of difficulty were presented in a chart. Individual mean PSFS scores were also used for the evaluation of construct validity.
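The summary step described above (counts and median difficulty per category, plus an individual mean PSFS score) could be sketched as follows. The data layout, the function name `summarise_psfs` and the example responses are hypothetical; counting each patient once per category is an assumption about how the chart was built.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_psfs(responses):
    """responses: list of (patient_id, category, difficulty 0-10) tuples."""
    by_category = defaultdict(list)   # category -> difficulty scores
    patients = defaultdict(set)       # category -> patients naming it
    by_patient = defaultdict(list)    # patient -> all of their scores
    for patient, category, difficulty in responses:
        by_category[category].append(difficulty)
        patients[category].add(patient)
        by_patient[patient].append(difficulty)
    categories = {
        c: {"patients": len(patients[c]), "median": median(scores)}
        for c, scores in by_category.items()
    }
    patient_means = {p: mean(scores) for p, scores in by_patient.items()}
    return categories, patient_means


# Hypothetical interview data, not taken from the study
responses = [
    (1, "Walking", 3), (1, "Running", 0),
    (2, "Walking", 4), (2, "Sports", 1),
    (3, "Running", 0),
]
categories, patient_means = summarise_psfs(responses)
print(categories)
print(patient_means)
```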

Concurrent outcome measurements

Numeric Rating Scale (NRS) is a valid and widely used tool for the measurements of pain intensity among patients with varying conditions [17, 18]. The patients were asked to score current pain intensity (0 = no pain, 10 = worst pain imaginable).

Toronto Extremity Salvage Score (TESS) is a patient-specific PRO developed to account for the heterogeneity of functioning in patients with bone and soft-tissue sarcoma [19, 20]. It is unidimensional and comprises 30 questions about daily tasks, work/school and leisure time [19]. Difficulties performing the activities are scored on a 5-point Likert scale (1 = impossible to do, 5 = not at all difficult). The total score is calculated as a percentage of the maximum score. The Danish version has shown acceptable comprehensibility, test-retest reliability and construct validity [21].

The EORTC QLQ-C30 is a multidimensional PRO measuring quality of life (QoL) in patients with cancer [22]. The QLQ-C30 is widely used, shows robust psychometric properties, has population-based reference data and is translated into Danish [22,23,24]. It consists of 30 questions scored on Likert scales [4]. We used the sum score and sub scores for physical functioning, emotional functioning and pain in the analyses, normalised to 0–100-points. A high sum score represents a high QoL, a high functioning score represents high levels of functioning and a high pain score represents high levels of pain [22, 24].

The 30-second chair stand test (CST) assesses muscle power and strength of the lower extremity. It can predict deterioration of function and can be used in people of different ages and with different diseases [25,26,27,28,29,30]. It has shown good reliability (ICC > 0.80) and a measurement error of 1 repetition [26]. The patients were asked to stand up and sit down from a 45 cm chair as many times as possible in 30 s. A standardised protocol from the Association of Danish Physiotherapists was used.

6-minute walk test (6MWT) assesses walking capacity and has been used on patients with numerous diagnoses, including bone sarcoma [10, 26, 31,32,33,34,35]. It has shown ICCs of > 0.90 and measurement errors between 14 and 30 m [26, 34]. The patients were asked to walk as fast as possible back and forth on a 20-meter walking track in an enclosed corridor at the hospital. A standardised protocol from the Association of Danish Physiotherapists was used.

Analysis

Demographic data, PRO scorings, and physical tests were presented as number (%), mean (SD), or median (range) values as appropriate for the different scales. A sample size of at least five observations per item and at least 100 observations has been suggested for determining structural validity [36,37,38]. We were able to include 87 patients from three complete cohorts between 1985 and 2016. The MSTS scorings had no missing data. For the concurrent measurements, one patient in Cohort one declined the physical tests (CST, 6MWT) at the hospital for logistical reasons, and one patient had internally missing data in QLQ-C30 physical functioning. Different statistical analyses were applied for different psychometric evaluations. Analyses were performed using IBM SPSS v.29.

Content validity is defined as the degree to which the content of a PRO is an adequate reflection of the construct to be measured [36, 39]. Since the MSTS intends to measure functioning [1], the six items and their response options should be a reflection of functioning. An international consensus for quality rating of PROs has recommended three overarching criteria for the evaluation of content validity: relevance, comprehensiveness, and comprehensibility [3]. Relevance includes an evaluation of items’ relevance for the construct and the population of interest. To evaluate items’ relevance for the construct of functioning, MSTS items were listed and, wherever possible, linked to codes of the International Classification of Functioning, Disability and Health (ICF) [40]. To evaluate MSTS items’ relevance for the population of interest, we linked MSTS items to activities identified in the PSFS. Comprehensiveness includes an evaluation of whether key concepts are included in an outcome measurement. Key concepts can be found in core outcome sets [41,42,43,44,45]. Since there is no specific core outcome set for patients who undergo bone sarcoma surgery, we chose to link MSTS items to key concepts defined in core outcome sets for cancer and primary total knee and hip joint replacement [43, 44]. Comprehensibility was evaluated by linking response options of the MSTS to the ICF and PSFS. Response options should match items to meet quality standards [3]. The linking processes were done independently by two of the authors (NS, LF) following recommendations for ICF-linking of outcome measures [15].

Data quality. Missing data of individual items, central tendency, distribution of item-scoring and floor- and ceiling effects were described. Floor- and ceiling effects were defined as present if > 15% of patients scored the lowest or highest possible score, respectively [46].
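The 15% rule for floor and ceiling effects can be expressed as a short check on a single item. This is a sketch only; the function name `floor_ceiling_effects` and the example scores are hypothetical.

```python
def floor_ceiling_effects(scores, lowest=0, highest=5, threshold=0.15):
    """Flag a floor or ceiling effect for one item.

    An effect is flagged when more than `threshold` (15%) of patients
    score the lowest or the highest possible score, respectively.
    """
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical scores for one MSTS item (0-5), not study data
pain_scores = [5, 5, 4, 3, 5, 2, 5, 4, 1, 5]
result = floor_ceiling_effects(pain_scores)
print(result)
```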

Internal consistency has been defined as the degree of interrelatedness amongst the individual items [39]. The analysis requires a unidimensional scale of at least three items [39]. If our analysis of structural validity suggested > 1 dimension, internal consistency was tested separately for each dimension [46]. Inter-item correlation, item-total correlation, and Cronbach’s alpha if item deleted were determined [47]. An inter-item correlation between 0.20 and 0.50 is recommended [36]. The item-total correlations assume that patients with a high total score also have high scores on all items [36]. If an item shows an item-total correlation of < 0.30 it does not help greatly in distinguishing between patients with high and low scores and can be removed. A Cronbach’s alpha if item deleted shows the value for remaining items that are still in the analysis. A high value indicates that the deleted item is redundant and a low value that there is room for more items under the same construct. A Cronbach’s alpha value between 0.70 and 0.90 is commonly considered acceptable interrelatedness [48].
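To make the three statistics concrete, the following sketch computes Cronbach's alpha, an item-total correlation and alpha if item deleted for a small set of items. It assumes the item-total correlation is the "corrected" form (each item against the sum of the remaining items) and uses sample variances; the analyses in this study were run in SPSS, so this is an illustration and not the authors' code. The function names and the example data are hypothetical.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """items: list of item columns, each with one score per patient."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]        # sum score per patient
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def item_diagnostics(items):
    """Corrected item-total correlation and alpha if item deleted."""
    diagnostics = []
    for i, col in enumerate(items):
        rest = items[:i] + items[i + 1:]              # all other items
        rest_total = [sum(row) for row in zip(*rest)]
        diagnostics.append({
            "item_total_r": pearson(col, rest_total),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return diagnostics


# Hypothetical scores for four items and six patients, not study data
factor1_items = [
    [5, 3, 4, 2, 5, 1],
    [4, 3, 5, 2, 4, 1],
    [5, 2, 4, 3, 5, 2],
    [4, 4, 3, 2, 5, 1],
]
print(round(cronbach_alpha(factor1_items), 2))
for row in item_diagnostics(factor1_items):
    print({name: round(value, 2) for name, value in row.items()})
```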

Structural validity has been defined as the degree to which the scores of a PRO are an adequate reflection of the dimensionality of the construct to be measured [39]. Initially, the data was tested for suitability for factor analysis. Inter-item correlation coefficients between 0.20 and 0.80, overall correlation of a Kaiser-Meyer-Olkin (KMO) of > 0.50 (ideally > 0.80) and a significant Bartlett’s test of sphericity have been recommended as prerequisites for factor analysis [36, 37]. We applied a principal component analysis (PCA). The number of latent factors extracted was based on the shape of a scree plot (elbow and levelling), Kaiser’s criterion (eigenvalue > 1) and the cumulative percentage of explained variance after each factor (ideally 70–80%) [37, 49, 50]. Oblique rotation (direct oblimin) method was applied since our factor correlation matrix showed a coefficient above the suggested cut-off 0.32 [37, 49, 50]. There is no consensus on threshold for sufficient loading of an item to a factor, but with a sample size of at least 100 patients, a loading of > 0.30 is usually considered significant [50]. Items that load substantially (> 0.3) on more than one factor are called complex variables and need to be taken into consideration [50].
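The rule for complex variables can be written as a simple check on a loading matrix. The sketch below assumes that the absolute value of a loading is compared with the cut-off; the function name `classify_loadings` and the loadings are hypothetical and are not the values in Table 6.

```python
def classify_loadings(loadings, cutoff=0.30):
    """loadings: dict mapping item name -> tuple of loadings per factor.

    Returns, for each item, the factors (numbered from 1) on which it
    loads above the cut-off and whether it is a complex variable,
    i.e. loads above the cut-off on more than one factor.
    """
    result = {}
    for item, values in loadings.items():
        salient = [i + 1 for i, v in enumerate(values) if abs(v) > cutoff]
        result[item] = {"factors": salient, "complex": len(salient) > 1}
    return result


# Hypothetical two-factor loadings, for illustration only
example_loadings = {
    "Item A": (0.85, 0.05),
    "Item B": (0.60, 0.35),
    "Item C": (0.10, 0.80),
}
classified = classify_loadings(example_loadings)
print(classified)
```

Passing a stricter value, for example `cutoff=0.50`, shows how the classification changes when a higher threshold is preferred for smaller samples.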

Construct validity is defined as the degree to which the score of an outcome measurement is consistent with hypotheses of expected relationships to other PROs [39]. High correlations are expected when measurements of the same construct and with the same mode of administration are compared (convergent). Conversely, lower correlations are expected when different constructs are compared (divergent). Previously published results of correlations between the MSTS and concurrent outcome measures were used as guidance when formulating predefined hypotheses [19]. MSTS sum scores were expected to have high correlations to scorings from TESS, PSFS and QLQ-C30 physical function, as they all measure functioning subjectively [7, 51]. MSTS sum score was expected to correlate at a moderate level with QLQ-C30 sum score, since it is a multidimensional measurement [52]. Concurrent measurements of more narrow constructs (e.g., pain, walk capacity, emotional function) were expected to have high correlations to single items of the MSTS but low correlations to MSTS sum score [53]. The research group formulated hypotheses of correlation prior to analyses. Cut-offs for high (≥ 0.60), moderate (> 0.30 to < 0.60) and low (≤ 0.30) correlation were applied [40]. For a positive rating of hypothesis testing, at least 75% of predefined hypotheses should be confirmed [46]. The Spearman’s rank correlation coefficient test was used.
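The hypothesis-testing procedure can be sketched in three steps: a Spearman rank correlation, a classification against the cut-offs, and the 75% rule. The sketch reads the "low" band as at most 0.30 and classifies on the absolute value of the coefficient; both are assumptions, as are all function names. The counts passed to `hypotheses_rating_positive` (6 confirmed out of 13) are those reported in the Results.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def correlation_level(rho):
    """Classify a coefficient as high, moderate or low (assumed on |rho|)."""
    size = abs(rho)
    if size >= 0.60:
        return "high"
    if size > 0.30:
        return "moderate"
    return "low"


def hypotheses_rating_positive(confirmed, total, required=0.75):
    """True when at least 75% of the predefined hypotheses are confirmed."""
    return confirmed / total >= required


print(correlation_level(0.45))
print(hypotheses_rating_positive(6, 13))
```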

Results

Content validity

Semi-structured interview. The patients (n = 30) identified a total of 94 important activities which they found impossible or difficult to perform. These single activities were categorised into 12 meaningful concepts (Fig. 1). The three most frequently identified activities were Walking (n = 14), Sports (n = 19) and Running (n = 20), with median (min–max) difficulty levels of 3.5 (0–5) points, 1 (0–7) point, and 0 (0–6) points, respectively.

Fig. 1

Number of activities (dark grey bar) the patients found important and were unable to perform or had difficulties with because of the condition. Median score (light grey bar) of the level of difficulty ranging from 0–10 points (0 = unable to perform the activity, 10 = able to perform activity at same level as before surgery). ***Squatting: This includes the isometric position in a squat and the dynamic squat. **Walking: This is a summary of walking in various speeds and distances in diverse terrain. *Sports: This includes various sports activities such as soccer, swimming, golf, tennis, badminton, dancing, water polo and skiing

Items’ relevance for the construct of functioning. All MSTS-items, except for Emotional acceptance, could be linked to ICF-codes (Table 3). The item Function was considered a wide concept and could be linked to any ICF-code under the domains (b) and (d).

Items’ relevance for the included sample. Two of six MSTS-items could be linked to PSFS (Table 3). The MSTS-item Function could be linked to any activity identified in the PSFS.

Key concepts. The MSTS-items Pain and Function were linked to the separate domains Pain and Function defined in both core outcome sets [43, 44]. The domain ‘patient satisfaction’, in the core outcome set for joint replacement, was partly linked to the MSTS-item Emotional acceptance, since one response option included the word ‘satisfied’.

Comprehensibility. The response options for the items Pain, Function and Walking ability changed content throughout the scale (Table 3). The response options ‘disabling’ and ‘disability’ could be linked to several ICF-codes and the response option ‘recreational’ could be linked to several activities identified in the PSFS (Table 3).

Table 3 Content validity. Linkage of items and response options of the MSTS to ICF and PSFS

Data quality

Item median values ranged from 3 to 5 and all response options were used (Table 4). None of the items showed floor effects, but all items, except for Function, showed ceiling effects (Table 4). There were no internal missing values.

Table 4 Data quality of the Musculoskeletal Tumor Society Score

Internal consistency

Three inter-item correlation coefficients exceeded 0.50 (Supports and Walking ability, r = 0.60; Supports and Gait, r = 0.55; Walking ability and Gait, r = 0.55) (Table 5). As our PCA supported a two-factor solution rather than unidimensionality, the item-total correlations and Cronbach’s alpha were tested for Factor 1 only. The item Function showed the lowest item-total correlation (r = 0.45) but did not fall below the limit of 0.30 (Table 5). For the items Supports and Walking ability, Cronbach’s alpha if item deleted fell below the accepted range of 0.70 to 0.90 (Table 5).

Table 5 Inter-item correlation of all six MSTS items (n = 87)

Structural validity

The inter-item correlation between Pain and Gait was low (r = 0.19), just below the recommended minimum of 0.20. Since this study was not a data reduction exercise and the two items had acceptable correlations to the remaining items, they were retained. The KMO was 0.79 and Bartlett’s test was significant (p < 0.001), suggesting that the data were adequate for factor analysis.

The scree plot illustrated a steep slope for Factor 1 (eigenvalue 2.904), intermediate slope for Factor 2 (eigenvalue 1.017) and almost flat slope for Factor 3 (eigenvalue 0.685) (Fig. 2). The cumulative percentage of total variance explained was 48.4% for Factor 1 and 65.4% for Factor 1 and 2. Based on eigenvalues, cumulative percent and the scree plot, a two-factor solution was suggested for the analysis of factor-loading pattern.
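The reported percentages can be reproduced from the eigenvalues under one assumption: that the PCA was run on the correlation matrix of the six items, so that the eigenvalues sum to six. The sketch below applies Kaiser's criterion and computes the cumulative explained variance from the three eigenvalues given above; the function names are hypothetical. With the rounded eigenvalues the second figure comes out at about 65.35%, within rounding of the reported 65.4%.

```python
def kaiser_retained(eigenvalues):
    """Number of factors with an eigenvalue above 1 (Kaiser's criterion)."""
    return sum(1 for ev in eigenvalues if ev > 1)


def cumulative_explained_variance(eigenvalues, n_items):
    """Cumulative percentage of total variance after each factor.

    Assumes a PCA on the correlation matrix, where the total variance
    equals the number of items.
    """
    cumulative, running = [], 0.0
    for ev in eigenvalues:
        running += ev
        cumulative.append(running / n_items * 100)
    return cumulative


# Eigenvalues of Factors 1-3 as reported in the text
reported = [2.904, 1.017, 0.685]
n_factors = kaiser_retained(reported)
cumulative = cumulative_explained_variance(reported, n_items=6)
print(n_factors)
print([round(value, 1) for value in cumulative])
```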

Fig. 2

Scree plot of the principal component analysis

The factor loading pattern for a two-factor solution showed high loadings for Supports, Gait, Walking ability and Function to Factor 1, but not for Pain and Emotional acceptance (Table 6). The items Walking ability and Function loaded > 0.30 on two factors, thus identified as complex variables.

Table 6 MSTS factor loading pattern by principal components analysis with loadings sorted by size (n = 87)

Construct validity

Six out of 13 (46%) predefined hypotheses were confirmed (Table 7), which is below the 75% required for a positive rating. The TESS, QLQ-C30 sum score, QLQ-C30 physical functioning sub score and pain ratings showed high correlations to the MSTS (Table 7). The QLQ-C30 sum score was not expected to have a high correlation to the MSTS, since it measures QoL and not functioning only. The MSTS showed a low correlation to the PSFS, which was unexpected since both should reflect the construct of functioning. Also, the MSTS item Walking ability had an unexpectedly low correlation to walking capacity (6MWT).

Table 7 Hypotheses testing (n = 30)

Discussion

The MSTS-LE showed insufficient content validity. The internal consistency and hypothesis testing were below acceptable levels. We found ceiling effects in five of six items and, in contrast to other studies, our analyses supported a two-factor solution.

The evaluation of content validity showed that there were concerns with the three quality criteria: relevance, comprehensiveness, and comprehensibility. The item Emotional acceptance was not relevant to the construct of functioning, and the item Function was relevant to the construct but its content was too broad and unspecific. Pain and Function should pertain to separate constructs. Three items did not have matching response options and many patient-important activities identified in the interview were not represented in the MSTS. The MSTS has been criticised for not involving patients’ perception of function in the development of items and response options [19]. We used a semi-structured interview to evaluate the MSTS-items’ relevance to the population of interest. Our results showed that the patients reported many more functions and activities that were important to them than those in the MSTS. For example, recreational activities such as gardening, bicycling, hiking, and different sports activities were considered important, but not specifically named in the MSTS. For the measurement of functioning, an alternative to the MSTS could be the TESS [54]. The items of the TESS were developed based on input from patients with bone and soft-tissue sarcoma [19]. Comparing the TESS to our interview, the TESS includes kneeling, walking, gardening, and recreational activities also found in our interviews, suggesting that TESS has a more relevant content than the MSTS for this patient group. The evaluation of the items’ relevance to the construct of functioning showed that the item Emotional acceptance could not be linked to the ICF. This suggests that Emotional acceptance does not reflect functioning and should not be part of PROs with functioning as the construct of interest. Further, the linking process of the item Function was of concern, as it could be linked to many ICF-codes, reflecting several functions, resulting in a very broad and unspecific content.
This was supported by a relatively low item-total correlation for the item Function, suggesting that the content is unspecific and there is scope for more items under the same construct [48]. An unspecific content will make interpretation difficult. The items Pain and Function were linked to important key concepts, but were defined as two separate domains in the core outcome sets suggesting that they reflect different constructs [43, 44]. When separate constructs are measured, they should either pertain to different PROs, or they should be treated separately in multidimensional scales [36]. Based on the unspecific content of the item Function and the potential mix of different constructs within the MSTS-LE, a sum score should be interpreted with caution. Moreover, the evaluation of comprehensibility of response options showed similar results as Lee et al., with concerns about the formulations for Pain, Function and Walking ability [4]. The item Pain relates to the intake of analgesics rather than the perception of pain and the items Function and Walking ability change content throughout the scale. One main requirement in formulating items and their response options is that they should be simple, easy to understand and the response options should match their items [3]. Since the response options for three items of the MSTS-LE change in content, the response options are difficult to interpret and do not match the items.

The factor analysis in our study supported a two-factor solution. In contrast to our results, two earlier studies considered the MSTS to be unidimensional, i.e. consisting of one factor only [5, 6]. The scree plots in the two earlier studies showed elbow shapes located at the second factor, similar to our study, but their eigenvalues for the second factors were just below 1, whereas ours was just above 1. Determining the number of factors, and thereby the dimensionality of a measurement, can be difficult when scree plots do not take a characteristic sharp elbow shape and eigenvalues are close to the cut-off value 1. One of the earlier studies discussed the possibility of a two-factor solution but decided to let the eigenvalue < 1 for a second factor determine the unidimensionality of the MSTS [5]. Values close to cut-offs can lead to different conclusions in different studies, which in this case indicates that the MSTS is not sufficiently robust between samples. Further, looking at the factor-loading patterns, it is doubtful whether the MSTS can be supported as a unidimensional measurement of functioning. Our study clearly showed that the items Pain and Emotional acceptance had low loadings to Factor 1 and high loadings to Factor 2, indicating that Pain and Emotional acceptance are explained by another underlying construct than functioning. This is supported by the three earlier studies, which showed lower loadings for Pain and Emotional acceptance compared to the other items of the MSTS, although they never tested a two-factor solution or investigated whether Pain and Emotional acceptance had a better fit to a second factor [5, 6, 9]. Since Pain and Emotional acceptance can only vaguely be explained by the underlying construct of functioning, they should be treated as a separate factor.
The MSTS-LE should therefore not be considered a unidimensional measurement of functioning, but rather a multidimensional measurement where the dimensions should be treated separately with separate subscores rather than a sum score, as is current practice.

One limitation in our study was sample size. It is recommended that at least 100 patients are included when performing factor analyses [36,37,38]. With the data available (n = 87) one could consider increasing the limit for an item to contribute sufficiently to a factor from > 0.30 to > 0.50 [36]. By doing so, the item Function would not load sufficiently to Factor 1. This leaves the item Function a complex variable only, not pertaining to any of the two factors, which complicates the interpretation of the MSTS even further. Another limitation is time from surgery to assessment point. In all three cohorts time from surgery varied widely and for most included patients many years had elapsed. Time from surgery can affect which patients could be included from the complete cohorts. Because around 60–80% of the patients in the three cohorts were alive at inclusion [11, 12, 55], the cohorts could comprise patients with a better outcome of physical function than the background population. Including a subgroup with a better function from the total population has presumably biased the results to better scorings of the MSTS and can possibly explain our high ceiling effects.

Conclusions

The MSTS showed insufficient content validity and when asking patients, other functions than those included in the MSTS were of importance. Our findings do not support the MSTS as a unidimensional measurement of functioning, but a two-factor solution. Thus, MSTS sum scores should be interpreted with caution. We suggest that alternative outcomes, such as the TESS and objective measurements, are considered for the evaluation of functioning in clinical practice and future research.