Introduction

Over the last decades, the focus of clinical research has shifted from conventional survival and disease outcomes to patient experience and patient-reported outcomes (PROs) [1]. A PRO is any report coming directly from a patient, without interpretation by a physician or others, describing the patient’s current health condition [2]. PROs as a primary or secondary outcome can provide a more holistic and comprehensive assessment when investigating the harms and benefits of an intervention [1, 3]. PROs are measured using patient-reported outcome measures (PROMs), which are the instruments or tools used to evaluate a patient’s health status from the patient’s perspective [1, 2].

Orthopedic injuries of the upper extremities are amongst the most common injuries in the pediatric population [4, 5]. As these ailments can be associated with consequential complications and functional disabilities, adequately evaluating patients during follow-up is essential [6]. In recent years, the previously described transition in outcome-focus has also made its way into the rapidly expanding research field of pediatric orthopedics. This shift is reflected by a significant increase in the utilization of PROMs in pediatric orthopedic studies [7,8,9]. However, an increase in PROM use does not necessarily translate to improved outcome assessment. The misuse of PROMs may prompt researchers to interpret results incorrectly and potentially make misleading or even harmful recommendations for clinical practice [10]. Thus, selecting the appropriate instrument for the appropriate study population and purpose is essential for the further development of PRO-based research [11].

Systematic reviews of PROMs play an important role in guiding PROM selection [12]. By providing an evidence-based overview of available PROMs and presenting recommendations for their use, reviews of PROMs enable clinicians and researchers to find the most suitable instrument for a given purpose [13]. However, to our knowledge, previously published reviews of pediatric orthopedic PROMs either cater exclusively to a niche subgroup of patients or focus on frequency of use, and therefore do not aid in PROM selection [7,8,9, 14].

As a result, the inadequate application and selection of PROMs is still common practice in pediatric orthopedics. In a recent publication, Arguelles et al. [9] demonstrated that researchers are faced with major challenges when selecting appropriate PROMs. Approximately three quarters of pediatric orthopedic studies reporting PROMs used at least one PROM that was inadequately validated for the population of interest [9]. The improper use of PROMs in pediatric orthopedic research reveals an urgent need for guidance on PROM selection and application, so that future results can be interpreted adequately and PROMs can be implemented in daily practice with true scientific justification.

Thus, we conducted a systematic review of pediatric orthopedic PROMs validated for children with impairment of the upper extremity. The primary goal of this review was to provide a comprehensive overview of self- and/or proxy-completed questionnaires targeted at children with impairment of the upper limb, and to critically appraise and summarize the quality of their measurement properties. The secondary goal of this review was to provide evidence-based recommendations for PROM selection in pediatric orthopedic research and clinical practice.

Methods and materials

Design

In conducting this systematic review, the updated COnsensus-based Standards for selection of health Measurement INstruments (COSMIN) methodology for systematic reviews of PROMs was used [15,16,17]. This systematic review adhered to the newly revised Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) statement [18].

Pre-registration

This study was pre-registered in PROSPERO (PROSPERO registration number: CRD42021254791).

Search strategy

To identify relevant studies, MEDLINE was systematically searched using PubMed, and EMBASE was systematically searched through the Embase search engine. The timeframe was defined as 1st of January 2000 to 8th of February 2021. The search was restricted to English and/or Dutch articles only by using language filters.

A comprehensive search strategy was constructed in collaboration with a clinical librarian to guarantee a thorough approach. The search strings for each database can be found in full detail in Additional file 1: Appendix 1. The search was initially constructed for PubMed and subsequently adapted to fit the Embase search engine. The search consisted of four distinct elements: (A) search terms describing the population of interest with a validated pediatric study search filter by Leclercq et al. [19], (B) the comprehensive PROM-filter developed by the PROM Group of the University of Oxford, and two validated filters by Terwee et al. [20]: (C) a highly sensitive measurement property filter and (D) an exclusion filter.

Eligibility criteria

Articles were considered eligible for inclusion if a full-text original version of the article was available and if the article reported on studies describing the development and/or the evaluation of one or more measurement properties of a generic and/or disease-specific patient-reported and/or proxy-reported questionnaire of any language, in a population consisting of children (0–18 years old) with an orthopedic diagnosis in the upper extremity region. Exclusion criteria consisted of any study design in which the patient-reported and/or parent-proxy-reported questionnaire was only used as an outcome measurement instrument (e.g., randomized controlled trials, longitudinal studies) and/or in which one or more questionnaires were evaluated that aimed to assess the use of prostheses by children (0–18 years old).

Study selection

First, all eligible studies were selected by screening the title and abstract. Thereafter, all selected papers were screened based on full text. During both phases two reviewers (JPR and TFF) independently identified eligible studies according to the predefined eligibility criteria and afterwards discussed the results. Disagreements were resolved by a third reviewer (IN or CJA). The references of the articles selected for full-text review were thoroughly screened to identify additional citations.

Data extraction and appraisal

The studies on measurement properties included in this review were assessed in accordance with the extensive and recently improved COSMIN methodology for qualitatively evaluating studies on PROMs [15]. Detailed information on the COSMIN taxonomy, the stepwise approach of the COSMIN methodology and the COSMIN checklists applied in this review can be found in the corresponding publications by Mokkink et al. [16, 21], Prinsen et al. [15], and Terwee et al. [17].

Evaluation of study methodological quality

The COSMIN Risk of Bias checklist [16] was used to rate studies evaluating validity (structural validity, hypotheses testing for construct validity and cross-cultural validity), reliability (internal consistency, reliability and measurement error) and/or responsiveness of a PROM. This modular tool consists of ‘boxes’ containing standards for rating the quality of a study on a measurement property on a four-point rating scale: ‘very good’, ‘adequate’, ‘doubtful’ or ‘inadequate’ [16]. “The worst score counts” principle was then applied to come to an overall methodological quality rating for each individual study on a measurement property [15].
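The “worst score counts” principle described above can be illustrated with a minimal sketch. The function name and the example ratings are hypothetical and serve only to show the rule; they are not taken from the COSMIN materials or from the data of this review.

```python
# Four-point scale from the text, ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def worst_score_counts(standard_ratings):
    """Return the lowest rating given to any standard in one COSMIN box.

    `standard_ratings` is a list of strings taken from RATING_ORDER.
    An unknown rating raises ValueError (via list.index).
    """
    if not standard_ratings:
        raise ValueError("at least one rating is required")
    # The rating with the highest index in RATING_ORDER is the worst one.
    return max(standard_ratings, key=RATING_ORDER.index)


# Hypothetical ratings for the standards of a single box.
example_box = ["very good", "adequate", "doubtful", "very good"]
print(worst_score_counts(example_box))
```

In this hypothetical box, a single ‘doubtful’ standard determines the overall rating, even though the other standards were rated higher.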

Studies on content validity (content validity and PROM development) were evaluated using the separate COSMIN methodology for evaluating content validity [17]. The quality of these studies was rated following the standards included in the ‘boxes’ of the COSMIN content validity checklist [17]. The worst score counts principle was then used to come to an overall quality rating for the studies [17].

Data extraction

Following the methodological quality assessment, data on the characteristics of the included study populations (e.g., sample size, age range, diagnoses), characteristics of the studied PROMs and results of each study on a measurement property were extracted using tables provided by the COSMIN initiative [15].

Assessment of psychometric properties

The result of each study on a measurement property was rated against the updated criteria for good measurement properties [15]. The individual results were rated as ‘sufficient’ (+) when the results were in line with the COSMIN criteria, and ‘insufficient’ (–) if the results did not meet the criteria. The result of a study on a measurement property was considered ‘indeterminate’ (?) when essential information was missing, no hypotheses were defined prior to starting the study or relevant analyses were not performed [15].

Evidence synthesis

Finally, a qualitative synthesis of the evidence per measurement property, per PROM was constructed to come to an overall conclusion of PROM quality. If consistent (i.e., ≥ 75% of the results are either rated ‘sufficient’ or ‘insufficient’), the results of the individual studies on measurement properties were qualitatively summarized and again rated against the criteria for good measurement properties. If inconsistent, an explanation for this inconsistency was sought. When the inconsistency remained unexplained, the overall result was rated as ‘inconsistent’ (±). An ‘indeterminate’ (?) rating was given when the individual results were all rated as ‘indeterminate’ [15].
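The consistency rule above can be sketched as a small function. This is a simplified illustration under stated assumptions: the function name is hypothetical, the 75% share is computed over all individual results (including ‘indeterminate’ ones), and the step of seeking an explanation for inconsistent results is not modeled.

```python
def pooled_rating(results, threshold=0.75):
    """Summarize individual ratings for one measurement property of one PROM.

    `results` is a list of '+', '-' or '?' symbols, one per study result.
    Returns '+', '-', '?' or '±' (inconsistent).
    """
    if not results:
        raise ValueError("at least one result is required")
    # All individual results indeterminate: overall rating is indeterminate.
    if all(r == "?" for r in results):
        return "?"
    # Consistent if at least `threshold` of the results share one rating.
    # Assumption: the denominator includes indeterminate results.
    for symbol in ("+", "-"):
        if results.count(symbol) / len(results) >= threshold:
            return symbol
    # Otherwise the (unexplained) inconsistency is reported as such.
    return "±"


# Hypothetical sets of individual study ratings.
print(pooled_rating(["+", "+", "+", "-"]))
print(pooled_rating(["+", "-"]))
```

With three of four hypothetical results rated ‘sufficient’, the 75% threshold is just met; an even split falls below it and is reported as inconsistent.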

After qualitatively synthesizing and rating the overall results per measurement property, per PROM, the quality of this evidence was graded. In accordance with COSMIN guidelines, a modified Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach was used for grading the evidence [15]. The summarized results were graded as ‘high’, ‘moderate’, ‘low’ or ‘very low’, based on three factors: risk of bias (based on methodological quality), inconsistency and imprecision (i.e., sample size). The fourth factor, ‘indirectness’, was not taken into consideration in evaluating evidence quality, as this review only included studies with a predefined and fixed patient population. If the summarized result was rated ‘inconsistent’ or ‘indeterminate’, the quality of the evidence could not be graded [15].
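The mechanics of the grading step can be sketched as follows. This is an illustration only: it assumes that grading starts at ‘high’ and is lowered by a number of levels per factor, and it leaves the number of levels to the reviewer as hypothetical input parameters. The criteria for how many levels to downgrade are not reproduced here.

```python
# Evidence grades from the text, ordered from best to worst.
GRADES = ["high", "moderate", "low", "very low"]


def grade_evidence(pooled, risk_of_bias=0, inconsistency=0, imprecision=0):
    """Return the evidence grade for one summarized result, or None if ungradable.

    `pooled` is the overall rating ('+', '-', '±' or '?').
    The three keyword arguments are the number of levels to downgrade for
    each factor (hypothetical inputs decided by the reviewer).
    """
    # Inconsistent or indeterminate summarized results are not graded.
    if pooled in ("±", "?"):
        return None
    levels = (risk_of_bias, inconsistency, imprecision)
    if any(level < 0 for level in levels):
        raise ValueError("downgrade levels must be non-negative")
    # Assumption: start at 'high' and never go below 'very low'.
    return GRADES[min(sum(levels), len(GRADES) - 1)]


# Hypothetical example: one level down for risk of bias, one for imprecision.
print(grade_evidence("+", risk_of_bias=1, imprecision=1))
```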

The above-mentioned steps of the COSMIN evaluation were performed by two reviewers (JPR and TFF) independently. If consensus could not be reached during any of the evaluation procedures, an additional reviewer (IN and/or CJA) was consulted. To evaluate inter-rater agreement, a percentage agreement was calculated by dividing the number of ratings on which the reviewers agreed by the total number of ratings given by the two reviewers. In accordance with the criterion for assessing inter-rater agreement proposed by Mokkink et al. [22], the inter-rater agreement of the reviewers was considered appropriate when reviewers reached > 80% agreement.
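The percentage agreement calculation can be sketched as follows. The function name and the example ratings are hypothetical, and the sketch assumes that both reviewers rated the same items in the same order, so that the denominator is the number of rated items.

```python
def percentage_agreement(ratings_a, ratings_b):
    """Percentage of items on which two reviewers gave the same rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both reviewers must rate the same items")
    if not ratings_a:
        raise ValueError("at least one rating is required")
    # Count the items for which the two ratings are identical.
    agreed = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * agreed / len(ratings_a)


# Hypothetical ratings for ten items; the reviewers differ on the last one.
reviewer_1 = ["+", "+", "-", "?", "+", "-", "+", "?", "+", "+"]
reviewer_2 = ["+", "+", "-", "?", "+", "-", "+", "?", "+", "-"]

agreement = percentage_agreement(reviewer_1, reviewer_2)
print(agreement, agreement > 80)
```

In this hypothetical example the reviewers agree on nine of ten ratings, which exceeds the > 80% criterion.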

Results

The literature search initially identified 8179 articles. After duplicates were removed, 6423 articles remained. Of these 6423 references, 113 were deemed eligible for inclusion after screening the titles and abstracts. As a result of hand-searching the bibliographies of these eligible articles, 27 potentially relevant citations were identified. The full-text assessment of the remaining 140 articles resulted in the inclusion of 32 original reports. The PRISMA flow diagram describing the selection process is shown in Fig. 1.

Fig. 1 PRISMA flowchart

The inter-rater agreement (percentage agreement) was calculated to be 94% and therefore considered appropriate.

General characteristics of included studies and instruments

Table 1 details the key characteristics of the articles included. In total, 32 articles reported evidence on 97 measurement properties of 22 PROMs (i.e., 15 original English PROMs and 7 cultural adaptations). The measurement property most frequently evaluated was construct validity, with 25 articles reporting on at least one construct validity assessment (e.g., hypotheses testing for construct validity). In contrast, responsiveness was evaluated in only four articles [23,24,25,26].

Table 1 Characteristics of the included studies

In agreement with COSMIN methodology, each version of a questionnaire was considered a separate PROM (i.e., cross-cultural adapted versions or revised versions) [15]. The characteristics of the instruments included in this review are shown in Table 2. English versions of PROMs were assessed most frequently. Studies performing cross-cultural adaptation and subsequent validation were scarce. Only seven culturally adapted PROM versions were evaluated in validation studies [26,27,28,29,30,31,32].

Table 2 Characteristics of the included PROMs

Synthesized evidence

The results of the methodological quality assessment and the ratings of the individual studies against the criteria for good measurement properties are presented in Table 3. Table 4 details, for each PROM, the qualitatively summarized results per measurement property, their overall quality rating (criteria for good measurement properties) and evidence quality grade (modified GRADE approach). The detailed results of each study on a measurement property of a PROM included in this review can be found in Additional file 1: Appendix 2.

Table 3 Methodological quality and ratings of measurement properties of the included PROMs
Table 4 Synthesized evidence

Content validity

No studies evaluating the content validity of a PROM were considered eligible for inclusion in this review. Therefore, only the methodological quality of the included PROM development studies was determined. As none of the included development studies reported a pilot study assessing the comprehensibility and comprehensiveness of the instrument, the overall methodological quality of the four PROM development studies was rated as ‘inadequate’ or ‘doubtful’ [33,34,35,36].

Structural validity

Structural validity was evaluated for eleven of the included PROMs [27,28,29,30,31,32,33,34,35,36,37,38,39]. Five studies assessed the structural validity of a cultural adaptation of the ABILHAND-Kids questionnaire [27,28,29,30,31]. Only one PROM demonstrated evidence for sufficient structural validity: the Persian adaptation of the ABILHAND-Kids questionnaire [31]. For the other PROMs, the results of the structural validity analyses did not meet the COSMIN criteria for good measurement properties (mostly regarding the range of goodness-of-fit statistics) [27, 28, 30, 33, 36], the authors failed to report on important aspects of the IRT/Rasch analyses [29, 35, 38] and/or the subscales were only separately evaluated, which does not provide evidence for structural validity of the instrument as a whole [34, 37,38,39].

Internal consistency

For internal consistency analyses to be interpreted correctly, an instrument should at least show low-quality evidence for sufficient structural validity [15]. Therefore, only the internal consistency analysis of the Persian version of the ABILHAND-Kids questionnaire was rated [31]. For the other PROMs, the results of the internal consistency analyses were reported and an ‘indeterminate’ rating was given.

Other measurement properties

Thirteen of the included PROMs demonstrated evidence for sufficient test–retest reliability [26, 28,29,30,31,32, 37, 39,40,41,42,43,44]. Only the Dutch version of the Pediatric Outcomes Data Collection Instrument (PODCI) demonstrated evidence for insufficient reliability, with ICC values ranging from 0.022 to 0.972 for the different subscales [26].

The results of analyses on measurement error were all rated as ‘indeterminate’, since information on minimal important change (MIC) had not yet been published for the PROMs included in this review.

Discussion

This study is the first systematic review to provide a comprehensive overview of evidence on the psychometric properties of PROMs used for evaluating children with impairment of the upper extremity. Twenty-two PROMs, measuring various constructs, were included and evaluated using the updated version of the extensive COSMIN methodology to ensure a high-quality assessment. Additionally, this study provides an opportunity to formulate evidence-based recommendations for PROM-selection and to increase awareness of proper PROM utilization in clinical practice and research.

When recommendations for PROM-selection are based exclusively on the quality of measurement properties, the current lack of evidence on PROM-quality means that all 22 pediatric orthopedic PROMs included in this review have the potential to be recommended for use, but that further research is required to assess their quality. Evidence on content validity and internal consistency of a PROM is fundamental to formulating a transparent, evidence-based recommendation [15]. However, content validity, which can be considered the most important psychometric property of a PROM [21], was not evaluated for any of the included PROMs. Internal consistency was evaluated for 16 of the 22 pediatric orthopedic PROMs. Unfortunately, only one study provided sufficient evidence to rate the internal consistency of the questionnaire (ABILHAND-Kids: Persian version). All other studies provided insufficient evidence on structural validity, which is essential for correctly interpreting the results of internal consistency analyses [15]. Furthermore, the psychometric properties of only four of the questionnaires were evaluated in more than one validation study (ABILHAND-Kids (original version), PODCI, Children's Hand-use Experience Questionnaire and Hand-Use-at-Home questionnaire). Even though these instruments were evaluated most frequently, the quality of two thirds of their measurement properties was rated as ‘indeterminate’ or ‘inconsistent’, with the PODCI solely demonstrating inconsistent evidence. This trend was also observed for the other PROMs included in this review. Moreover, the overall quality of the included validation studies varied considerably, mainly due to insufficient sample size and/or poor methodological quality.

When exploring additional means to provide clinicians and researchers with a basis to guide their PROM-selection, formulating recommendations based on feasibility aspects of PROMs constitutes a valuable alternative approach. The term ‘feasibility’ refers to the ease with which the instrument is applied in its intended context of use and includes PROM characteristics such as completion time and length of the questionnaire [15]. Although feasibility is not considered a measurement property as it does not pertain to the quality of a PROM, feasibility aspects profoundly influence the practical utility of a PROM, especially factors influencing response rate and patient compliance such as questionnaire length [45]. Computer-adaptive testing (CAT) is a data collection method that uses item-response theory to minimize questionnaire length and completion time, thereby optimizing response rates [45]. Whereas the majority of the included PROMs use traditional data collection methods, one PROM was assessed using computer-adaptive testing: the PROMIS – Upper Extremity item bank CAT. Therefore, based on the evidence currently available, the PROMIS – Upper Extremity item bank CAT can be considered the most appropriate PROM for evaluating upper extremity function in children, when adopting this feasibility-driven approach to guiding PROM-selection.

The overall methodological quality of the four PROM development studies included in this review was rated as ‘inadequate’ or ‘doubtful’ [33,34,35,36]. For each of the instruments, the developmental process lacked a cognitive interview study or other pilot test evaluating their comprehensibility and comprehensiveness. During the development of PROMs in pediatric research, researchers must take developmental influences, such as age-dependent disease-awareness and cognitive–linguistic ability, into careful consideration [46, 47]. These considerations, which are unique to pediatric qualitative research, make developing pediatric PROMs of high methodological quality a strenuous and time-consuming practice. However, to ensure the questionnaire matches the perspective and needs of the patients it has been designed for, it is imperative to adequately evaluate aspects such as comprehensibility, especially for pediatric PROMs. To guarantee that future pediatric orthopedic PROMs adequately reflect the patients’ perspective on their health condition, it is vital to incorporate pilot studies assessing relevance, comprehensiveness, and comprehensibility into the development of these instruments.

Whilst conducting this systematic review, we followed the extensive and newly updated COSMIN methodology for systematic reviews of PROMs, which can be considered one of the strengths of this study. Using the COSMIN checklists sometimes requires a subjective judgement by the reviewer (e.g., in determining which measurement properties were assessed when the terms used in the article did not match the COSMIN taxonomy). This potential source of bias was addressed by two reviewers independently extracting and evaluating data and by building consensus, further strengthening the approach utilized in this review.

This review has some limitations. Even though using the COSMIN methodology guarantees a standardized and thorough approach for evaluating the included studies on measurement properties, “the worst score counts” principle applied in rating these studies can be considered reductive. As the worst rating in a COSMIN box will determine the overall result of the quality assessment, the absence of reporting on a particular evaluation step or statistical method can result in the study being rated as ‘doubtful’ or even ‘inadequate’. Consequently, a cogent argument can be made that using this principle results in the undervaluation of the already small amount of evidence available on pediatric orthopedic PROMs.

In an effort to provide a comprehensive overview of the pediatric orthopedic PROMs available to clinicians and researchers, we purposefully used broad inclusion criteria with respect to study population (e.g., any orthopedic condition in the upper extremity region) and type of instrument (e.g., self-completed as well as proxy-completed questionnaires). Subdividing the population of interest based on affected limb, body region or disease type was limited by the paucity of evidence available on pediatric orthopedic PROMs. In addressing the challenges these broad inclusion criteria posed to the feasibility of our review, some concessions had to be made regarding the scope of our search. Consequently, only MEDLINE and EMBASE were searched, omitting potentially relevant databases like CINAHL, and the timeframe was condensed, possibly preventing the inclusion of additional relevant articles.

Conclusions

In conclusion, a comprehensive overview was given of PROMs used in pediatric orthopedic research of the upper extremity. None of the PROMs included in this review demonstrated sufficient evidence on their measurement properties to strongly recommend the use of any of these instruments in children with impairment of the upper extremity. The absence of studies on content validity for any of the included PROMs is especially worrisome, as this implies it is currently unknown if the questionnaires used in pediatric orthopedic research and clinical practice adequately reflect the construct they intend to measure. When an alternative, feasibility-driven approach to guiding PROM-selection is adopted, the PROMIS – Upper Extremity CAT can cautiously be considered the most appropriate PROM for measuring upper extremity function in children with impairment of the upper limb. The lack of evidence on PROM-quality uncovers a need for high-quality development and validation studies, and especially studies on content validity, for PROMs utilized in pediatric orthopedics.