Introduction

Duchenne muscular dystrophy (DMD) is a rare X-linked genetic neuromuscular condition with an estimated prevalence of 19.8 per 100,000 live male births [1]. People with DMD experience progressive muscle degeneration and weakness, with childhood symptom onset as early as age two. The condition manifests in increasing difficulties with ambulation and motor functioning, with eventual cardiovascular and respiratory problems [2]. Improved treatments and standards of care have increased the life expectancy of people with DMD, with those on ventilator support living for a median 31.8 years [3].

As a pervasive and life-limiting condition, DMD has been observed to impact the health-related quality of life (QoL) of people with the condition in multiple ways. Key areas of impact include independence, relationships and social participation, and psychological wellbeing [4]. As well as the impact on people living with the condition, DMD has a notable influence on the QoL of informal caregivers [4]. Duchenne requires substantial caregiving input, which increases over time as functional ability deteriorates [5]. Caring for someone with DMD involves a vast range of caregiving activities and over time comes to include a host of emotional, social, and physical support, including assistance with day-to-day living (e.g., dressing, eating, cleaning, toileting, transfers and mobility) [6]. As primary carers tend to be family members (i.e., parents), any potential impact on their QoL is heightened as they have to learn to cope with a DMD diagnosis, its progressive and pervasive nature, and the knowledge of what that means for them and their loved ones.

Documented effects on QoL of caring for someone with DMD include problems with sleep, psychological wellbeing, relationships, family resources, physical burden, and impact on the wider family [4]. However, other impacts are likely to exist that have not been well captured in existing data. Carer health-related QoL is typically measured using self-report questionnaires. As well as generic QoL instruments that are used in carers and non-carers alike, specific questionnaires have been developed with an aim to assess carer QoL in particular, including the Care Related Quality of Life (CarerQol) instrument and the Caregiver Strain Index (CSI), which have both been used in DMD research [7]. However, at present, there is little evidence to justify the use of any particular generic and/or specific questionnaire for assessing QoL in DMD carers. Reviews on the burden of caregiving exist [4, 6], but none that critically evaluate the reliability and validity of the self-report instruments that have been used to measure it.

As competing instruments exist to assess carer QoL in DMD, without further data and an evaluation of the evidence on the psychometric properties of these measures, it is difficult to ascertain which instruments are most suitable for use in this context. Given the degree and breadth of the impacts of DMD on daily life for people living with the condition and their informal caregivers, it is not known whether available instruments are sufficiently reliable and valid for assessing carer QoL in DMD. In other progressive conditions, such as neurodegenerative diseases, condition-specific carer questionnaires have been advocated for [8]. Getting the tool right when assessing QoL in DMD carers is important, for understanding the scale of the impact on carers themselves and for accurately ascertaining the benefits of new health technologies.

Health technology assessment (HTA) agencies, such as the National Institute for Health and Care Excellence (NICE), promote the inclusion of “all direct health effects for patients or, when relevant carers” in their Guide to the Methods of Technology Appraisal [9]. This includes carer utility values (or resultant quality-adjusted life years [QALYs]), which are often included in the economic evaluation for health technology appraisals, including recently for Ataluren in treating DMD [10]. For one to have confidence that the evidence on carer QoL is accurate and reliable, it is important that the correct instrument to measure QoL (and thus the generated QALYs) is used, and this judgement should be based on supportive psychometric evidence. Such evidence must be collated in order to appropriately justify the use of a particular questionnaire and/or to indicate where future psychometric and instrument development work is needed. A full assessment of reliability and validity includes internal consistency, reliability, measurement error, content validity, construct validity, criterion validity, and responsiveness, as defined through international expert consensus by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) group [11].

The COSMIN approach represents a structured way of assessing the psychometric evidence of available questionnaires across a number of agreed-upon criteria. Content validity is argued to be the most fundamental psychometric property and refers to the extent that the content of a measure adequately reflects the target construct that is being assessed [11]. It can be meaningfully subdivided into three components: relevance, comprehensiveness, and comprehensibility, which can be understood by asking three questions. First, are the items, response options, and recall period used relevant for the construct, target population, and context of use? Second, is the questionnaire fully comprehensive, or are key aspects of the construct of interest missing? Third, is the content of the questionnaire, including the items, understood by the target population as intended [12]? When assessing content validity, the initial development paper(s) on the instrument and the content of the instrument itself are assessed, as well as any content validity studies undertaken in the population of interest [12].

According to COSMIN, the second most important psychometric property is structural validity [13]. Structural validity describes the extent that scores generated from an instrument appropriately reflect the dimensions of the underlying construct being measured [14]. As QoL is usually theorised and assessed as being multidimensional, questionnaires designed to measure QoL should be dutifully assessed to check that they accurately represent the multidimensional structure of QoL in the target population. Alternatively, if instruments are developed to target a specific dimension of QoL, psychometric tests should be conducted to validate that they are unidimensional when completed by the population of interest. If instead questionnaires are used without accompanying evidence of their structural validity in the target population, interpretation of the data (e.g. through the generation of dimension scores) may not be accurate.

COSMIN has provided internationally-consensual definitions of other important measurement properties, which, when good content and structural validity are documented, all contribute to a measure’s psychometric performance [11]. These include internal consistency, or the degree that items measuring the same thing are interrelated with one another; reliability, describing the proportion of variance due to genuine differences among participants; measurement error, relating to the error in a participant’s response not attributable to genuine changes in the construct being measured; construct validity, which includes structural validity, but more broadly covers the extent to which scores of an instrument are consistent with hypothesised internal relationships, relationships to other measures, and/or differences between groups; criterion validity, or the extent to which scores reflect a “gold standard”; and responsiveness, or the ability of the questionnaire to detect change over time in the construct of interest.

This systematic review has been designed to evaluate the content and psychometric properties of instruments used to measure QoL in informal carers of people with DMD using the COSMIN approach [12, 15]. COSMIN methodology is increasingly used within systematic reviews evaluating the quality of QoL measures in particular health contexts [13, 16, 17, 18, 19], including in a recent review of self-report measures used to assess QoL in people with DMD [20], which contributed to the rationale for the development of a new condition-specific QoL measure in this population [21].

For the purposes of this review, we define an informal carer as someone providing care to a person with DMD with whom they have a non-professional caregiving relationship, including a parent or guardian, other family member, friend, neighbour, or other relative or non-kin (where a caregiving relationship is defined) [22]. We exclude children (under 16 years of age) and people who are providing care in a formal or professional capacity, such as personal assistants. We also exclude relations (such as siblings) where a caregiver role is not defined or made explicit. Further, we define QoL as multidimensional, featuring components of physical (e.g., pain/discomfort, mobility, fatigue), psychological (e.g., self-esteem, mood), and social (e.g., relationships with others, participation) wellbeing [23]. We operationalise QoL as inherently subjective and thus do not include assessments of objective function that may affect QoL. In this review, we include instruments with multiple items that measure at least one aspect of QoL in informal carers of people living with DMD. The objectives of this review are to:

  1. Identify which questionnaires have been used to assess QoL in informal carers of people with DMD.

  2. Evaluate the measurement properties, including the strength and quality of evidence, of questionnaires that have been used to measure QoL in informal carers of people with DMD.

  3. Make a recommendation for which questionnaire(s) (if any) are best suited to assess QoL in DMD informal caregivers, based on the current evidence, and identify gaps for future work.

Methods

The protocol for this review was registered with the International Prospective Register of Systematic Reviews (PROSPERO) (registration no: CRD42020200120) and can be accessed at: https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42020200120. The manuscript has been written using the PRISMA 2020 reporting guideline and checklist [24].

Search Strategy and Information Sources

Searches

An information specialist was consulted in developing the appropriate search strategy and was responsible for conducting the main database searches. Search terms in this review included:

  (i) Duchenne muscular dystrophy (and derivatives);

  (ii) a comprehensive list of carer terms;

  (iii) a comprehensive search filter developed by the Patient Reported Outcome (PROM) Group at the University of Oxford to identify questionnaires [25];

  (iv) questionnaires known to be used in carers of people with DMD based on an earlier rapid review of the literature [4]; and

  (v) a validated search filter for identifying studies on measurement properties, as recommended by the COSMIN group [26].

A two-stage search was used, where in the first stage the search terms (i) AND (ii) AND ((iii) OR (iv)) were combined to identify all articles using questionnaires to assess QoL in DMD carers. In the second stage, the names of questionnaires identified in stage one were combined with (i) AND (ii) AND (v) to identify articles reporting on the measurement properties of these instruments for DMD carers. No restrictions on date or language were applied to the search strategy. The two-stage search strategy allowed us to identify which instruments have been used and reported in studies of carers of people with DMD, in the absence of any evidence of reliability and validity for their use. Full copies of the searches are contained in Additional file 1.
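The Boolean logic of the two stages can be illustrated with a minimal sketch. This is not the actual database syntax (which is in Additional file 1): the labels (i) to (v) stand in for the full term lists, the function names are hypothetical, and joining the instrument names with OR in the second stage is an assumption made for illustration.

```python
def combine(op, *blocks):
    """Join search blocks with a Boolean operator and wrap them in parentheses."""
    return "(" + f" {op} ".join(blocks) + ")"


# Placeholder labels for the five sets of search terms listed above.
DMD, CARER, PROM_FILTER, KNOWN_QS, COSMIN_FILTER = "(i)", "(ii)", "(iii)", "(iv)", "(v)"


def stage_one_query():
    """Stage 1: (i) AND (ii) AND ((iii) OR (iv))."""
    return combine("AND", DMD, CARER, combine("OR", PROM_FILTER, KNOWN_QS))


def stage_two_query(instrument_names):
    """Stage 2: instrument names combined with (i) AND (ii) AND (v).

    Assumption: the individual names are joined with OR.
    """
    names = combine("OR", *[f'"{name}"' for name in instrument_names])
    return combine("AND", names, DMD, CARER, COSMIN_FILTER)


if __name__ == "__main__":
    print(stage_one_query())
    print(stage_two_query(["CarerQol", "Caregiver Strain Index"]))
```

Keeping the term lists as separate blocks means the second-stage query can be rebuilt mechanically once the first stage has produced the list of instrument names.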

Electronic databases

The electronic databases searched for the systematic review are outlined in Table 1. All databases were searched from inception.

Table 1 Electronic databases for the primary searches

Additional searches

Following recognised approaches [13, 20], we searched Google Scholar (last searched 5th July 2021) with the names of the instruments identified in the database searches and taken forward for review in order to identify potential development papers for assessing content validity (Footnote 1). The first 100 hits on Google Scholar were screened for inclusion. Where development papers were not found in this manner, manual searching of instrument citations in the included papers was conducted. In addition, citation tracking, by means of screening of references (via Scopus) and Google Scholar citations, was conducted on full text research articles (not development papers) meeting the eligibility criteria at Stage 2 (last searched 5th July 2021), as a supplementary measure to identify any additional studies not captured by the database searching [27].

Eligibility criteria

The following selection criteria were applied to the search results at Stage 1 (identifying instruments):

  • Full text original research article (i.e. not including abstracts, editorials, or reviews);

  • Published in English;

  • At least 75% of the sample, on which data from an instrument was reported, was formed of informal adult carers of people with DMD;

  • Used a self-reported, multi-item questionnaire to assess at least one aspect of QoL; and

  • Included a questionnaire that was validated in English, with a free/review copy that was available to access.

Additional selection criteria were applied at Stage 2 (evaluation of measurement properties):

  • Reports data on at least one measurement property of the instruments identified and taken forward for review in informal carers of people with DMD.

  • Development studies on the instruments identified in Stage 1, to assist with the assessment of content validity, were included in any form (i.e. journal article, book chapter, user manual etc.).

Selection Process

In order to apply the eligibility criteria for the selection of papers from search results, the following steps were performed by two independent reviewers at all stages (Footnote 2):

  (I) The titles and abstracts of records identified in the Stage 1 searches were screened against the Stage 1 eligibility criteria (as were any additional records in the Stage 2 searches or through citation tracking). Records were selected for full text review if deemed relevant, potentially relevant, or if doubt existed. All records that were selected for review by either reviewer were then subsequently reviewed at full text.

  (II) Full text articles identified in (I) were assessed for eligibility using the Stage 1 eligibility criteria. Any discrepancy was resolved through discussion and reasons for exclusion were documented.

  (III) Copies of the instruments identified in the articles in (II) were reviewed to ensure they met the eligibility criteria (i.e. assessed an aspect of QoL). If an English free/review copy of the instrument was not available (or not made available upon request) the questionnaire and corresponding article(s) were excluded from review.

  (IV) Full text articles meeting the Stage 1 inclusion criteria AND identified as potentially containing measurement properties using the COSMIN filter were screened using the Stage 2 eligibility criteria, using the title and abstract and full text approach as described above.

  (V) In order to identify development papers for the instruments identified for review in (II) and (III) and/or any potential missed articles from the database searches, Google Scholar search results, the results of citation tracking, and manual searching for development papers were screened for inclusion, first by title and abstract and then by full text as described above. Further, the citations of two previous reviews were screened for potentially relevant records not otherwise identified in earlier searches [4, 6].

  (VI) A manual review of any articles meeting eligibility criteria at Stage 1 was conducted for potential measurement properties that may have been missed by the COSMIN filter.

Data extraction and quality appraisal

Data extraction was undertaken independently by two reviewers using a pre-prepared data extraction sheet, with consensus reached through discussion. The data extraction sheet was first piloted (on two development paper articles and two measurement property articles), before being revised for further use. Extraction was informed by tools developed by COSMIN on reporting guidance: https://www.cosmin.nl/tools/guideline-conducting-systematic-review-outcome-measures/. A copy of all data extracted (including which data was sought) is in Additional file 2. Data on interpretability or feasibility of questionnaires (e.g. completion time) was not extracted as it was not typically reported.

COSMIN standards, via the COSMIN risk of bias checklist [28], were used to evaluate the methodological quality of instrument development papers and studies on their measurement properties (ranked on a four-point scale: “very good”, “adequate”, “doubtful” and “inadequate”). The checklist was applied independently by two reviewers, with consensus reached through discussion. Total ratings are determined using the lowest rating for any checklist item for that study (i.e. worst score counts).
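The "worst score counts" rule described above can be sketched as follows. The function name and the example item ratings are hypothetical; only the four-point scale and the lowest-rating rule come from the text.

```python
# The four-point scale, ordered from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def overall_quality(item_ratings):
    """Return the lowest rating among the checklist items (worst score counts)."""
    if not item_ratings:
        raise ValueError("at least one checklist item rating is required")
    return min(item_ratings, key=RATING_ORDER.index)


if __name__ == "__main__":
    # Hypothetical study rated on three checklist items.
    print(overall_quality(["very good", "adequate", "doubtful"]))
```

Under this rule a single weak checklist item determines the total rating for the study, regardless of how well the study scores on the other items.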

Assessment of content validity

The content validity of each instrument was assessed following published COSMIN guidance [12], which involves evaluating and synthesising evidence from three sources:

  (I) The quality of the instrument development;

  (II) The quality and results of any additional content validity studies (if available); and

  (III) An evaluation of the content of the instrument itself by the review team.

Ratings of relevance, comprehensibility, and comprehensiveness were made for each source of evidence separately and could be satisfactory (+), unsatisfactory (−), or indeterminate (?). Ratings for (I) and (II) were initially made independently by two reviewers, and, in the case of disagreement, consensus was reached following discussion.

In order to evaluate the instrument content (III), informal carers (parents) of people with DMD aged between 3 and 19 (identified through Duchenne UK) were included as part of the review team (15 mothers, 2 fathers). The carers rated a selection of instruments on 8 criteria across relevance (5 criteria), comprehensibility (2 criteria), and comprehensiveness (1 criterion). As a significant number of instruments were included in the review, they were distributed across the carer group, so that each instrument was rated by a minimum of three carers. Further, carers provided ratings for full instruments (i.e. how they are usually disseminated), rather than instrument subscales separately. While the COSMIN terminology was retained in rating sheets (for standardisation), elaborated instructions were provided to explain the concepts and ratings in lay terms (see Additional file 3). All documents were handled electronically.

For each criterion, carers could provide a rating of “positive” (+), “negative” (−), or “unsure” (?). For example, for comprehensiveness the criterion is “Are all key concepts included?”, rated as yes (+), no (−), or unsure (?). Ratings across reviewers were then synthesised using rules adapted from COSMIN, to account for more than three reviewers, as outlined in Table 2. As we wanted to put greater descriptive emphasis on the results of the review by Duchenne carers, we included two additional possible synthesised ratings: inconsistent trending towards positive (± (+)) and inconsistent trending towards negative (± (−)). These are not traditionally used in COSMIN, so we have included these to provide additional descriptive information for the carer ratings only. Reviewer ratings were then synthesised for each aspect of content validity (i.e. relevance, comprehensiveness, and comprehensibility) using rules defined by COSMIN [12] (Footnote 3).

Table 2 Data synthesis rules for carer reviewer ratings

Assessment of Psychometric Properties

Each source of evidence on the remaining measurement properties was evaluated against the COSMIN criteria for good measurement properties, using the same satisfactory (+), unsatisfactory (−), and indeterminate (?) ratings as described above [15]. The criteria for good measurement properties specify thresholds for evaluating the effect measure(s) for each measurement property and which data were eligible, such as Cronbach’s alpha for internal consistency; the full list of effect measures evaluated is described in the COSMIN manual [15]. All ratings were initially made independently by two reviewers and then ratified, with any disagreement resolved through discussion.
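To illustrate how an effect measure is compared against a threshold, the following minimal sketch computes Cronbach’s alpha from item-level scores and converts it to a rating. The 0.70 cut-off is an assumption for illustration (a commonly used value; the applicable thresholds are specified in the COSMIN manual [15]), and the sketch ignores any further conditions the criteria attach to the rating. The function names and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


def rate_internal_consistency(alpha, threshold=0.70):
    """Assumed rule: '+' if alpha meets the threshold, otherwise '-'."""
    return "+" if alpha >= threshold else "-"


if __name__ == "__main__":
    # Hypothetical scores from four respondents on a three-item subscale.
    scores = [[1, 2, 2], [2, 3, 3], [3, 3, 4], [4, 5, 5]]
    alpha = cronbach_alpha(scores)
    print(round(alpha, 2), rate_internal_consistency(alpha))
```

Because alpha is computed per scale or subscale, a value reported only as a range across subscales cannot be checked against the threshold in this way.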

COSMIN ratings for construct validity (convergent and known groups) and responsiveness require a priori criteria for the testing of hypotheses by the review team [15]. These are based on generic hypotheses provided in the COSMIN manual. The hypotheses were based on expected effect size magnitude (of r for convergent validity and d for between-group tests), which were either reported in the studies or calculated by the reviewers (Footnote 4). The COSMIN criterion is for the review team to judge whether ≥ 75% of the results are in accordance with these hypotheses (a + rating). The hypotheses used are in Table 3.

Table 3 Generic hypotheses used for the assessment of construct validity and responsiveness

Reviewers then made a judgment on the size of difference that they would expect given the comparison being made (see Additional file 2). For example, if the Hospital Anxiety and Depression Scale (HADS) depression subscale was compared to another depression measure then a correlation coefficient of r ≥ 0.5 was expected. Likewise, if a study compared QoL results for mothers of people with DMD against a control comparison group of mothers (who were not health-related carers), medium or large differences in the expected direction were rated as acceptable (i.e. d ≥ 0.5).
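The hypothesis-testing logic can be sketched as below: a between-group effect size (Cohen’s d with a pooled standard deviation) is calculated, each result is checked against its expected magnitude, and the ≥ 75% rule is applied. The function names and example data are hypothetical, and returning "−" when fewer than 75% of results match is an assumption, since the text only states the condition for a + rating.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = sqrt(
        ((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2)
        / (n_a + n_b - 2)
    )
    return (mean(group_a) - mean(group_b)) / pooled_sd


def rate_hypotheses(confirmed, threshold=0.75):
    """'+' if at least 75% of results agree with the a priori hypotheses.

    `confirmed` holds one boolean per result. An empty list is rated '?';
    returning '-' below the threshold is an assumption.
    """
    if not confirmed:
        return "?"
    return "+" if sum(confirmed) / len(confirmed) >= threshold else "-"


if __name__ == "__main__":
    # Hypothetical burden scores for carers and a comparison group.
    d = cohens_d([2, 4, 6], [1, 3, 5])
    results = [d >= 0.5, True, True, False]  # e.g. d >= 0.5 was expected
    print(d, rate_hypotheses(results))
```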

Evidence synthesis

Individual ratings for each measurement property were qualitatively synthesised using a priori rules based on those recommended by COSMIN (see Table 4) [12, 15]. Based on these rules, each instrument could receive an overall (synthesised) rating of sufficient (+), insufficient (−), or inconsistent (±) for each measurement property (with content validity additionally split into relevance, comprehensibility, and comprehensiveness). For example, if the rating of the instrument development was satisfactory (+) and the carer rating was satisfactory (+) for relevance, then the overall synthesised rating for relevance for that questionnaire would be satisfactory (+). Content validity was evaluated for the total instrument (except for the State-Trait Anxiety Inventory [STAI] form X state and trait versions, which were considered separable). Where a total score was not available, subscales of instruments were rated for other measurement properties (where evidence was available).

Table 4 Data synthesis rules for each measurement property

As recommended by COSMIN, a different weight was applied to development studies than to the reviewer ratings, for the rating of content validity (and its subcomponents) only [12]; we chose to give more weight to reviewer ratings than to the development study. This departs from traditional COSMIN recommendations to place more weight on published literature, but was decided upon because carers represent the target population for this review and are likely to be better judges of instruments’ content validity (as the vast majority of instruments were not developed in a Duchenne carer setting).

In the final step, the quality of evidence was evaluated via a modified Grading of Recommendations, Assessment, Development and Evaluations (GRADE) approach [29], and categorised as “high”, “moderate”, “low”, or “very low”. The quality of evidence rating incorporates down-grading based on the risk of bias evaluation noted above (which includes limited or missing evidence); imprecision (based on pooled sample size); inconsistency in evidence; and indirectness (of sources of evidence) [15]. Full details on how all the above criteria are applied are detailed elsewhere in comprehensive COSMIN manuals [12, 15].
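The down-grading logic can be sketched as follows. This is an illustrative simplification, not the full COSMIN procedure: it assumes that evidence starts at “high” and that each of the four factors removes a whole number of levels, with the result floored at “very low”. The function name and the number of levels removed in the example are hypothetical.

```python
# Quality of evidence levels, ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]


def grade_quality(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Start at 'high' and move down one level per downgrade point (assumed scheme)."""
    downgrades = [risk_of_bias, inconsistency, imprecision, indirectness]
    if any(points < 0 for points in downgrades):
        raise ValueError("downgrade points cannot be negative")
    index = max(0, len(LEVELS) - 1 - sum(downgrades))
    return LEVELS[index]


if __name__ == "__main__":
    # Hypothetical case: one level down for risk of bias, one for imprecision.
    print(grade_quality(risk_of_bias=1, imprecision=1))
```

The floor at “very low” reflects that once several factors have been down-graded, additional concerns cannot lower the rating further.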

Results

Searches and study inclusion

The results of the searches and study selection are summarised in the PRISMA diagram in Fig. 1. In Stage 1, a total of 1531 records were identified via database searching, from which 553 duplicates were removed and 978 were screened at title and abstract. A total of 100 were further assessed for eligibility at full text, from which 76 were rejected and 24 were included in the review. Cohen’s kappa of inter-rater reliability for full text review at Stage 1 was κ = 0.73, which can be interpreted as ‘substantial’ agreement [30]. In Stage 2, 81 records were identified, from which 48 duplicates were removed and 33 were screened at title and abstract. Fifteen records were assessed for eligibility against the Stage 2 eligibility criteria at full text (13 of which had already been accepted in Stage 1), from which 3 were rejected and 12 were accepted as having evidence of measurement properties. Cohen’s kappa at Stage 2 full text review was κ = 0.59 (‘moderate’ agreement). Finally, after removing duplicates, 5306 records were screened from additional sources (i.e., Google Scholar searches, citation tracking, previous reviews, and manual searching for development papers). From these, 141 were sought for retrieval and 110 were reviewed at full text (of the 31 not retrieved, 12 were duplicates, 9 were no longer available, 6 were not in English, and 4 were neither a full text article in DMD carers nor a development paper). Of the 110 reviewed, 70 were rejected and 40 accepted (7 of these were DMD studies and 33 development papers). Cohen’s kappa for the full text review from additional sources was κ = 0.74 (‘substantial’ agreement).
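For readers unfamiliar with the agreement statistic, the sketch below shows how Cohen’s kappa is computed from two reviewers’ include/exclude decisions and mapped to a verbal label. The interpretation bands are the widely used ones often attributed to Landis and Koch; that these are the bands underlying the labels above is an assumption, and the function names and example decisions are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of records")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Verbal label for kappa (assumed bands: 0.41-0.60 moderate, 0.61-0.80 substantial)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical full text decisions by two reviewers.
    reviewer_1 = ["include", "include", "exclude", "exclude"]
    reviewer_2 = ["include", "exclude", "exclude", "exclude"]
    kappa = cohens_kappa(reviewer_1, reviewer_2)
    print(kappa, interpret_kappa(kappa))
```

Under these bands, the reported values of 0.73 and 0.74 fall in the “substantial” range and 0.59 in the “moderate” range.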

Fig. 1 PRISMA flow diagram of study searches (adapted from [24])

In addition to the 15 articles reviewed for measurement properties identified using the COSMIN filter in the Stage 2 database searches, the DMD carer studies added from additional sources or otherwise meeting the eligibility criteria at Stage 1 were manually screened for evidence of measurement properties. This resulted in an additional 12 articles being included in the review with data on at least one measurement property.

To summarise, 31 records were included in the review where a multi-item QoL instrument meeting the inclusion criteria had been used in a published study with DMD carers; 24 of these contained evidence of measurement properties (7 did not). A further 34 development papers for these instruments were included (33 of these came from additional searches and 1 was both a development paper and DMD carer study identified in the primary database searching).

Questionnaires identified for review

From the searches in Stage 1, 58 questionnaires (from 34 articles) were considered for potential inclusion. After a review of their content, 30 were taken forward for COSMIN review (10 were excluded due to being inaccessible or behind a paywall; 9 were judged as not assessing QoL; 4 were not a caregiver measure; 2 had no validated English version available; 1 was not self-report; 1 was a single-item instrument; and 1 was a duplicate). Two additional instruments were added from the additional sources, giving a total of 32 questionnaires for review. The questionnaires taken forward for review are summarised in Table 5 (see Additional file 4 for a full list of the 60 questionnaires identified in the searches, with reasons for exclusion).
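The questionnaire counts in this paragraph can be reconciled with a short tally. All figures are taken from the text above; the function and variable names are hypothetical.

```python
# Exclusion reasons and counts as reported in the paragraph above.
EXCLUSIONS = {
    "inaccessible or behind a paywall": 10,
    "not assessing QoL": 9,
    "not a caregiver measure": 4,
    "no validated English version": 2,
    "not self-report": 1,
    "single-item instrument": 1,
    "duplicate": 1,
}


def tally(considered=58, added_from_additional_sources=2):
    """Return (taken forward, total for review, total identified)."""
    taken_forward = considered - sum(EXCLUSIONS.values())
    total_for_review = taken_forward + added_from_additional_sources
    total_identified = considered + added_from_additional_sources
    return taken_forward, total_for_review, total_identified


if __name__ == "__main__":
    print(tally())
```

The tally reproduces the 30 questionnaires taken forward from Stage 1, the 32 reviewed in total, and the 60 listed in Additional file 4.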

Table 5 Summary of the 32 instruments used to assess carer QoL in DMD from the full-texts meeting the Stage 1 eligibility criteria (n = 31)

COSMIN Evaluation of measurement properties

The overall results of the COSMIN evaluation of measurement properties for the instruments included in the review are summarised in Table 6. Full rating sheets on which this evaluation is based are included in Additional file 5. Of note is the lack of published evidence for many measurement properties of these instruments in Duchenne carers. The Zarit Burden Inventory (ZBI) 22-item had the best breadth of evidence, due to a dedicated study exploring selected psychometric properties in carers of people with DMD [60]. However, evidence on responsiveness was still missing. Furthermore, no evidence on reliability, measurement error, or criterion validity was recorded for any of the questionnaires (not shown in Table 6).

Table 6 Overall rating and quality of evidence of measurement properties for included instruments against COSMIN criteria

Content validity

A total of 34 development papers were evaluated using COSMIN methodology, with the development paper ratings from six instruments (Beck Depression Inventory [BDI], EQ-5D-3L, Hospital Anxiety and Depression Scale [HADS], 36-Item Short Form Survey [SF-36], Satisfaction with Life Scale [SWLS], WHO Quality of Life-BREF [WHOQOL-BREF]) extracted from a prior review [20]. Key details from these papers are summarised in Table 7, including the COSMIN rating and whether carers were involved in the development of the instrument. All but two of the instruments (EQ-5D-5L and WHOQOL-BREF) received an inadequate rating for the methodological quality of the development phase. This inadequate rating was primarily driven by the instrument development study not being performed in a sample representing the measure’s target population. In fact, of the instruments in this review only three featured a concept elicitation/development study of some form (Caregiver Strain Index [CSI], EQ-5D-5L, Questionnaire on Resources and Stress [QRS]); content for the rest was derived from reviewing the literature, existing measures, and/or expert/researcher judgment. Strikingly, only two instruments had carers involved in some form in the development of the measure (CSI, WHOQOL-BREF).

Table 7 Summary and assessment of development papers for the instruments included in the review

A total of 7 instruments featured some form of piloting/cognitive interviewing during their development (Care-related Quality of Life Instrument [CarerQoL], Caregiver Well-being Scale [CWBS], EQ-5D-5L, Family Problems Questionnaire [FPQ], Female Sexual Function Index [FSFI], QRS, State-Trait Anxiety Inventory form X [STAI-X]), during which participants were asked about the measure’s comprehensibility. Comprehensiveness was also probed for 2 of these instruments (CarerQoL, FPQ). Aside from the EQ-5D-5L, where comprehensibility was explored using a focus group methodology [61], the pilot studies either did not use qualitative methods or reported their methods poorly. In short, there was little evidence of any robust qualitative methods in the development of these carer instruments.

Table 8 summarises the synthesised ratings for the content validity of the evaluated instruments, based on the available evidence and synthesised DMD carer reviewer ratings. Ratings are split into relevance, comprehensiveness, and comprehensibility. CarerQoL performed best in the ratings of instrument development. Carer ratings were mixed, with considerable inconsistency. No one instrument received a positive rating across all aspects of content validity from reviewers. For carer ratings, the best performing instrument was the PedsQL Family Impact Module (PedsQL FIM). The joint worst performing instruments were the SWLS, 12-Item Short Form Survey (SF-12), FSFI, and Caregiver Strain Index Plus (CSI+). Primarily due to a lack of evidence and inconsistent ratings across carers, the overall rating for the content validity of all instruments evaluated in this study was inconsistent. No studies were identified which had independently assessed the content validity of the QoL instruments in samples of carers of people with DMD, which contributed to the low quality of evidence observed.

Table 8 COSMIN ratings of the relevance, comprehensiveness, and comprehensibility of instruments used to assess carer quality of life in DMD

Structural validity

Only one study had assessed the structural validity of an instrument evaluated in this review, the ZBI (22-item) [60]. Landfeldt et al. (2019) examined the structural validity of the ZBI (22-item) using a Rasch partial credit model in a study with a high quality of evidence and found this measurement property was unsatisfactory in DMD carers. The results are summarised in Table 9.

Table 9 Results of studies assessing structural validity of the instruments included in the review

Internal consistency

Seven studies were identified which assessed the internal consistency of an instrument and/or its subscales. The results are summarised in Table 10. Most instruments evaluated demonstrated a satisfactory rating for internal consistency, with a moderate or high quality of evidence. Exceptions were the FPQ, which was indeterminate as the Cronbach’s alpha value was reported as a range across all subscales; the QRS, which had a low quality of evidence; and the Social Networks Questionnaire (SNQ) (subscale A), which received an unsatisfactory internal consistency rating.
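For readers less familiar with the statistic behind these ratings, the following is a minimal sketch of how Cronbach’s alpha is computed from item-level responses. It is illustrative only and is not taken from any of the included studies; the function name `cronbach_alpha` and the example scores are hypothetical. It uses the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one scale.

    responses: list of respondents, each a list of k numeric item scores
    (hypothetical input format chosen for this illustration).
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total (summed) score
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores from three respondents on a two-item scale
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

Because alpha is defined per scale, a single value is needed for each subscale before it can be judged; a range reported across subscales, as for the FPQ, does not allow this.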

Table 10 Results of studies assessing internal consistency of the instruments included in the review

Hypotheses testing for construct validity

Table 11 summarises the results of studies with evidence on the construct validity of the instruments included in the review. Evidence on construct validity was observed for 30 instruments/instrument subscales, from a total of 19 studies, featuring a mixture of convergent (i.e. correlational) and known groups validity. Performance of the instruments against reviewer a priori defined hypotheses was mixed and the quality of evidence ranged from very low to high. Some instruments, such as the WHOQOL-BREF and some SF-36 subscales, performed inconsistently with a moderate or high quality of evidence. Others, such as the Family APGAR (FAPGAR), HADS and other SF-36 subscales, performed well with a high quality of evidence.

Table 11 Results of studies assessing construct validity of the instruments included in the review

Cross-cultural validity/measurement invariance

Landfeldt et al. was the only study identified in the review that evaluated the measurement invariance of an included instrument, the ZBI (22 item), using differential item functioning [60]. Measurement invariance was observed (i.e. no differential item functioning) using the criteria adopted in the study, giving the ZBI (22 item) a satisfactory rating on that measurement property. However, this was based on a very low quality of evidence, as it was doubtful that groups were similar except for the grouping variable and the group sample sizes were lower than recommended by COSMIN. The results are summarised in Table 12.

Table 12 Results of studies assessing measurement invariance of the instruments included in the review

Responsiveness

One study, with a moderate quality of evidence, was identified which assessed responsiveness of four of the instruments included in this review in carers of people with DMD [51]. The results are summarised in Table 13. Both the Psychological Adaptation Scale (PAS) and Worry about Care for Child with DBMD (WAC-DBMD) received satisfactory ratings, and the Perceived Personal Control Questionnaire (PPC) and ZBI (12 item) received unsatisfactory ratings, based on reviewers’ a priori hypotheses.

Table 13 Results of studies assessing responsiveness of the instruments included in the review

Other measurement properties

No studies were found that contained evidence on the reliability, measurement error, or criterion validity of any of the instruments included in this review.

Discussion

This systematic review was designed to identify instruments used to assess elements of QoL in informal carers of people with DMD and evaluate the published evidence on their measurement properties in this population. Overall, there was a picture of low quality or missing psychometric evidence across a variety of measurement properties for the instruments identified. The majority of the measures did not involve carers in their development and there were no content validity studies in DMD caregivers to assess their suitability (in terms of their relevance, comprehensiveness, and comprehensibility). This, combined with inadequate or doubtful instrument development studies by COSMIN standards, and mixed caregiver ratings of the instruments themselves, led to inconsistent results for content validity, based on a low quality of evidence. Furthermore, only one study assessed the structural validity of an included instrument in DMD carers, revealing unsatisfactory results [60]. These two measurement properties (content and structural validity) are considered the most important in the COSMIN framework [13, 107, 108], and the finding that evidence on them is lacking and/or unsatisfactory for DMD caregivers is revealing. Instead, the questionnaires included in this review have been used in DMD studies by researchers without also assessing or confirming they are reliable and valid for use with DMD carers. For example, the ZBI (22 item) is one of the most popular tools used in DMD carers [31, 41, 42, 58, 59, 60], but has unsatisfactory measurement properties, including elements of content validity and structural validity. This ultimately puts the validity of the conclusions from studies using such instruments into question.

As no previous content validity studies on QoL instruments have been conducted with DMD carers, the ratings provided by carer team members in this review represent the first insight into how people asked to complete these instruments evaluate them, in terms of their relevance, comprehensiveness, and comprehensibility. This is a strength of the review. Incorporating consideration of the lived experience into the assessment of existing instruments not only adds to the validity of the findings of the review, but also highlights some of the inadequacies of the questionnaires themselves. While it was conducted using COSMIN procedures, it should be acknowledged that this is a limited assessment of how DMD carers responded to a selection of the instruments included in this review. As the number of instruments was large, it was not possible to have the same carers rating all of the instruments, so individual differences in interpretation and rating are not held consistent. Further, ratings were completed individually and synthesised, not arrived at through consensus. Thus, this is not a full content validity study and further work is urgently needed, which would benefit from in-depth qualitative techniques. Nevertheless, this does provide the first, preliminary insight into how these instruments perform in the eyes of Duchenne carers. From this insight, the PedsQL FIM had the most potential as a QoL measure for Duchenne carers.

Aside from content and structural validity, internal consistency was a measurement property that was quite frequently reported, often with satisfactory results (with the exception of SNQ and FPQ). Mixed results were observed on construct validity, but it should be acknowledged that evaluation of this psychometric property is determined by a priori reviewer-generated hypotheses and expectations about how QoL instruments and known-group criteria should be related [15]. The evidence differs across all studies (i.e. in terms of what a QoL instrument is chosen to be compared to by researchers) and thus not all instruments are subjected to the same test of validity. There were also only a handful of studies on measurement invariance and responsiveness. While some measures performed well on these criteria, it is our view that these should not be used to advocate the use of an instrument in the absence of good content and structural validity, the two most important measurement properties [13].

An aim of this review was to make a recommendation for which questionnaire(s) (if any) are best suited to assess QoL in DMD informal caregivers. Making such a recommendation is difficult as there was no instrument with evidence that excelled across all measurement properties (or even across the foundational measurement properties of content and structural validity) and the quality of available evidence was often low. Further, many of the instruments identified in this review were designed/used to assess only one aspect of carer QoL, rather than QoL as a whole. CarerQoL performed best in terms of instrument development, but was inconsistent in carer reviewer ratings, and had no additional evidence on its psychometric properties. PedsQL FIM received the best ratings from carer reviewers and while the instrument received an unsatisfactory rating for construct validity, this was based on a very low quality of evidence. Our recommendation is thus, first and foremost, for additional high-quality research into the measurement properties of instruments included in this review in Duchenne caregivers. In the interim, we recommend that the PedsQL FIM is considered for future use and evaluation as a multidimensional QoL instrument that appears to be received well by Duchenne caregivers.

It was of interest to note, during the sifting of literature for this review, that a number of qualitative studies exploring the impact of caring for individuals with DMD were identified (e.g. [109,110,111]). Furthermore, work is continuing to emerge in this area [112]. Whilst these were not selected for inclusion within this review (due to predetermined inclusion criteria), it is clear that there is a body of evidence on this important topic. Consideration, and potential synthesis, of such studies could be a meaningful area of future study. It is possible that existing qualitative literature highlights aspects of carer QoL that are not captured when measuring QoL using any of the instruments identified in this review. Furthermore, existing qualitative literature may also identify any potential cultural and/or country differences which may be important. Given that DMD is a rare condition, large-scale prospective studies of carer QoL can only be achieved using a multi-country recruitment approach. Ensuring that any instrument used to measure carer QoL is culturally appropriate will be necessary.

The focus of this review was to report on the measurement properties of instruments that have been used to quantify QoL of carers of individuals with DMD. However, it is clear when applying COSMIN methodology that the content validity of the instruments identified was questionable. It could be argued that the appropriateness of such questionnaires to assess carer QoL for other health conditions is not justified. Whilst there is still a requirement to assess the relevance, comprehensiveness and comprehensibility of the instruments for other health conditions, this does not overcome the limited evidence for the content validity (i.e. development) of the measures themselves. This review has highlighted the need for future studies to support the content validity of instruments for the target population. It can be postulated that other neuromuscular disorders could imply similar impacts upon carer QoL; however, this has not been explored within the context of this review. Furthermore, there are other instruments available which can be used to measure carer QoL which were not included in this review (as they had not been used in studies relating to DMD).

This review is not without its limitations. The methodological approach adopted is recognised and robust, but it has some drawbacks, as previously noted [16, 20]. Firstly, the COSMIN appraisal tools assume a "worst score counts" system. If a study fails to report key details, this results in a reduced rating of the instrument to doubtful or inadequate. Secondly, many of the questionnaires identified in the review could be considered as legacy measures. They were developed at a time when detailed descriptions of instrument development were not necessarily reported and/or different methods for instrument design were accepted. The COSMIN approach is such that these instruments score poorly. It is important to recognise that this does not necessarily mean that the development of these instruments was fundamentally flawed or inappropriate such that they have no utility whatsoever, but that an assessment of the available evidence by modern standards has found them lacking. Thirdly, whilst the lived experience (i.e. carer perspective) was incorporated into this review, it must be acknowledged that there may be a degree of bias associated with the responses. Whilst efforts were made to mitigate this (by providing average ratings, i.e. obtaining more than one carer rating per instrument), it must be noted that all respondents were from the UK. It is possible that their experiences (and their experiences of the UK health and social care systems) may have influenced their ratings. It is not clear whether the findings are replicable in other countries. In addition, the vast majority of informal carer ratings were provided by mothers. Indeed, a number of the studies included in the review also assessed the impact on mothers’ QoL (presumably with the assumption that mothers are usually the primary caregiver). However, it can be argued that modern-day parenting situations and roles have changed over recent years, and it cannot be assumed that paternal ratings of the instruments included in the review would align with maternal views.
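The "worst score counts" principle noted among the limitations above can be illustrated with a short sketch. This is a simplified illustration rather than the COSMIN tooling itself: it assumes the four-point scale used in the COSMIN checklists (very good, adequate, doubtful, inadequate), and the function name `worst_score_counts` and the example ratings are hypothetical.

```python
# Ratings ordered from worst to best (assumed four-point COSMIN scale)
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score_counts(standard_ratings):
    """Return the overall study rating: the lowest rating given to any standard."""
    return min(standard_ratings, key=RATING_ORDER.index)


# Hypothetical study: strong on most standards, but one key detail unreported
overall = worst_score_counts(["very good", "very good", "adequate", "doubtful"])
print(overall)  # doubtful
```

A single unreported detail is therefore enough to pull down the rating of an otherwise well-conducted study, which is why older instruments with sparse development reporting tend to score poorly.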

Due to the large number of instruments included within our review, and the focus on quality of life as a multidimensional construct, we made a pragmatic decision to present the results of instruments, rather than individual subscales (where appropriate). Furthermore, in the assessment of comprehensiveness of the instrument, we applied this to the construct of overall quality of life (rather than the construct the instrument may have been designed to measure). Finally, the assessment of measurement properties of the instruments undertaken here does not incorporate consideration of the acceptability or feasibility of the identified measures. These are also important factors when assessing their suitability. For example, this may include the length of the instrument and how cognitively demanding it is. Practical issues of PROM availability, such as costs and licensing requirements, availability in all required languages, and mode of administration (i.e. electronic versus paper) also play a key role. This review was limited to those instruments where a free or review copy was available for research purposes.

Conclusion

The instruments used to measure impact on Duchenne carer quality of life have limited psychometric evidence to support their use. Consequently, the published evidence reporting QoL in carers of people with DMD may not accurately reflect the true impact of caregiving on QoL. Further work is thus required to investigate the measurement properties of common QoL measures in DMD carers, including content validity studies. Research should also examine whether a) the constructs of the instruments identified as part of this review map onto a conceptual framework of carer quality of life in DMD; and b) this differs for other paediatric life-limiting conditions. Given the results of this review, work may also be justified in the development of condition-specific carer QoL measures (or within paediatric life-limiting conditions) for use in DMD to better capture the true impacts of the condition on carers. In the interim, we recommend the consideration of the PedsQL FIM as a QoL measure in Duchenne carers, as it showed most promise from evaluation by carers themselves.