The suitability of patient-reported outcome measures used to assess the impact of hypoglycaemia on quality of life in people with diabetes: a systematic review using COSMIN methods

Aims/hypothesis It is generally accepted that hypoglycaemia can negatively impact the quality of life (QoL) of people living with diabetes. However, the suitability of patient-reported outcome measures (PROMs) used to assess this impact is unclear. The aim of this systematic review was to identify PROMs used to assess the impact of hypoglycaemia on QoL and to examine their quality and psychometric properties. Methods Systematic searches (MEDLINE, EMBASE, PsycINFO, CINAHL and The Cochrane Library databases) were undertaken to identify published articles reporting on the development or validation of hypoglycaemia-specific PROMs used to assess the impact of hypoglycaemia on QoL (or domains of QoL) in adults with diabetes. A protocol was developed and registered with PROSPERO (registration no. CRD42019125153). Studies were assessed for inclusion at title/abstract stage by one reviewer. Full-text articles were scrutinised where considered relevant or potentially relevant, or where doubt existed. Twenty per cent of articles were assessed by a second reviewer. PROMs were evaluated according to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines, and data were extracted independently by two reviewers against COSMIN criteria. Assessment of each PROM’s content validity included reviewer ratings (N = 16) of relevance, comprehensiveness and comprehensibility by researchers (n = 6), clinicians (n = 6) and adults with diabetes (n = 4).
Results Of the 214 PROMs used to assess the impact of hypoglycaemia on QoL (or domains of QoL), eight hypoglycaemia-specific PROMs were identified and subjected to full evaluation: the Fear of Hypoglycemia 15-item scale; the Hypoglycemia Fear Survey; the Hypoglycemia Fear Survey version II; the Hypoglycemia Fear Survey-II short-form; the Hypoglycemic Attitudes and Behavior Scale; the Hypoglycemic Confidence Scale; the QoLHYPO questionnaire and the Treatment-Related Impact Measure-Non-severe Hypoglycemic Events (TRIM-HYPO) questionnaire. Content validity was rated as ‘inconsistent’, with evidence mostly of ‘(very) low’ quality, while structural validity was deemed ‘unsatisfactory’ or ‘indeterminate’. Other measurement properties (e.g. reliability) varied, and evidence gaps were apparent across all PROMs. None of the identified studies addressed cross-cultural validity or measurement error. Criterion validity and responsiveness were not assessed because there is no ‘gold standard’ measure of the impact of hypoglycaemia on QoL against which to compare the PROMs. Conclusions/interpretation None of the hypoglycaemia-specific PROMs identified had sufficient evidence to demonstrate satisfactory validity, reliability and responsiveness. All were limited in terms of content and structural validity, which restricts their utility for assessing the impact of hypoglycaemia on QoL in clinical or research settings. Further research is needed either to address the content validity of existing PROMs or to develop new PROM(s) for assessing the impact of hypoglycaemia on QoL. PROSPERO registration CRD42019125153 Supplementary Information The online version contains peer-reviewed but unedited supplementary material available at 10.1007/s00125-021-05382-x.


Introduction
Both the experience and the risk of hypoglycaemia can have a serious negative impact on the quality of life (QoL) of adults with diabetes [1–8]. Living a life of quality is perhaps the ultimate goal; for people experiencing or at risk of hypoglycaemia, protecting QoL is a daily burden, and one that can conflict with the goals of medical therapy [8]. This may be particularly true for those who aim for very tight glucose targets. The extent of this impact on QoL can be assessed using patient-reported outcome measures (PROMs), questionnaires that can be used in research and/or clinical care. PROMs complement objective data (e.g. actual blood glucose levels) by capturing the individual's experiences in a quantifiable and standardised manner across a range of concepts, e.g. health-related QoL, satisfaction with treatment or emotional well-being [9,10]. When applied to the study of hypoglycaemia in diabetes, PROMs can facilitate an assessment of the psychological and economic burden of hypoglycaemia, which can be used to determine the value of therapeutic approaches to reducing hypoglycaemia frequency and severity.
Given the large number of PROMs available, it can be challenging to determine which PROM(s) to select for a given clinical or research purpose. Factors such as response burden (e.g. mode of administration, number of items [questions]), type of PROM (generic or condition-specific) and the purpose of the data collection will influence choice. However, a more fundamental issue is whether the PROM has been evaluated as 'fit for purpose'. This evaluation should include assessment of three overall domains (validity, reliability and responsiveness), for which consensus-based standards (COnsensus-based Standards for the selection of health Measurement INstruments [COSMIN]) can be applied [11]. The COSMIN methodology and standards derive from widespread international expert consensus [11,12] and have been applied to other PROMs [13–17], but not yet to PROMs assessing the impact of hypoglycaemia on QoL.
QoL is highly subjective and has been defined in many ways; most people intuitively understand what it means to them [18]. Perhaps the simplest definition is that QoL is a personal evaluation of how good or bad one's life is [19]. For the purpose of this review, and consistent with the general consensus [9], we operationalised QoL as: (1) a multidimensional construct including components such as physical well-being (e.g. pain/discomfort, mobility, fatigue), psychological well-being (e.g. mood, fear, confidence) and social well-being (e.g. stigma, participation) [20]; (2) a subjective construct based on feelings, values, experiences and priorities (therefore, we do not include objective measures, or purely functional performance or assessment instruments); and (3) a dynamic construct, which changes over time according to the person's priorities, experiences and situation.
The objectives of this review were to: (1) identify PROMs used to assess the impact of hypoglycaemia on QoL in adults with diabetes; and (2) formally evaluate their content validity, structural validity and other measurement properties. Our intention was to provide researchers and clinicians with a robust evidence base to assist them when selecting PROMs for this purpose. The review was undertaken as part of the Hypoglycaemia REdefining SOLutions for better liVEs (Hypo-RESOLVE) project, an international collaboration of clinicians, scientists, industry partners and people with diabetes [21].
Methods
Data sources and searches A protocol was developed and registered with PROSPERO [25]. A systematic literature search was conducted on 26–28 November 2018 to identify published evidence around four concepts: (diabetes) and (hypoglycaemia) and (psychosocial outcomes) and (measurement properties of measurement instruments). The databases searched were MEDLINE, EMBASE, PsycINFO, CINAHL and The Cochrane Library. Terms for psychosocial outcomes were chosen to include both generic 'umbrella' terms for 'quality of life' and 'well-being' (sourced from published search filters) and specific psychosocial outcomes of diabetes known to the Hypo-RESOLVE team (e.g. fear of hypoglycaemia). To retrieve studies on measurement properties, a validated search filter devised for identifying such studies in PubMed was used [26]. An example search strategy is shown in the electronic supplementary material (ESM) Methods.
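The search logic described above (synonyms ORed within each concept, concepts ANDed together) can be sketched as follows. This is a minimal illustration only: the synonym lists are hypothetical placeholders, not the review's actual terms (those are in the ESM Methods), and the output uses generic Boolean syntax rather than the field tags of any particular database interface.

```python
# Hypothetical synonyms for each concept; the review's actual search terms are
# in the ESM Methods and are NOT reproduced here.
CONCEPT_BLOCKS = {
    "diabetes": ["diabetes", "diabetic"],
    "hypoglycaemia": ["hypoglycaemia", "hypoglycemia"],
    "psychosocial outcomes": ["quality of life", "well-being", "fear of hypoglycaemia"],
    "measurement properties": ["psychometric", "validity", "reliability"],
}


def or_block(terms):
    """OR together the synonyms for one concept, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(blocks):
    """AND the concept blocks together, so a record must match every concept."""
    return " AND ".join(or_block(terms) for terms in blocks.values())


print(build_query(CONCEPT_BLOCKS))
```

Keeping each concept as its own OR block means a synonym can be added to one concept without touching the others, which mirrors how the four concepts are described in the text.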
Study selection Inclusion criteria covered any study design that included the primary development and/or validation of a hypoglycaemia-specific PROM used to assess the impact of hypoglycaemia on QoL in adults diagnosed with any type of diabetes (e.g. type 1, type 2 or gestational) who had experienced hypoglycaemia. Studies of hypoglycaemia/hypoglycaemic episodes not associated with diabetes were excluded, as were commentaries, reviews, opinion pieces and any other non-empirical work. Studies were assessed for inclusion at title and abstract stage by one reviewer (JL). Full-text articles were scrutinised where considered relevant or potentially relevant, or where doubt existed. Twenty per cent of studies were assessed by a second reviewer (JC) to check for consistency. Disagreements were resolved through discussion.
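The 20% consistency check could be organised as sketched below. The review does not say how its 20% subset was chosen or which agreement statistic (if any) was computed, so both the seeded random sampling and the simple percent-agreement measure here are assumptions made for illustration.

```python
import math
import random


def second_reviewer_sample(record_ids, fraction=0.2, seed=2018):
    """Draw a reproducible random subset of records for double screening.

    Assumption: simple random sampling with a fixed seed; the review does not
    state how its 20% subset was selected.
    """
    ids = list(record_ids)
    k = math.ceil(len(ids) * fraction)
    return random.Random(seed).sample(ids, k)


def percent_agreement(decisions_a, decisions_b):
    """Proportion of double-screened records given the same include/exclude call."""
    shared = decisions_a.keys() & decisions_b.keys()
    if not shared:
        raise ValueError("no records were screened by both reviewers")
    agreed = sum(decisions_a[r] == decisions_b[r] for r in shared)
    return agreed / len(shared)


subset = second_reviewer_sample(range(3661))
print(len(subset))  # 733 of the 3661 unique records
```

Records on which the reviewers disagree would then go to discussion, as described above; the agreement figure only summarises how often that happened.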
Data extraction Data extraction included study characteristics (e.g. language; participant characteristics; recall period; analysis model), a brief summary of results and measurement properties of the PROMs. Primary outcomes included measurement properties of identified PROMs, consistent with the COSMIN checklist: PROM development; content validity; structural validity; internal consistency; cross-cultural validity/measurement invariance; reliability; measurement error; criterion validity; hypothesis testing for construct validity; and responsiveness. Definitions of the measurement properties are detailed in Table 1. In accordance with COSMIN guidelines, all data relating to PROM measurement properties were extracted independently by two reviewers (JL and JC) against the respective COSMIN criteria. Discrepancies were resolved through discussion.
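Independent double extraction against the ten COSMIN measurement properties listed above can be illustrated with a small discrepancy check. This is a sketch under the assumption that each reviewer's extraction is stored as a mapping from property name to COSMIN quality rating; the example ratings are invented.

```python
# The ten measurement-property headings listed above, in COSMIN checklist order.
COSMIN_PROPERTIES = (
    "PROM development",
    "content validity",
    "structural validity",
    "internal consistency",
    "cross-cultural validity/measurement invariance",
    "reliability",
    "measurement error",
    "criterion validity",
    "hypothesis testing for construct validity",
    "responsiveness",
)


def extraction_discrepancies(reviewer_a, reviewer_b):
    """List the properties on which two independent extractions disagree.

    A property recorded by only one reviewer also counts as a discrepancy;
    flagged items are then resolved through discussion.
    """
    return [p for p in COSMIN_PROPERTIES if reviewer_a.get(p) != reviewer_b.get(p)]


# Invented ratings for one study, for illustration only.
reviewer_1 = {"structural validity": "very good", "reliability": "adequate"}
reviewer_2 = {"structural validity": "very good", "reliability": "doubtful"}
print(extraction_discrepancies(reviewer_1, reviewer_2))  # ['reliability']
```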
Content validity assessment Content validity is the extent to which a PROM is deemed to reflect the construct of interest and is, arguably, the most fundamental aspect of scale selection [27]. The methodological quality of the PROM development studies and of other studies supplementing content validity was assessed using COSMIN standards [28]. The assessment involves three steps.

Table 1 Definitions of the measurement properties
Measurement error: The systematic and random error of a person's score on the PROM that is not attributed to changes in the construct to be measured
Criterion validity: The extent to which the scores of a PROM reflect the scores of a test or measure considered to be the 'gold standard'
Hypothesis testing for construct validity: The extent to which the scores of a PROM are consistent with hypotheses, for example with regard to internal relationships, relationships to scores of other instruments or differences between relevant groups. It is based on the assumption that the PROM is a valid measure of the construct
Responsiveness: The ability of a PROM to detect change, as expected, over time in the construct to be measured when there is a true change in a person's condition or treatment
Cross-cultural validity: The extent to which the measurement properties of the translated or culturally adapted PROM reflect the performance of the original version of the PROM

STEP 3: Evaluate the content validity of the PROM
• 3a: PROM development and content validity studies are rated individually on ten COSMIN criteria for content validity. Reviewers also provide ratings: sufficient (+), insufficient (−), inconsistent (±) or indeterminate (?).
• 3b: The ratings from 3a are combined, producing an OVERALL rating for relevance, comprehensiveness and comprehensibility, and for content validity overall, of sufficient (+), insufficient (−), inconsistent (±) or indeterminate (?).
• 3c: The ratings produced in 3b are accompanied by a grading of evidence quality ('high', 'moderate', 'low' or 'very low') using a modified GRADE approach, taking the lowest rating for any item for that study (i.e. worst score counts) [22].
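The pooling in Step 3b can be sketched as a small function. This is a simplified illustration, not the full COSMIN rule set: the rules below (set indeterminate sources aside when any determinate rating exists, carry unanimous ratings through, and call any mix 'inconsistent') are assumptions, and the COSMIN user manual covers cases this sketch ignores.

```python
RATINGS = {"sufficient", "insufficient", "inconsistent", "indeterminate"}


def overall_rating(ratings):
    """Pool per-source ratings into one overall rating (simplified Step 3b sketch).

    Assumed rules: 'indeterminate' sources are set aside when any determinate
    rating exists; unanimous determinate ratings carry through; any mix of
    determinate ratings gives 'inconsistent'.
    """
    if not ratings or set(ratings) - RATINGS:
        raise ValueError("expected a non-empty list of COSMIN rating labels")
    determinate = [r for r in ratings if r != "indeterminate"]
    if not determinate:
        return "indeterminate"
    if len(set(determinate)) == 1:
        return determinate[0]
    return "inconsistent"


print(overall_rating(["sufficient", "indeterminate", "sufficient"]))  # sufficient
print(overall_rating(["sufficient", "insufficient"]))                 # inconsistent
```

Making the combination rule explicit like this also shows where a combination falls outside the written guidance, which is the situation reported for two PROMs in the Results.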
Step 3 consists of three sub-stages.
Step 3a incorporates reviewer ratings of the identified PROMs, whereby reviewers consider relevance, comprehensiveness and comprehensibility. We sought ratings from three key stakeholder groups: (1) researchers (including those with expertise in systematic reviewing, QoL research and psychological aspects of diabetes) (n = 6); (2) clinicians (n = 6); and (3) adults with diabetes (n = 4), including two representatives of the Hypo-RESOLVE Patient Advisory Committee (PAC). All reviewers independently rated the PROMs against ten criteria: (1) the construct of interest (i.e. does the PROM include items that are relevant in measuring the impact of hypoglycaemia on QoL?); (2) the population of interest; (3) the context of use of interest (i.e. is the PROM suitable for use in research and/or clinical practice?); (4) the appropriateness of response options; (5) the appropriateness of the recall period; (6) the comprehensiveness (i.e. does the PROM assess the impact of hypoglycaemia on QoL as a whole, or only on select domains of QoL?); (7) the suitability/clarity of the PROM instructions; (8) whether PROM items and response options are understandable; (9) the appropriateness of PROM item wording; and (10) the extent to which response options are appropriate to the question being asked. A majority rating was determined for each group (researchers, clinicians and adults with diabetes). The group ratings were then consolidated to produce an overall reviewer rating for each PROM. Table 2 details how relevance, comprehensiveness and comprehensibility were assessed.
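The two-stage aggregation of reviewer ratings (a majority within each stakeholder group, then consolidation across groups) can be sketched as below. The review does not specify how ties within a group or disagreements between groups were handled, so the tie handling and the consolidation rule here are hypothetical; the group sizes match those reported, but the ratings are invented.

```python
from collections import Counter


def group_majority(ratings):
    """Most frequent rating in one stakeholder group, or None if the top ratings tie."""
    if not ratings:
        raise ValueError("a group needs at least one rating")
    ranked = Counter(ratings).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority within this group
    return ranked[0][0]


def consolidate(groups):
    """Hypothetical consolidation rule: if every group's majority agrees, keep it;
    otherwise (including any within-group tie) report 'inconsistent'."""
    majorities = {name: group_majority(r) for name, r in groups.items()}
    distinct = set(majorities.values())
    if len(distinct) == 1 and None not in distinct:
        return distinct.pop(), majorities
    return "inconsistent", majorities


overall, by_group = consolidate({
    "researchers": ["sufficient"] * 4 + ["insufficient"] * 2,       # n = 6
    "clinicians": ["sufficient"] * 3 + ["insufficient"] * 3,        # n = 6, tied
    "adults with diabetes": ["sufficient"] * 3 + ["insufficient"],  # n = 4
})
print(overall)  # inconsistent
```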
Step 3b involves summarising the results of all available studies to provide an overall rating of relevance, comprehensiveness and comprehensibility, and an overall content validity rating, each of 'sufficient', 'insufficient', 'inconsistent' or 'indeterminate'. Finally, in Step 3c, the overall ratings determined in Step 3b are accompanied by a grading of the quality of the evidence using a modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach [29], under which the quality of evidence is graded as 'high', 'moderate', 'low' or 'very low'. The GRADE approach considers five factors when judging the quality of the evidence: risk of bias, inconsistency, indirectness, imprecision and publication bias [29]. The resultant evaluation of content validity includes an overall rating of + ('satisfactory'), − ('unsatisfactory'), ± ('inconsistent') or ? ('indeterminate'), together with a grading of the quality of the evidence supporting that rating ('high', 'moderate', 'low' or 'very low'). A worked example of content validity rating and scoring is shown in Table 2. Detailed information on the rating process and the COSMIN methodology applied is reported elsewhere [28].

Assessment of other psychometric properties
Table 1 defines each of the psychometric properties assessed. As for content validity, a COSMIN rating was determined for each measurement property by assessment against the COSMIN criteria, using the same rating scale ('sufficient', 'insufficient', 'inconsistent' or 'indeterminate'), and the quality of the evidence was graded using the GRADE approach. This results in a rating of + ('satisfactory'), − ('unsatisfactory'), ± ('inconsistent') or ? ('indeterminate'), together with a grading of the quality of the evidence supporting each measurement property rating ('high', 'moderate', 'low' or 'very low'). Full information on the COSMIN methodology applied in this review is reported elsewhere [23].
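The GRADE-style grading of evidence quality can be pictured as a downgrading ladder: evidence starts at 'high' and drops one level for each level of concern a reviewer assigns to a factor. This is a minimal sketch, assuming each factor's cost is entered as a whole number of levels. The numbers passed in stand for reviewer judgements, and the detailed COSMIN rules for how much each factor may cost are not reproduced.

```python
LEVELS = ("high", "moderate", "low", "very low")
FACTORS = ("risk of bias", "inconsistency", "indirectness", "imprecision", "publication bias")


def grade_evidence(downgrades):
    """Start at 'high' and move down one level per downgrade step, stopping at 'very low'.

    `downgrades` maps GRADE factors to the number of levels a reviewer judged
    each should cost; factors left out cost nothing.
    """
    unknown = set(downgrades) - set(FACTORS)
    if unknown:
        raise ValueError(f"unknown GRADE factor(s): {sorted(unknown)}")
    if any(steps < 0 for steps in downgrades.values()):
        raise ValueError("downgrade steps must be non-negative")
    total = sum(downgrades.values())
    return LEVELS[min(total, len(LEVELS) - 1)]


print(grade_evidence({"risk of bias": 2}))                    # low
print(grade_evidence({"risk of bias": 2, "imprecision": 2}))  # very low
```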
Quality assurance of the review The quality of this review was assessed against a COSMIN checklist that was designed to evaluate the quality of systematic reviews of PROMs [30] (ESM Table 1).

Results
The search returned a total of 3661 unique records, from which 214 PROMs were identified that had been used in studies assessing the impact of hypoglycaemia on QoL or subdomains of QoL (Fig. 2, Table 3). Of these, 17 PROMs were initially identified as hypoglycaemia-specific and considered for this review; nine were subsequently excluded following further scrutiny of the instruments. PROMs were excluded if they were: hypoglycaemia symptom measures that assessed attitudes, awareness and/or attitudes to awareness of symptoms (n = 3); related to specific treatments (n = 2); only a subscale of an overall PROM (n = 2); or not available for full inspection (n = 2). Consequently, the current review includes eight hypoglycaemia-specific PROMs that have been used to assess the impact of hypoglycaemia on QoL or at least one aspect of QoL: the Fear of Hypoglycemia 15-item scale (FH-15); the Hypoglycemia Fear Survey (HFS); the Hypoglycemia Fear Survey version II (HFS-II); the HFS-II short-form; the Hypoglycemic Attitudes and Behavior Scale (HABS); the Hypoglycemic Confidence Scale (HCS); the QoLHYPO questionnaire and the Treatment-Related Impact Measure-Non-severe Hypoglycemic Events (TRIM-HYPO) (Table 4).
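As a quick consistency check on the selection figures above, the exclusion reasons add up to the nine excluded instruments, leaving the eight PROMs reviewed. The category labels are shortened paraphrases of the text; the counts are the ones reported.

```python
# Counts as reported in the text; labels are shortened paraphrases.
candidates = 17
excluded = {
    "symptom/awareness measures": 3,
    "related to specific treatments": 2,
    "subscale of an overall PROM only": 2,
    "not available for full inspection": 2,
}
included = candidates - sum(excluded.values())
print(sum(excluded.values()), included)  # 9 8
```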
Overall COSMIN assessment of PROMs The overall results of the COSMIN assessment are shown in Table 5. There are considerable evidence gaps for the measurement properties of most of the PROMs. The HFS-II, QoLHYPO and TRIM-HYPO were the only instruments that could be rated across all the measurement properties.

Table 2 COSMIN criteria and rating system for evaluating the content validity of the PROMs (adapted from Terwee et al [28])

Table 2 summarises the key characteristics and COSMIN quality assessment of the PROM … [32]. For the QoLHYPO, adults with diabetes were asked about the comprehensibility, but not the relevance, of the PROM ('doubtful' COSMIN quality rating) [33]. During the development of the TRIM-HYPO, adults with diabetes were asked about the comprehensibility and relevance of the PROM, but not about its comprehensiveness ('doubtful' COSMIN quality rating) [34]. Aside from the development studies, no further studies were identified that independently assessed the content validity of the PROMs. ESM Table 4 details the consensus ratings for the three groups of reviewers (researchers, clinicians, people living with diabetes), and an overall reviewer consensus rating for each PROM. FH-15 had an overall reviewer rating of 'sufficient'; HFS-II, HABS, HCS and TRIM-HYPO were rated as 'inconsistent'. For two of the PROMs (HFS and QoLHYPO), the relevance, comprehensiveness and comprehensibility ratings resulted in a combination for which COSMIN guidance is not explicit; thus, an overall rating could not be applied [28].

Structural validity Twelve studies assessed the structural validity of the PROMs, all of which were reported in the development papers (ESM Table 5). No independent assessments of structural validity were identified. Four studies examined the structural validity of a cultural adaptation/language translation of the HFS [35–38]. A further study assessed the structural validity of the short-form of the HFS-II [31]. COSMIN quality ratings for the HFS-Norwegian, HFS-Singapore and HFS-II short-form were 'very good', and ratings were 'adequate' for the remaining PROMs. The same principles as noted above were applied to assess the quality of the evidence for these instruments. The quality of evidence for the HFS-Norwegian, HFS-Singapore and HFS-II short-form instruments was assessed as 'high', and that for the HFS-Spanish, HFS-Swedish and TRIM-HYPO instruments as 'moderate'. Many of the studies reported exploratory factor analysis (EFA) rather than the confirmatory factor analysis (CFA) required to receive a 'satisfactory' rating. Those studies reporting CFA (language versions of the HFS) did so to examine whether the expected two-factor structure (observed for the original HFS) fitted their dataset. However, they all rejected this a priori-defined structure and therefore went on to explore the latent structure of the tool using EFA.
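To show why EFA-only studies could not reach a 'satisfactory' rating, here is a sketch of a COSMIN-style structural validity rating under classical test theory. It uses the cut-offs as we understand the COSMIN criteria for good measurement properties (CFI or TLI > 0.95, RMSEA < 0.06 or SRMR < 0.08). Treating EFA-only evidence, or a CFA reported without fit indices, as 'indeterminate' is an assumption made for this sketch. The fit values in the example are invented.

```python
def rate_structural_validity(method, cfi=None, tli=None, rmsea=None, srmr=None):
    """Sketch of a COSMIN-style structural validity rating (classical test theory).

    'sufficient' if any reported CFA fit index meets its cut-off
    (CFI or TLI > 0.95, RMSEA < 0.06, SRMR < 0.08); 'insufficient' if indices
    are reported but none meets its cut-off. EFA-only evidence, or CFA without
    fit indices, is returned as 'indeterminate' (an assumption of this sketch).
    """
    if method != "CFA":
        return "indeterminate"
    cut_offs = (
        (cfi, lambda v: v > 0.95),
        (tli, lambda v: v > 0.95),
        (rmsea, lambda v: v < 0.06),
        (srmr, lambda v: v < 0.08),
    )
    met = [passes(value) for value, passes in cut_offs if value is not None]
    if not met:
        return "indeterminate"
    return "sufficient" if any(met) else "insufficient"


# Invented fit values for a translation whose CFA rejected the two-factor model.
print(rate_structural_validity("CFA", cfi=0.88, rmsea=0.09))  # insufficient
```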
Reliability (test-retest) Seven studies were identified that assessed the test-retest reliability of a PROM. Four of these were conducted by the instrument developers (FH-15, HFS-II, QoLHYPO and TRIM-HYPO); the remaining studies assessed language versions of the HFS instrument (ESM Table 7). Four studies had an 'adequate' COSMIN quality rating [32,33,36,40], two had a 'very good' rating [37,38] and one had a 'doubtful' rating [34].
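Test-retest reliability is usually summarised with an intraclass correlation coefficient (ICC). The sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single measurement) from the standard two-way ANOVA mean squares and applies the COSMIN cut-off of ICC ≥ 0.70 for a sufficient rating. The review does not state which ICC form each study used, so this choice of form is an assumption, and the scores are invented.

```python
from statistics import fmean


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `scores` holds one row per participant, each row containing that
    participant's total score at every administration (e.g. test, retest).
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 participants with >= 2 complete administrations")
    grand = fmean(x for row in scores for x in row)
    row_means = [fmean(row) for row in scores]
    col_means = [fmean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between administrations
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)


def rate_test_retest(icc, threshold=0.70):
    """COSMIN-style rating: 'sufficient' when the ICC reaches the threshold."""
    return "sufficient" if icc >= threshold else "insufficient"


# Invented scores: every participant scores 2 points higher at retest.
print(round(icc_2_1([[10, 12], [20, 22], [30, 32]]), 3))  # 0.98
```

Because ICC(2,1) measures absolute agreement, the uniform 2-point shift at retest pulls the coefficient slightly below 1, whereas a consistency-type ICC would ignore it.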
Other psychometric properties No studies were found to demonstrate evidence for cross-cultural validity, measurement error, criterion validity or responsiveness.

Discussion
This systematic review has used COSMIN methodology to summarise and critically evaluate published evidence on the psychometric characteristics of PROMs used to assess the impact of hypoglycaemia on QoL in adults with diabetes. Our intention was to provide an evidence base, grounded in the robust and comprehensive consensus-based COSMIN criteria, that would help researchers and clinicians when selecting PROMs. We identified eight PROMs that had been developed to assess the subjective impact of hypoglycaemia on QoL or a subdomain of QoL. None of the PROMs included in this review had a 'high' rating for content validity (in relation to assessing the impact of hypoglycaemia on QoL), which is arguably the most important measurement property of a PROM [28,44]. All had 'inconsistent' COSMIN ratings for content validity, but the quality of the evidence supporting those ratings was greater for the HFS and QoLHYPO. There is therefore some support for recommending the HFS and QoLHYPO instruments for use in research studies and/or clinical practice. However, it is important to acknowledge the conceptual frameworks from which these two instruments were developed, and how these diverge from our operationalisation of QoL (i.e. multidimensional, subjective and changing over time). The HFS was developed to measure fear of hypoglycaemia through two subscales: behaviour and worry. Fear is arguably a very specific aspect of the psychological subdomain of QoL. Furthermore, the developers were not explicit in describing the target population for the instrument (i.e. their sample included people with 'insulin-dependent' diabetes, but it is unclear whether this included people with type 1 and/or type 2 diabetes, and whether the instrument is also applicable to people who manage their diabetes without insulin but experience hypoglycaemia). While the content of the QoLHYPO instrument includes items that assess various domains of QoL (e.g. social relationships, mood, daily activities), it was designed for use only by people with type 2 diabetes. Furthermore, there have been no translations beyond the original Spanish version. Consequently, the format and layout of the QoLHYPO are not clear for English-speaking researchers, and the developers provide no information on domains. Further investigation would be required to determine the suitability of the QoLHYPO instrument for measuring the impact of hypoglycaemia in people with type 1 diabetes and in other language groups.

We have included details of the psychometric properties of the PROMs identified in the original literature search. However, it is plausible that other papers (particularly intervention studies) have also reported psychometric properties for one or more of the included PROMs. The information on measurement properties reported here should therefore not be considered exhaustive.

We did not adopt the approach taken by some of the PROM authors of treating HbA1c as the 'gold standard' in the assessment of criterion validity and the criterion approach to responsiveness. Studies have shown that HbA1c is not a reliable indicator of whether an individual experiences hypoglycaemia [45,46], nor a surrogate for QoL [47] or for the impact or burden of hypoglycaemia. Advances in glucose monitoring technologies are continually changing our understanding of diabetes and are contributing to a better understanding of the lived experience of diabetes and hypoglycaemia. Consequently, it may be appropriate in future studies to consider 'time in range' or 'time in hypoglycaemia' as a marker for the impact of hypoglycaemia on QoL, but the extent to which these metrics reflect the subjective experience has yet to be elucidated. In the absence of an agreed 'gold standard', it is not possible to assess criterion validity, or to apply the criterion approach to responsiveness, for any PROM.
In this systematic review, we followed the robust and comprehensive guidance developed by the COSMIN initiative [23,28]. However, this guidance is not without limitations. The assessment of the content validity and psychometric performance of PROMs is based on the lowest rating of any standard in the criteria (i.e. the 'worst score counts' principle) [22,28]. This means that a study could be rated as 'very good' or 'good' on all but one criterion, yet a single 'doubtful' or 'inadequate' rating would reduce its overall score to 'doubtful' (or 'inadequate'). The omission of one key component in reporting (such as whether interviews were recorded and transcribed verbatim) can therefore result in a lower overall content validity rating, which could be argued to be overly harsh and should be recognised as a limitation of the COSMIN approach. Where appropriate within this review, we consistently rated in favour of the PROM (rather than assuming the worst). We identified another limitation in the COSMIN guidance for determining content validity ratings of studies: it gives no information on how to determine an overall content validity rating for some of the combinations of ratings we obtained. We have documented our approach; however, if the review were to be replicated, others may opt to downgrade the overall content validity rating. Furthermore, as part of the content validity assessment, we sought to include the opinions of stakeholders, but the COSMIN guidance does not advise on how to reconcile ratings when opinions conflict between or within stakeholder groups.
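The 'worst score counts' principle discussed above reduces to taking the minimum over ordered quality levels. This minimal sketch uses the four COSMIN quality levels that appear in the Results ('very good', 'adequate', 'doubtful', 'inadequate'). Skipping standards marked 'NA' (not applicable) is an assumption about how such standards are handled. The example reproduces the scenario in the text: nine top ratings and one 'doubtful' yield 'doubtful' overall.

```python
# COSMIN methodological-quality levels, from best to worst.
QUALITY_ORDER = ("very good", "adequate", "doubtful", "inadequate")


def worst_score_counts(standard_ratings):
    """Overall study quality = the lowest rating given on any applicable standard.

    Standards marked 'NA' (not applicable) are skipped.
    """
    applicable = [r for r in standard_ratings if r != "NA"]
    if not applicable:
        raise ValueError("no applicable standards were rated")
    return max(applicable, key=QUALITY_ORDER.index)


ratings = ["very good"] * 9 + ["doubtful"]
print(worst_score_counts(ratings))  # doubtful
```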
It should be noted that some of the PROMs included in this review are legacy or 'first generation' measures; that is, they were developed at a time when there were no international standards for instrument development, so development methods were either not reported or were reported selectively or in little detail. Moreover, the way in which PROMs are developed has changed over time [27], and it is now more common to report the methodological steps undertaken during instrument development. The COSMIN ratings should therefore be interpreted with a degree of caution: they do not provide evidence that instrument development was not rigorous or that the instruments are not 'fit for purpose', but rather expose an absence of key evidence.
While published studies report that hypoglycaemia negatively affects QoL [1–8], we found that the hypoglycaemia-specific PROMs used in such studies lack adequate evidence of reliability and validity for this specific purpose. Thus, the current literature on the impact of hypoglycaemia on QoL is limited (if not flawed) and needs to be interpreted with caution. Given that the content validity of the instruments was lacking, it is plausible that hypoglycaemia affects individuals in ways that are not currently being measured. The items within the instruments may no longer be relevant (e.g. due to changes in diabetes treatments, monitoring, society or language use), or may not be comprehensive enough to capture fully the ways in which hypoglycaemia affects adults in the modern world.
In conclusion, none of the PROMs identified had sufficient evidence to demonstrate satisfactory content validity, i.e. there is insufficient evidence that they adequately assess the impact of hypoglycaemia on QoL in adults living with diabetes. Furthermore, most were also limited in their published evidence of reliability, validity and responsiveness. There is an urgent need to follow contemporary guidance [27,48–50] to develop new instruments that can assess the impact of hypoglycaemia on QoL.