Introduction

With the introduction of antiretroviral therapy (ART), the life expectancy of people living with HIV (PLWH) has been prolonged. However, HIV, ART, infectious diseases, comorbidities, and premature aging pose challenges to the health-related quality of life (HRQoL) of PLWH. HRQoL can be defined as one’s perceived functioning in the physical, emotional, psychological, and social domains of health [1]. Alternatively, HRQoL was defined by Torrance as a concept incorporating factors that are part of an individual’s health [2]. HRQoL is currently regarded as the health aspect of quality of life (QoL); nonhealth aspects, including economic and political circumstances, are not included in HRQoL. Achieving a high level of HRQoL has become an important issue and a component of HIV/AIDS care [3]. In 2016, Lazarus and colleagues proposed adding a fourth “90” to the existing “90–90–90” target [4, 5]: that 90% of PLWH with viral load suppression have good HRQoL. According to the World Health Organization's 90–90–90–90 goals, improving the HRQoL of PLWH is the ultimate goal of HIV/AIDS treatment and care [6, 7]. However, which measures are the most suitable for assessing HRQoL is still under debate.

Many HIV-specific and generic HRQoL patient-reported outcome measures (PROMs) have been validated in different contexts. As one of the earliest HIV-specific HRQoL PROMs, the Medical Outcomes Study HIV Health Survey (MOS-HIV) is the most commonly used measure [8]. The MOS-HIV consists of 35 items and 10 dimensions, including general health perceptions, physical functioning, role functioning, pain, social functioning, mental health, energy, health distress, cognitive functioning, and overall self-rated quality of life. In addition to the MOS-HIV, other HIV-specific HRQoL PROMs are also widely used, including the WHOQoL-HIV-BREF [9], Multidimensional Quality of Life Questionnaire for Persons with HIV/AIDS (MQoL-HIV) [10], HIV Disease Quality of Life 31-Item Instrument (HIV-QL31) [11], and Patient-Reported Outcomes Quality of Life–HIV instrument (PROQoL-HIV) [12]. Additionally, validated subscales or scales with over 40 items, such as the World Health Organization Quality of Life-HIV (WHOQoL-HIV) [13], HIV Overview of Problems Evaluation Scale (HOPES) [14], Functional Assessment of HIV Infection (FAHI) [15], HIV/AIDS Targeted Quality of Life (HAT-QoL) [16], and HIV/AIDS Quality of Life Questionnaire (HIV/AIDSQoL) [17], are also used to evaluate HRQoL. In addition to HIV-specific PROMs, some generic PROMs, including the Short Form Health Survey (SF-12, SF-36) [18, 19], EuroQol-5 Dimensions (EQ-5D) [20, 21], World Health Organization Quality of Life assessment (WHOQoL) [22], Medical Outcomes Study Health Survey (MOS) [23], Missoula-Vitas Quality-of-Life Index (MVQOLI) [24], Patient-Reported Outcomes Measurement Information System (PROMIS) [25], Health Assessment Questionnaire Disability Index (HAQ-DI) [26], Quality of Well-Being scale (QWB) [27], and Health Utility Index 3 (HUI3) [28], have been validated and used in the PLWH population globally. The advantage of generic HRQoL PROMs is that researchers can directly compare results with those of other populations without standardizing the data. However, for PLWH, generic PROMs may not be as sensitive as specific PROMs in assessing HIV-specific dimensions of HRQoL, such as stigma, relationship issues, and comorbidities [29].

A preliminary literature search was conducted in PubMed, PsycINFO (EBSCO), Cochrane Library (Wiley) and JBI (Ovid), and many reviews on measures of HRQoL were found. Cooper et al. [29] briefly summarized PROMs with fewer than 40 items for measuring HRQoL in PLWH and found that the MOS-HIV was the most well-established measure. The WHOQoL-HIV-BREF and PROQoL-HIV were considered to have good psychometric properties and to potentially have more relevance to PLWH than other PROMs. However, the study included only instruments that can be completed within 10 min or that have fewer than 40 items. Additionally, the assessment process of psychometric properties was not systematic enough to provide a concrete conclusion. Clayson et al. [30] conducted reviews with similar aims but in a specific context (in clinical trials and in sub-Saharan Africa) in 2006 and 2010, respectively. Gakhar et al. conducted a nonsystematic review of the literature on quality of life assessment after ART in developed countries in 2013 [31].

However, previous reviews have mainly focused on the content of HRQoL PROMs and have not systematically reported their psychometric properties, which has made it difficult for healthcare professionals to select one of the existing PROMs to evaluate HRQoL in research and clinical practice [29,30,31]. Accurate and reliable PROMs are a prerequisite for obtaining robust results. It is critical to choose an acceptable PROM with good psychometric properties [32]. Therefore, to obtain reliable evidence regarding the psychometric properties of HRQoL PROMs, we conducted a systematic review to identify and assess the psychometric properties of PROMs of HRQoL in PLWH. The findings may provide a scientific basis for researchers to choose PROMs for measuring HRQoL in PLWH in future research and clinical practice.

Methods

Aims and design

The aim of this study was to identify and assess the psychometric properties of PROMs of HRQoL in PLWH. This systematic review was performed with the guidance of the Joanna Briggs Institute (JBI) methodology for systematic review of psychometric properties and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (Additional file 1: PRISMA) statement. The protocol of our review was published in JBI Evidence Synthesis [33].

Search strategy

We conducted a three-step search. First, a limited search was conducted in PubMed to develop search strategies tailored to each database. Second, researchers implemented the search strategies in PubMed, MEDLINE (Ovid), EMBASE (Ovid), CINAHL (EBSCO), Web of Science, ProQuest Dissertations and Theses, Cochrane Library (Wiley), CNKI, and WanFang. The databases were searched for studies published from 1 January 1996 to 1 May 2020. We chose 1996 as the start date because ART was first used in that year. Google Scholar and Baidu Scholar were searched for gray literature. We used MeSH terms ([“HIV” OR “Acquired Immunodeficiency Syndrome”] AND “Quality of Life”) combined with ([HIV OR AIDS OR “acquired immunodeficiency syndrome”] AND “quality of life” AND “COSMIN search filter”). Additional file 2: Appendix I lists the search strategies used for all databases. Finally, we manually reviewed all references included during the supplemental searches.

Inclusion and exclusion criteria

The inclusion criteria were as follows: (1) studies that targeted HIV-positive adults (≥ 18 years old); (2) studies of any type of self-reported measure, including but not limited to self-management questionnaires, that aimed to measure HRQoL among PLWH; (3) validation studies or studies that aimed to develop PROMs or assess one or more measurement properties; and (4) studies published in either English or Chinese. The exclusion criteria were as follows: (1) studies that aimed to validate measures assessing only a certain domain of HRQoL related to specific comorbidities or treatment side effects and (2) studies that provided only indirect evidence of psychometric properties (e.g., comparing one PROM with another instrument).

Study screening and selection

We imported all references identified in the search into EndNote X8 (Clarivate Analytics, PA, USA). After the removal of duplicates, two researchers (HW & ZY) independently screened the titles, abstracts, and full texts to assess whether the studies met the eligibility criteria. Any discrepancies were resolved by a third researcher (ZZ). The reasons for excluding studies at the full-text screening stage were recorded.

Quality appraisal

Two reviewers (HW & ZY) independently assessed the included studies using the COSMIN Risk of Bias Checklist. Discrepancies were resolved by a third reviewer (ZZ). The COSMIN Risk of Bias Checklist consists of 10 domains (38 items): PROM development, content validity, structural validity, hypothesis testing for construct validity, cross-cultural validity/measurement invariance, criterion validity, internal consistency, measurement error, test-retest reliability, and responsiveness. Each item is rated as “very good”, “adequate”, “doubtful”, or “inadequate”. The methodological quality of a study for each domain was determined by the “worst score counts” principle, i.e., by the lowest rating given to any item in that domain.
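The “worst score counts” rule can be illustrated with a minimal Python sketch. The names `Quality` and `worst_score_counts` and the example ratings are hypothetical and were chosen for illustration only; they are not part of the COSMIN materials or of this review's data.

```python
from enum import IntEnum


class Quality(IntEnum):
    """The four item-level rating options, ordered from worst to best."""
    INADEQUATE = 1
    DOUBTFUL = 2
    ADEQUATE = 3
    VERY_GOOD = 4


def worst_score_counts(item_ratings):
    """Return the lowest item rating within one checklist domain."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings)


# Hypothetical item ratings for one domain in one study.
example_domain = [Quality.VERY_GOOD, Quality.ADEQUATE, Quality.DOUBTFUL]
domain_quality = worst_score_counts(example_domain)
print(domain_quality.name)  # DOUBTFUL
```

Under this rule, a single weak item determines the rating of the whole domain, which makes the appraisal deliberately conservative.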

Data extraction and synthesis

Two researchers (HW & ZY) independently extracted information, including the author, publication year, country/language, study design, target population, sample size, measurement domains, number of items, and total score range. The main findings regarding psychometric properties included construct validity, internal consistency, cross-cultural validity/translation, criterion validity, and reliability. Any discrepancies were resolved through discussion between the two researchers.

We used the COSMIN criteria to summarize and rate the psychometric properties reported in each study regarding structural validity, internal consistency, reliability, measurement error, hypothesis testing for construct validity, cross-cultural validity/measurement invariance, criterion validity, and responsiveness. Each measurement property was rated as sufficient (+), insufficient (−), or indeterminate (?). We used narrative synthesis to synthesize the data for each measurement property. When the ratings across studies were consistent, the overall rating followed them: if all studies were rated as sufficient (+), the overall rating of the measurement property was sufficient (+), and if all studies were rated as insufficient (−), the overall rating was insufficient (−). If the ratings across studies were inconsistent, we explored possible explanations (e.g., different languages). If a reasonable explanation was found, we provided ratings by subgroup; if not, the overall rating of the measurement property was inconsistent (±). If there was no information to support a rating, the overall rating was indeterminate (?).
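The pooling rules described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and example ratings are hypothetical, ASCII "-" stands in for the minus sign, and the handling of indeterminate study-level ratings (they are set aside before pooling) is an assumption that the text does not state.

```python
def overall_rating(study_ratings):
    """Pool per-study ratings ('+', '-', '?') for one measurement property.

    Assumption (not stated in the review): indeterminate '?' ratings are
    set aside, and the remaining ratings decide the overall result.
    """
    unexpected = set(study_ratings) - {"+", "-", "?"}
    if unexpected:
        raise ValueError(f"unexpected ratings: {sorted(unexpected)}")
    informative = [r for r in study_ratings if r != "?"]
    if not informative:
        return "?"   # no information to support a rating
    if all(r == "+" for r in informative):
        return "+"   # consistently sufficient
    if all(r == "-" for r in informative):
        return "-"   # consistently insufficient
    return "±"       # inconsistent and unexplained


def rating_by_subgroup(ratings_by_subgroup):
    """Rate each subgroup separately when an explanation (e.g., language) is reasonable."""
    return {name: overall_rating(r) for name, r in ratings_by_subgroup.items()}


# Hypothetical ratings for two language versions of one instrument.
by_language = {"version A": ["+", "+"], "version B": ["-", "-"]}
pooled = [r for ratings in by_language.values() for r in ratings]
print(overall_rating(pooled))           # ±
print(rating_by_subgroup(by_language))  # {'version A': '+', 'version B': '-'}
```

Whether an explanation for inconsistency is "reasonable" remains a reviewer judgment; the sketch only shows what happens once that judgment has been made.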

Assessment of the certainty of the evidence

We used a modified Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) system to assess the certainty of the evidence. Each piece of evidence was graded for risk of bias, inconsistency, imprecision, and indirectness. Four reviewers (HW, ZY, ZZ, and SH) graded each measurement property and each PROM separately. Discrepancies were resolved by a fifth reviewer (YH). Based on the methodological quality of each psychometric property, the four reviewers then classified the instruments as strongly recommended, weakly recommended, or not recommended according to the modified GRADE system. The classification results were verified by all authors.
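As a rough illustration of how grading on these four factors can produce a certainty level, the Python sketch below assumes that evidence starts at “high” and drops one level per downgrade, with “very low” as the floor. The review does not spell out these steps, so the starting level, the step size, and the name `grade_certainty` are all assumptions made for illustration.

```python
LEVELS = ["very low", "low", "moderate", "high"]
FACTORS = {"risk of bias", "inconsistency", "imprecision", "indirectness"}


def grade_certainty(downgrades):
    """Map downgrade counts per factor to a certainty level.

    `downgrades` is a dict such as {"risk of bias": 1}; factors that are
    not listed count as zero downgrades. Starting at "high" is an assumption.
    """
    unknown = set(downgrades) - FACTORS
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    if any(n < 0 for n in downgrades.values()):
        raise ValueError("downgrade counts must be non-negative")
    total = sum(downgrades.values())
    index = max(0, len(LEVELS) - 1 - total)  # never drop below "very low"
    return LEVELS[index]


# Hypothetical example: one level down for risk of bias.
print(grade_certainty({"risk of bias": 1}))  # moderate
```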

Results

Literature search

The literature screening and selection process is shown in Fig. 1. In the initial search, a total of 13,371 articles were identified in the databases. Twenty-one articles were found through additional supplementary searches. After the removal of duplicates, a total of 10,097 articles were retained, and 10,028 articles were excluded after review of the titles, abstracts, and full texts. We finally included 69 articles [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28, 34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]. A total of 30 PROMs were investigated in the included studies.

Fig. 1 Flowchart of the identification and selection of studies
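The screening counts reported in the text can be checked with a few lines of Python. The sketch assumes that the 21 supplementary records were pooled with the database records before deduplication; the number of duplicates is derived under that assumption and is not reported in the text.

```python
# Counts reported in the text.
database_records = 13_371
supplementary_records = 21
after_deduplication = 10_097
excluded_at_screening = 10_028

# Derived values (the duplicate count rests on the pooling assumption above).
identified = database_records + supplementary_records
duplicates_removed = identified - after_deduplication
included = after_deduplication - excluded_at_screening

print(identified, duplicates_removed, included)  # 13392 3295 69
```

The final figure matches the 69 included articles.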

Study description

Among the 69 included articles, 54 were in English, and 15 were in Chinese; the articles were published from 1996 to 2019. A description of the studies is shown in Table 1. All the included studies were cross-sectional studies. Twenty studies were conducted in China [17, 22, 36,37,38,39,40, 57,58,59,60,61,62, 74, 77,78,79,80,81,82], fourteen in the United States [15, 16, 21, 25,26,27, 35, 42, 64, 65, 67, 72, 73, 75], three in Uganda [24, 41, 46], three in Italy [44, 49, 69], two in Australia [70, 71], two in Vietnam [20, 55], two in Portugal [52, 75], and two in Canada [28, 66]. A total of 28,480 participants were included, with sample sizes ranging from 50 to 1923 [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28, 34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]. One study was conducted with adult males [35]. Four studies were conducted with HIV-positive women [41, 42, 65, 66]. One study was conducted with HIV-infected patients aged 50 years and older [52], and two studies were conducted with people with advanced AIDS [24, 28]. One study involved transgender male, transgender female, and genderqueer individuals [25]. One study was conducted in patients with HIV-related opportunistic infections [47].

Table 1 Overview of the included studies

The characteristics of all 30 HRQoL PROMs, including the items, domains, and score range, are shown in Table 1. The total number of items ranged from 8 to 142 [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28, 34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]. A total of 10 PROMs had multiple language versions, and the remaining 18 had only one language version. Tables 4 and 5 summarize the psychometric properties of the HIV-specific and generic instruments.

Quality assessment

Methodological quality assessment

Tables 2 and 3 show the methodological quality of the 69 included studies based on the COSMIN checklist. All studies were considered to have sufficient methodological quality for further study. Table 2 presents an overview of the COSMIN ratings of the HIV-specific instruments, and Table 3 presents those of the generic instruments. Limited information was retrieved on cross-cultural validity/translation (58 studies) [11,12,13,14, 16,17,18,19,20,21,22,23, 25,26,27,28, 35,36,37,38,39,40, 42,43,44, 47, 48, 50,51,52,53,54,55,56, 58,59,60,61,62,63,64, 66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82], criterion validity (59 studies) [9,10,11,12, 15,16,17, 19,20,21,22,23,24,25,26, 34, 37,38,39,40,41,42,43,44,45,46,47,48,49,50, 52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67, 69,70,71,72,73,74,75,76,77,78,79, 81, 82], reliability (49 studies) [11, 13,14,15,16,17,18,19,20,21, 23,24,25,26,27,28, 34,35,36, 38, 39, 41,42,43,44,45,46,47, 49,50,51,52,53,54,55, 57, 59, 62,63,64,65, 68, 69, 72,73,74,75,76], hypothesis testing (18 studies) [11, 16, 17, 34, 38, 39, 41, 53, 61, 67, 68, 71, 77,78,79,80,81,82], and responsiveness (62 studies) [9,10,11,12,13,14,15,16, 18,19,20, 22,23,24,25,26,27, 34,35,36,37,38,39,40,41,42,43,44,45, 47,48,49,50,51,52,53,54,55,56,57, 59,60,61,62,63,64, 66, 68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]. No data were identified on measurement error or interpretability.

Table 2 Methodological quality assessment of the HIV-specific instruments
Table 3 Methodological quality assessment of the generic instruments

Quality of measurement properties of assessments

Table 4 presents the quality of the psychometric properties retrieved from the 69 included studies for all 30 measures. Fifteen studies were rated as insufficient (−) for content validity [11, 17, 48, 49, 53, 57, 59,60,61, 77,78,79,80,81,82]. Nineteen studies [19, 24, 26, 37,38,39,40, 45, 51,52,53,54, 57, 59, 60, 64, 70, 74, 75] were rated as sufficient (+) for construct validity, and 31 [10,11,12, 14,15,16,17, 21, 34,35,36, 41, 42, 44, 46,47,48, 50, 53, 55, 56, 58, 59, 67, 68, 76,77,78,79, 81, 82] were rated as insufficient (−). Internal consistency was rated as sufficient (+) in 59 studies [9,10,11,12,13,14,15,16,17,18,19, 22,23,24,25, 34, 36,37,38,39,40,41,42,43,44,45,46,47,48,49, 51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72, 74, 76,77,78,79,80,81,82] and as insufficient (−) in 4 studies [20, 21, 35, 50].

Table 4 Rating of the measurement properties of the instruments

Certainty of evidence

Table 5 shows the overall quality score for each measurement property of the HIV-specific and generic instruments. Five PROMs were strongly recommended based on the methodological quality of each psychometric property, including MOS-HIV, WHOQoL-HIV-BREF, SF-36, MQoL-HIV, and WHOQoL-HIV. Among the seven language versions of the MOS-HIV [21, 34,35,36,37,38,39,40,41,42,43,44,45,46,47], six were rated as “high” for internal consistency [21, 34, 35, 41,42,43,44,45,46,47], and one was rated as “moderate” [36,37,38,39,40]. There were three versions rated as “high” for cross-cultural validity/translation [34, 41, 44, 46]. Among the eight versions of the WHOQoL-HIV-BREF [50,51,52,53,54,55,56,57,58,59,60,61], five were rated as “high” for internal consistency [50,51,52, 54, 56], and one was rated as “moderate” [53]. In total, more studies of the MOS-HIV were rated as “high” than studies of the WHOQoL-HIV-BREF, and more studies of the WHOQoL-HIV-BREF were rated as “very low” than studies of the MOS-HIV.

Table 5 Overall quality score for each measurement property

Discussion

This systematic review identified and assessed the psychometric properties of 30 HRQoL PROMs in PLWH and evaluated the certainty of the evidence provided for each PROM. To the best of our knowledge, this is the first and most comprehensive systematic review summarizing all psychometric properties of HRQoL PROMs for PLWH. The results may provide quantitative evidence for researchers and healthcare professionals to choose PROMs measuring HRQoL in PLWH in future scientific research and clinical practice.

Our systematic review found that, compared to other HIV-specific and generic PROMs, the MOS-HIV has the best psychometric properties. The MOS-HIV is the most widely used HIV-specific instrument. In total, we identified fourteen validation studies that evaluated the psychometric properties of eight different language versions of the MOS-HIV; the Chinese versions included both simplified and traditional Chinese. Only one version was rated as “moderate” in internal consistency, and the others were rated as “high”. The MOS-HIV also has good construct validity, criterion validity, and hypothesis testing for construct validity. Overall, the expert group classified the MOS-HIV as strongly recommended based on the GRADE system. Our results are in line with previous studies. Cooper and colleagues conducted an umbrella review and found that the MOS-HIV was recommended as a suitable measure for assessing HRQoL in PLWH from a content perspective [29]. In general, the MOS-HIV was considered to have good psychometric properties. Good internal consistency was generally reported, and its reliability was considered adequate [83, 84]. Acceptable convergent validity and discriminant validity were reported in several reviews [31, 32]. As one of the earliest HIV-specific HRQoL PROMs, the MOS-HIV has been translated into at least 14 languages. The reliability and validity of the instrument may decrease in translated versions because of cultural adaptation. For these versions, mixed findings on the hypothesis testing of the MOS-HIV were reported [34,35,36,37,38,39,40,41,42,43,44,45,46,47]. As data on the psychometric properties were missing or indeterminate in many studies, we can draw only preliminary conclusions. More research is needed to fill the gaps in the evidence on the psychometric properties of the existing HRQoL instruments for PLWH.

Our review found that, in addition to MOS-HIV, the WHOQoL-HIV-BREF was reported to have good psychometric properties. Seven of eight different language versions of the WHOQoL-HIV-BREF were rated as “high” in hypothesis testing for construct validity. The WHOQoL-HIV-BREF was reported to have better reliability and internal consistency than other instruments except the MOS-HIV. Two language versions of the WHOQoL-HIV-BREF were rated as “very low” in internal consistency. Three language versions were rated as “very low”, and two were rated as “moderate” in construct validity. Connell and Skevington published a study to report the development and psychometric properties of the WHOQoL-HIV-BREF [51]. The results showed very good discriminant validity, which suggested the important role of the WHOQoL-HIV-BREF in distinguishing different stages of HIV disease progression [51].

Although the MOS-HIV showed good psychometric properties, a major advantage of the WHOQoL-HIV-BREF is its brevity. It contains only 31 items, and most participants can complete the instrument in 8 min. The WHOQoL-HIV-BREF is increasingly being used in HIV research. From a practical perspective, the MOS-HIV and WHOQoL-HIV-BREF focus on different dimensions and are based on different theoretical perspectives. The MOS-HIV is a multidimensional assessment measure that assesses physical, psychological, and social functioning. The MOS-HIV consists of 35 items across 11 domains: physical functioning, pain, social functioning, role functioning, emotional well-being, energy/fatigue, cognitive function, health distress, health transition, general health, and overall quality of life [8]. The WHOQoL-HIV-BREF has 31 items across six domains: physical functioning, psychological functioning, levels of independence, social relationships, environment, and spirituality [9].

The SF-36 is an internationally used generic instrument that can provide a comprehensive assessment of HRQoL in various populations. Although the SF-36 is also widely used in PLWH, only four validation studies were found in PLWH [19, 72,73,74]. The number of validation studies of different language versions was smaller than that for the WHOQoL-HIV-BREF and MOS-HIV. From a global perspective, a good PROM should show adequate psychometric properties in all language versions. Future validation studies are warranted to evaluate the psychometric properties of the SF-36 in PLWH in various contexts. In addition, other aspects, such as scoring methods and item content, may also restrict the wide use of the SF-36 in PLWH [85, 86]. Skevington et al. concluded that the SF-36 includes several different scoring scales and response options, which may complicate scoring and thus limit its widespread clinical use [85]. Abbasi-Ghahramanloo et al. showed that the SF-36 may lack the ability to measure self-reported subjective HRQoL [86].

This study strongly recommends four HIV-specific PROMs and one generic PROM. Generic PROMs can be used to measure the HRQoL of general or HIV-infected populations. However, they may lack the sensitivity to detect subtle changes specific to PLWH, including stigma, relationship issues, and comorbidities [87]. HIV-specific PROMs are more closely related to the disease than generic PROMs and have the sensitivity and specificity needed for HIV-specific domains. Nonetheless, they are not well suited to comparisons across populations [88, 89]. We recommend that, when selecting instruments, researchers consider multiple aspects, including psychometric properties, instrument content coverage, ease of use, and scoring methods. The choice of PROMs should therefore be based on the specific aims of the assessment and the response burden for participants.

Overall, we acknowledge that there are some limitations to this study. First, this study included only articles published in English or Chinese. Therefore, some studies published in other languages may not have been included, which may have affected the conclusions of this review. Second, we included only studies that aimed to evaluate the measurement properties of PROMs in PLWH. Some cross-sectional studies that aimed to explore the level of HRQoL in PLWH also reported the reliability and validity of PROMs. These types of studies were not included in this study. Third, we included four PROMs in Chinese that did not report a specific name. We used “unknown” to describe the names of these PROMs in all tables.

Conclusions

This systematic review identified and described the psychometric properties of 30 instruments reported in 69 studies. The findings from the included studies highlighted that, compared to other HIV-specific and generic HRQoL PROMs, the MOS-HIV had the best psychometric properties and could be recommended as the most suitable for use in research and clinical practice. We also strongly recommend using the WHOQoL-HIV-BREF, SF-36, MQoL-HIV, and WHOQoL-HIV to evaluate HRQoL in PLWH. We suggest that the choice of PROMs should be based on the specific aims of the assessment and the response burden for participants.