Introduction

HIV has infected a total of 84.2 million people and claimed 36.3 million lives worldwide since the start of the epidemic [1]. Today, HIV remains a major global public health issue. An estimated 40.1 million people were living with HIV/AIDS worldwide at the end of 2021 [2]. Given China’s large population, the impact of HIV in China should not be underestimated despite the relatively low prevalence. By the end of 2020, China had 1.05 million people living with HIV/AIDS and 351,000 cumulative reported deaths [3].

The widespread application of highly active antiretroviral therapy (HAART) has made HIV infection a manageable chronic health condition, enabling people living with HIV/AIDS (PLWHA) to live longer. At the same time, HIV infection and antiretroviral treatment could accelerate the aging process of PLWHA [4]. The World Health Organization has suggested age 50 as the cut-off for defining older people among those living with HIV [5]. As of the end of 2019, there were about 7.5 million PLWHA aged 50 and over worldwide, making up one fifth of all PLWHA [6]. As a result of increasing access to effective HIV diagnosis and treatment, China has also witnessed an increasing number of older PLWHA in recent years [7]. In 2011, the proportion of PLWHA in China aged between 50 and 64 reached 13.6%, up from 1.6% in 2000 [8].

However, longer life expectancy does not necessarily mean better well-being. Alongside physical discomforts, PLWHA also struggle with depression, anxiety, financial stress, and HIV-related discrimination [9]. To fully understand the health status of PLWHA and address their holistic needs beyond viral suppression, patient-reported outcome (PRO) measures should be developed and validated to complement biomarkers to depict patients’ experience with the disease and treatment [10].

Among previous studies assessing the health outcomes of PLWHA, generic instruments have been the most widely used because they facilitate comparison between different disease or treatment groups. However, they were not originally designed to identify disease-specific issues and may therefore fail to capture important impacts of HIV [11]. As for PRO instruments developed specifically for PLWHA, quite a number of them predate the wide application of HAART, which reduces their validity in evaluating treatment effectiveness [12, 13]. In addition, PRO instruments for PLWHA introduced from other countries should be used with caution as they might be culturally inappropriate [14]. Another major problem is that few PRO measurement instruments exist for older PLWHA. Aging is accompanied by a decline in physical function and a transition of social roles, which further worsens and complicates the physical, psychological and social consequences for older patients. Measuring how older adults perceive their overall health condition is gaining increasing attention; both generic and disease-specific PRO instruments have already developed modules specific to older adults [15, 16].

Most of the instruments mentioned above were developed using classical test theory (CTT), which does not allow test items to be divided up and reorganized to meet different testing needs without compromising the instrument’s reliability. An alternative to the CTT approach is item response theory (IRT), which postulates that the probability of correctly responding to a given item can be modelled as a function of the item’s difficulty and discrimination and the participant’s ability on the trait being measured [17]. Unlike CTT statistics, which depend on the sample from which they are derived, IRT can provide stable estimates of an item’s difficulty, discrimination and guessing probability that do not vary with changes in sample, item order and test conditions [18]. This characteristic makes it an ideal approach to developing adaptable yet rigorous instruments.

PRO measures for HIV/AIDS are expanding, but no gold standard exists yet. Advances in treatment, the unique needs of older PLWHA, the culture-dependent nature of PRO, and the progressive development of psychometrics all call for new measures that accommodate different situations. This study aimed to use both CTT and IRT to develop a disease-specific PRO instrument for older Chinese PLWHA (PROHIV-OLD), in the hope that the collected PRO data could better describe the lives of older people with HIV/AIDS in China and accordingly improve treatment and care for this population. This article reports on the iterative process of item selection and on the initial evaluation of the instrument’s reliability and validity.

Preliminary work

A literature review and focus group interviews with health care professionals were conducted first, from which an initial conceptual framework covering physical, emotional, social, and treatment domains was generated. Following this conceptual framework, a total of 93 patients were interviewed face-to-face and videotaped. At numerous points in the interview, participants were encouraged to spontaneously add any comments or areas related to the disease that they deemed appropriate and important. Once the interviews were completed, the videotapes were transcribed. The transcriptions were then compared against the original videotapes by a second set of research assistants. The interview transcripts were reviewed and coded by 2 researchers, and items were generated and categorized. A preliminary item pool of 56 items was then presented to patients who had not participated in the initial interviews to evaluate the relevance, importance, comprehensibility, and potential redundancy of the items, during which one item was discarded because of overlap with other items. The remaining 55 items comprised the preliminary PROHIV-OLD instrument tested here. Items were scored on a 7-point Likert scale with anchor points labeled from “not at all” to “very much”. The recall period was set at one month.

Methods

Design and subjects

From February 2021 to November 2021, participants were recruited from six designated hospitals in three cities of Zhejiang Province, China, with varying socioeconomic status according to GDP per capita. Participants were followed up six months after the baseline survey. PLWHA aged 50 and over who were receiving ongoing antiviral therapy were eligible to participate in this study, while those who had cognitive issues, could not understand Mandarin Chinese, or were at the terminal stage of AIDS were excluded.

The PROHIV-OLD and a validated outcome measure, the Medical Outcomes Study HIV Health Survey (MOS-HIV) [19], were administered at baseline and at the 6-month follow-up. Demographic and HIV-related information was also collected. The baseline data were used as the study sample for item reduction analyses (Phase I), and the follow-up data as the validation sample to test the final instrument (Phase II).

This study was approved by the Institutional Review Board of Zhejiang University (approval number: ZGL202007-03), and written informed consent was obtained from all participants.

Phase I: item reduction

Item reduction based on the CTT

The score distribution of each item was analyzed. An item was to be removed if its floor or ceiling effect exceeded 20% [20]. Items with a standard deviation (SD) lower than 1 or a coefficient of variation (CV) lower than 0.3 were deemed to have low variability and were to be removed [21].
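The distribution criteria above can be illustrated with a minimal sketch. It assumes items are scored 0 to 6, as implied by the floor and ceiling scores reported in the Results; the function name `screen_item` and the two response vectors are hypothetical and are not study data.

```python
from statistics import mean, stdev

def screen_item(scores, min_score=0, max_score=6,
                extreme_cutoff=0.20, sd_cutoff=1.0, cv_cutoff=0.3):
    """Flag one item against the floor/ceiling, SD and CV criteria."""
    n = len(scores)
    floor = sum(s == min_score for s in scores) / n      # share at lowest option
    ceiling = sum(s == max_score for s in scores) / n    # share at highest option
    sd = stdev(scores)                                   # sample standard deviation
    avg = mean(scores)
    cv = sd / avg if avg else float("inf")               # coefficient of variation
    flags = []
    if floor > extreme_cutoff:
        flags.append("floor")
    if ceiling > extreme_cutoff:
        flags.append("ceiling")
    if sd < sd_cutoff:
        flags.append("low_sd")
    if cv < cv_cutoff:
        flags.append("low_cv")
    return {"floor": floor, "ceiling": ceiling, "sd": sd, "cv": cv, "flags": flags}

# Hypothetical responses for two items (illustration only)
spread_item = [0, 1, 2, 3, 3, 4, 5, 6, 2, 4]
skewed_item = [6, 6, 6, 6, 6, 5, 5, 5, 5, 5]
print(screen_item(spread_item)["flags"])
print(screen_item(skewed_item)["flags"])
```

In this sketch an item with any flag would be a candidate for removal; how the flags are combined in practice is a design choice the text does not specify.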

Exploratory factor analysis (EFA) aided in item reduction and exploration of the factor structure. Exploratory structural equation modeling (ESEM) was also employed to analyze the factor structure. ESEM can be seen as a compromise between the flexibility of EFA and the rigor of SEM [22]. It has been used when factor structures are not yet well established, as it allows for a more detailed assessment of model fit [23, 24]. Principal axis factoring with an oblique rotation was employed to extract factors. The scree plot [25], Horn’s parallel analysis (PA) [26] and Velicer’s minimum average partial (MAP) test [27] were used to determine the number of factors to be extracted. Proposed models were compared by ESEM using the following fit indices: chi-square divided by degrees of freedom (χ2/df), Tucker-Lewis index (TLI), standardized root mean square residual (SRMR), root mean square error of approximation (RMSEA), and Bayesian information criterion (BIC). Satisfactory model fit requires χ2/df < 3, TLI ≥ 0.9, SRMR < 0.08, RMSEA < 0.08, and a lower BIC [28, 29]. The ESEM fit indices, conceptual clarity and model simplicity were taken into account to select the optimal factor structure [30, 31]. Items with low factor loadings were dropped one by one in ascending order of loading until all remaining items had a loading of 0.35 or higher on only one factor [32, 33].
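The loading rule at the end of this paragraph can be sketched as a simple classification of one rotated loading matrix. This is an illustration under assumptions: the function name `classify_items`, the use of absolute loadings, and the example values are hypothetical, and any re-estimation of the factor model after each deletion is not shown.

```python
def classify_items(loadings, cutoff=0.35):
    """loadings: dict mapping item label -> list of loadings, one per factor."""
    result = {}
    for item, row in loadings.items():
        # Factors on which the item reaches the cutoff (absolute value assumed)
        salient = [j for j, value in enumerate(row) if abs(value) >= cutoff]
        if not salient:
            result[item] = "low loading"
        elif len(salient) > 1:
            result[item] = "cross-loading"
        else:
            result[item] = f"retain on factor {salient[0] + 1}"
    return result

# Hypothetical rotated loadings for three items on two factors
example = {
    "item_a": [0.62, 0.10],
    "item_b": [0.20, 0.31],
    "item_c": [0.41, 0.48],
}
print(classify_items(example))
```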

Once the factors had been determined by factor analysis, the internal consistency of the items was evaluated using Cronbach’s alpha if item deleted (CAID) values. If removing an item increased Cronbach’s alpha, that is, if its CAID value exceeded the alpha of the full item set, the item was to be removed because it contributed poorly to internal consistency [34].
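As a rough illustration of the CAID check, the sketch below computes Cronbach’s alpha from its standard variance formula and recomputes it with each item left out. The function names and the small data set are hypothetical and chosen only so that one item visibly lowers alpha.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: list of columns, each holding one item's scores across respondents."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def alpha_if_item_deleted(items):
    """Cronbach's alpha recomputed with each item removed in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical data: three consistent items and one inconsistent item
data = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [5, 1, 4, 2, 3],
]
alpha = cronbach_alpha(data)
caid = alpha_if_item_deleted(data)
# Items whose removal would raise alpha are candidates for deletion
flagged = [i for i, value in enumerate(caid) if value > alpha]
print(round(alpha, 3), [round(v, 3) for v in caid], flagged)
```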

Item reduction based on the IRT

Given the ordered categorical nature of the response categories, the graded response model (GRM) was employed in this step to analyze the items within each dimension [35].

The assumptions of unidimensionality and monotonicity were checked before estimating item parameters and latent trait levels. PA was used to check unidimensionality, which requires that a single latent trait underlies a set of test items [36]. Monotonicity was verified by the graphical ascent of the item characteristic curve (ICC) [37].

Discrimination and difficulty are the two parameters of interest in IRT. Item discrimination (α) represents the ability of an item to discriminate between respondents with similar latent trait levels. Discrimination values between 0.4 and 4.0 are deemed acceptable [38]. Item difficulty (βi) is defined by the latent trait levels that mark the thresholds between response options. There is supposed to be a graded monotonic relationship between the respondents’ trait level and the item response options, such that respondents with a low trait level endorse low response options. Disordered thresholds occur when this monotonic relationship does not appear on the category characteristic curves (CCCs). A polytomous item with 7 response categories has six difficulty parameters (denoted β1, β2, β3, β4, β5, β6). The six difficulty values should range from −3.0 to 3.0 and should be in ascending order [21, 39, 40].
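To make the roles of α and β1 to β6 concrete, the following sketch evaluates category probabilities under the standard logistic form of the graded response model and applies the acceptance ranges quoted above. It is an illustration, not the estimation procedure used in the study; the function names and the parameter values for the example item are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Probability of each response category for one item under the GRM."""
    # Cumulative probabilities P(X >= k), padded with 1 (lowest) and 0 (beyond highest)
    cumulative = [1.0]
    cumulative += [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Category probability is the difference of adjacent cumulative probabilities
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

def item_acceptable(a, thresholds, a_range=(0.4, 4.0), b_range=(-3.0, 3.0)):
    """Check discrimination range, threshold range and threshold ordering."""
    ordered = all(low < high for low, high in zip(thresholds, thresholds[1:]))
    in_range = all(b_range[0] <= b <= b_range[1] for b in thresholds)
    return a_range[0] <= a <= a_range[1] and ordered and in_range

# Hypothetical 7-category item with six ordered thresholds
betas = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=betas)
print([round(p, 3) for p in probs])
print(item_acceptable(1.5, betas))
```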

Examining differential item functioning (DIF) is important for investigating the stability of an item’s measurement properties across subgroups that differ in background characteristics [41]. The presence of DIF, whether uniform or non-uniform, was evaluated by logistic regression analysis. Items were flagged for possible DIF when the p value of the χ2 test was < 0.01 and the effect size measure (McFadden’s pseudo R2) was > 0.13 [42, 43]. The variables used to test DIF in this study were gender (male vs. female), place of residence (city vs. village), and household monthly income per capita (≤ 600 RMB vs. > 600 RMB).
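For readers unfamiliar with the logistic regression approach to DIF, one commonly used formulation compares three nested models per item. The notation below is an assumption made for illustration (the text does not spell out the models): \(u_i\) is the response to item \(i\), \(\theta\) the trait estimate, \(g\) the grouping variable, \(\tau_k\) the category intercepts and \(\gamma\) the regression coefficients, chosen to avoid a clash with the IRT parameters α and β.

```latex
% Nested models for item i (assumed formulation, for illustration only)
\begin{align}
\text{Model 1:}\quad & \operatorname{logit} P(u_i \ge k) = \tau_k + \gamma_1 \theta \\
\text{Model 2:}\quad & \operatorname{logit} P(u_i \ge k) = \tau_k + \gamma_1 \theta + \gamma_2 g \\
\text{Model 3:}\quad & \operatorname{logit} P(u_i \ge k) = \tau_k + \gamma_1 \theta + \gamma_2 g + \gamma_3 (\theta \times g)
\end{align}
% McFadden's pseudo R^2 of model M relative to the intercept-only model 0
\[
R^2_{M} = 1 - \frac{\ln L_M}{\ln L_0}, \qquad
\Delta R^2_{12} = R^2_{2} - R^2_{1}, \qquad
\Delta R^2_{23} = R^2_{3} - R^2_{2}
\]
```

Under this formulation, uniform DIF corresponds to a significant improvement of Model 2 over Model 1 and non-uniform DIF to a significant improvement of Model 3 over Model 2, with the change in pseudo R2 serving as the effect size.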

Phase II: scale validation

Reliability

Internal consistency reliability was determined by calculating Cronbach’s alpha coefficient, McDonald’s ω, and composite reliability (CR). Values of 0.7 or above were considered appropriate [31, 44].
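Composite reliability is usually computed from standardized factor loadings. The sketch below uses the common formula CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²)), which assumes uncorrelated errors; the function name and the loadings are hypothetical and are not taken from the study’s tables.

```python
def composite_reliability(loadings):
    """Composite reliability from standardized loadings of one dimension."""
    squared_sum = sum(loadings) ** 2                      # (sum of loadings)^2
    error_variance = sum(1 - lam ** 2 for lam in loadings)  # sum of residual variances
    return squared_sum / (squared_sum + error_variance)

# Hypothetical standardized loadings for a three-item dimension
cr = composite_reliability([0.7, 0.8, 0.9])
print(round(cr, 3), cr >= 0.7)
```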

Test-retest reliability was assessed at a two-week interval in a group of 60 patients with stable disease condition, using intraclass correlation coefficients (ICCs) from a two-way mixed effects model. Generally, ICCs ≥ 0.7 were considered acceptable [45].
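The text does not state which form of the two-way mixed effects ICC was used. As a sketch, the code below computes the single-measure consistency form, often written ICC(3,1), from the ANOVA mean squares; this choice, the function name and the example scores are assumptions.

```python
def icc_two_way_mixed_consistency(ratings):
    """ICC(3,1): rows are subjects, columns are occasions (e.g., test and retest)."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of measurement occasions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical test and retest scores for four respondents
scores = [[1, 2], [2, 1], [3, 4], [4, 3]]
icc = icc_two_way_mixed_consistency(scores)
print(round(icc, 3))
```

Because the consistency form ignores systematic shifts between occasions, a retest that differs from the first test by a constant yields an ICC of 1 in this sketch; an absolute-agreement form would penalize such a shift.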

Validity

Confirmatory factor analysis (CFA) was implemented to examine structural validity. A measurement model with χ2/df < 3, CFI ≥ 0.9, TLI ≥ 0.9, SRMR < 0.08 and RMSEA < 0.08 was considered to fit well [28].

Convergent and discriminant validity were assessed through correlation analyses between the PROHIV-OLD and the MOS-HIV. Correlations between comparable dimensions were expected to be larger than those between less comparable dimensions [46]. Spearman’s correlation coefficients of 0.50 or above were regarded as strong, 0.30–0.49 as moderate, and lower than 0.30 as weak [47].
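The correlation bands above can be illustrated with a small sketch that computes Spearman’s coefficient as the Pearson correlation of average ranks and then labels its strength. The function names and the two score vectors are hypothetical; applying the bands to the absolute value of the coefficient is an assumption.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def strength(rho):
    """Label a coefficient using the 0.50 and 0.30 cut-offs."""
    size = abs(rho)
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "moderate"
    return "weak"

# Hypothetical dimension scores for five respondents
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 2), strength(rho))
```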

Known-groups validity examines how well the instrument can discriminate among participants with different demographic backgrounds and clinical conditions. Previous studies have found that the health outcomes of PLWHA were poorer for females and for those with a heavy financial burden, a high plasma HIV-1 RNA level, low CD4+T cell counts, or terminal-stage AIDS [48, 49, 50]. In addition, we hypothesized that patients with comorbidity or abnormal liver or renal function would have a worse quality of life. One-way ANOVA was performed to assess group differences.

Data analysis software

EFA, IRT-based item reduction, and the calculation of McDonald’s ω were conducted in R (RStudio Version 1.3.959, macOS). ESEM and CFA were conducted in Mplus (Version 8.6, macOS). All other analyses were performed using SPSS (Version 24.0, macOS). A p value below 0.05 was considered statistically significant for all analyses except DIF, for which the threshold was p < 0.01.

Results

Sample characteristics

Of the 600 patients recruited at baseline, 82.17% were male. The average age of the study sample was 61.31 years (SD = 8.01). Most of the participants were married (71.17%), had a middle school education or below (76.50%), and had been infected through heterosexual contact (69.78%). A total of 180 participants (30.00%) reported comorbidity, and 57.50% of patients were asymptomatic HIV carriers. Respondents with a CD4+T cell count above 200 cells/μl accounted for 81.90%, and 87.00% of participants had a baseline plasma HIV-1 RNA level below the limit of quantification (20 copies/ml). Of the 485 patients who completed the follow-up investigation, 483 were included in the validation sample (Table 1).

Table 1 Sample demographic and disease-related information (n = 600)

Item reduction results

The percentage of responses at the floor (score = 0) ranged from 7.00 to 16.17%, and the percentage of responses at the ceiling (score = 6) ranged from 4.33 to 19.17% (Table 2). Each item demonstrated acceptable variability, with SD ranging from 1.67 to 1.99 and CV ranging from 0.55 to 0.74 (see Additional file 1).

Table 2 Percentage of each option for all items

In determining the number of factors to be extracted, PA and MAP suggested extracting 4 and 5 factors, respectively. The scree plot showed that a total of 9 factors had eigenvalues greater than 1, but factors 7, 8 and 9 were discarded as they were difficult to interpret. The hypothesized conceptual framework of the PROHIV-OLD proposed a four-factor structure. Therefore, three EFA models with four to six factors were proposed, and ESEM was conducted to compare the fit of these models (Table 3). The fit indices χ2/df, TLI, SRMR, and RMSEA appeared more satisfactory when more factors were retained, but the BIC of the five-factor model was the smallest. Considering the interpretability and simplicity of the model structure, the five-factor solution was finally selected as the most theoretically sensible.

Table 3 Comparison of the three models by their fit indices

Factors were then extracted by principal axis factoring with oblique rotation, and the items were sorted in descending order of their loadings on each factor. According to the results, items 39, 37, 52, 36, 30, 41, 40, 8, 9, 29, 5, 33, and 42 were dropped because their factor loadings were lower than 0.35, and items 43, 48, 51, 2, 4, 14, 17, 18, and 24, which had loadings of 0.35 or higher on multiple factors, were also discarded. Finally, 33 items were retained after EFA, accounting for 52.24% of the total variance (Table 4).

Table 4 Exploratory factor analysis for the PROHIV-OLD five-factor model

The five-factor structure with 33 items was further examined by ESEM and showed acceptable fit (χ2/df = 2.91, TLI = 0.89, SRMR = 0.027, RMSEA = 0.056), although the TLI fell marginally below the 0.9 criterion. The remaining items were closely related to their own dimension (all r > 0.6, p < 0.05), and deleting any single item did not increase Cronbach’s alpha (see the CAID values in Additional file 1). Therefore, no more items were removed in the CTT-based reduction.

In the reduction based on IRT, the model assumptions were examined first. PA suggested that each of the five factors established by CTT was unidimensional (Table 5), as only the first eigenvalue generated from the raw data was greater than that expected from random data (simulations based on normal distributions). All the ICCs rose monotonically (see Additional file 2), verifying monotonicity.

Table 5 Test results of unidimensionality by PA

All items showed acceptable discrimination except item 35 (α = 4.99) and item 38 (α = 5.09) on Factor 4. Item 38 was deleted first as its discrimination parameter was slightly higher, after which item 31 exhibited an extremely large discrimination value (α = 51.45) and was consequently deleted. Discrimination and difficulty of the remaining 3 items on this factor were re-estimated and the results were acceptable. Items 6, 22, and 25 were deleted due to disordered thresholds. Figs. 1, 2 and 3 show the CCCs for these three items; the CCCs of the other items can be found in Additional file 2. Significant uniform DIF was detected for items 7, 10, 46, 49, and 50, and non-uniform DIF by registration or monthly household income per capita was detected for items 7, 22, 27, and 46. The R2 coefficients were all lower than 0.13, indicating that the impact of DIF on the assessment was small. Items with non-uniform DIF, i.e. items 7, 22, 27, and 46, were finally deleted. The item reduction process therefore resulted in a final version comprising 25 items within 5 dimensions (Table 6). Based on the content of the grouped items, the five dimensions were named physical symptoms, mental status, illness perception, family relationship, and treatment (Table 7).

Fig. 1 Category characteristic curves of item 6

Fig. 2 Category characteristic curves of item 22

Fig. 3 Category characteristic curves of item 25

Table 6 Item reduction based on IRT
Table 7 Bank of 25 items in the final PROHIV-OLD

Validation results

Reliability

Cronbach’s alpha, McDonald’s ω and CR were all excellent (> 0.85), supporting the internal consistency reliability of the PROHIV-OLD instrument. The ICC of the physical symptoms dimension was slightly lower than 0.7, while all other dimensions had ICCs higher than 0.7, indicating that the PROHIV-OLD had acceptable test-retest reliability overall (Table 8).

Table 8 Internal consistency reliability and test-retest reliability of the PROHIV-OLD instrument

Validity

CFA was conducted on the final PROHIV-OLD instrument to test structural validity. The five-factor model achieved a good fit (χ2/df = 2.54, CFI = 0.94, TLI = 0.93, SRMR = 0.06, RMSEA = 0.06), with the factor loadings of all 25 items ranging from 0.47 to 0.90, indicating good structural validity.

The Spearman correlation coefficients between the PROHIV-OLD and the MOS-HIV were stronger between more comparable dimensions (e.g., 0.65 between PROHIV-OLD physical symptoms dimension and MOS-HIV physical functioning scale) than those between less comparable dimensions (e.g., 0.26 between PROHIV-OLD family relationship dimension and MOS-HIV physical functioning scale). Generally, convergent and discriminant validity of the PROHIV-OLD was considered to be satisfactory (Table 9).

Table 9 Correlations between the PROHIV-OLD and the MOS-HIV (n = 483)

Table 10 shows the mean PROHIV-OLD dimension scores by subgroup. Female participants had significantly lower scores than males on the physical symptoms and mental status dimensions. Household monthly income per capita was positively related to physical symptoms scores. HIV-1 RNA level had no significant effect on any of the five dimensions. The CD4+T cell count at the latest blood test was positively associated with physical symptoms and treatment scores. Patients with CD4+T cell counts higher than 500 cells/μl scored highest on the mental status and family relationship dimensions, while those with CD4+T cell counts lower than 200 cells/μl scored lowest on the illness perception dimension. PLWHA who had progressed to the stage of AIDS scored lower on the treatment dimension. Comorbidity and dyslipidemia were significantly related to lower scores on the physical symptoms dimension, while patients with abnormal liver or kidney function did not report more physical symptoms.

Table 10 Validity of the PROHIV-OLD instrument assessed by the known-groups method (n = 483)

Discussion

As an increasing number of PLWHA are now living into older age, more attention should be paid to their overall quality of life during these extended years. Older PLWHA were rarely involved in the development, validation and application of related PRO instruments in the past, as few of them lived long enough; the validity of existing patient-reported measures for older PLWHA can therefore be challenged. For this reason, this study developed and validated an instrument to understand how HIV affects older Chinese patients.

In scale development and psychometric evaluation, CTT is the most frequently used method as it is easier to understand and implement. However, the reliability of results based on CTT statistics can be inadequate because these methods have certain disadvantages, such as dependence on the item sample and a lack of information on respondents’ ability [51]. IRT methods, in contrast, are independent of sample characteristics and afford a more accurate examination of each item [52], which has made IRT popular in item selection [53]. However, few HIV/AIDS-specific instruments have been developed using IRT to date. This study used both CTT and IRT to select items in the item reduction phase, hoping to further improve the performance of this instrument.

In item selection by EFA, determining the appropriate number of factors is an important yet controversial issue, as no single procedure among the many rules of thumb and statistical indices for addressing dimensionality seems to be entirely satisfactory [54, 55]. The commonly used Kaiser’s criterion [54] and the more accurate PA and MAP methods [30, 42] were employed in this study to identify the number of latent factors needed to accurately account for the common variance among the items. ESEM, which offers the advantage of providing overall tests of model fit [56], was then conducted to compare the fit of the competing models and determine the optimal factor structure. A five-factor structure was finally selected, and the factor rotation resulted in as many as 22 items being deleted. The strict requirements of EFA regarding the number and correlation of variables, as well as the sample size and distribution, could explain the large number of items deleted at this stage [57]; previous studies have also reported that quite a number of items were removed by EFA [58, 59].

In item reduction using IRT, 2 items failed to meet the discrimination criterion and were deleted first. Disordered thresholds were detected for 3 items, indicating that respondents may have had difficulty distinguishing between the response options, and these 3 items were consequently removed. Uniform DIF was observed for 5 items and non-uniform DIF for 4 items. No consensus has been reached on how to handle items with DIF. Items with non-uniform DIF are generally required to be deleted, while appropriate weightings can be applied to items with uniform DIF [60, 61]. Some studies have suggested determining the salience of DIF by testing its magnitude beyond significance, such that items exhibiting DIF of large magnitude, whether uniform or non-uniform, should be deleted [37, 42, 43]. This study also examined the magnitude of DIF. Because the DIF observed had no substantial influence, only items with non-uniform DIF were finally removed.

The reliability and validity of the final instrument have been rigorously tested. Internal consistency reliability of the PROHIV-OLD was supported by the high Cronbach’s alpha coefficients. McDonald’s ω and CR, which are deemed more suitable for evaluating the reliability of multidimensional instruments [62], further confirmed the reliability of each dimension. All dimensions demonstrated good test-retest reliability, except that the ICC of the physical symptoms dimension was slightly lower than 0.7. Apart from disease- and treatment-related symptoms, the physical symptoms dimension also contains items less specifically related to HIV infection, such as energy and sleep quality, which might be responsible for the lower test-retest reliability of this dimension.

Regarding the structural validity of the PROHIV-OLD, the poor fit of the one-factor model confirmed that the PROHIV-OLD is multidimensional in nature, and the final structure of the instrument was supported by CFA. Correlations between comparable PROHIV-OLD and MOS-HIV dimensions were stronger than those between less comparable dimensions. The correlations between the role functioning scale of the MOS-HIV and all five dimensions of the PROHIV-OLD were weak. The two items in the MOS-HIV role functioning scale concern the ability to do certain kinds or amounts of work, housework, or schoolwork, which are no longer the main content of older adults’ social life. Instead, their social relationships and interactions tend to be more confined to the family [63, 64], which possibly resulted in the relatively stronger correlation between the MOS-HIV role functioning scale and the PROHIV-OLD family relationship dimension. This also points to the uniqueness of older patients’ experience and of the conceptual framework of the PROHIV-OLD.

Known-groups validity was examined across a range of demographic and clinically relevant factors. Consistent with existing studies, gender [48] and income differences [49] in dimension scores were detected. For clinical factors, all five dimensions of the PROHIV-OLD distinguished well between patients with different CD4+T cell counts, while no significant associations were found between any dimension of the PROHIV-OLD and HIV-1 RNA level. The proportion of patients with an abnormal plasma HIV-1 RNA level (11.47%) might have been too small to detect its effect on patients’ perceived health status. Dyslipidemia was associated with poorer scores on the physical symptoms dimension, whereas patients with abnormal liver or kidney function did not report more physical symptoms. One possible reason is that liver and kidney function could only be roughly determined from limited medical information. Future studies could employ more precise medical examinations and include respondents’ self-perceived condition.

Several potential limitations of this study should be stated. First, the generalizability of this study might be limited given that only patients in Zhejiang Province were included. In addition, epidemic control policies under COVID-19 prevented us from interviewing hospitalized patients, who are at higher risk of serious opportunistic infections or other adverse events, which further limited the representativeness of the study sample. Second, for older PLWHA with poor vision, investigators helped them complete the survey by reading the items verbatim to them, which might have caused selection and social desirability bias. Third, because the primary aim was instrument development and validation, this study only detected the presence and salience of DIF; the complex mechanisms underlying DIF remain to be identified in future qualitative and quantitative studies. Fourth, although the reliability and validity shown in this study seem to be satisfactory, the instrument’s ability to detect change over time remains to be examined to further support its psychometric properties. Nevertheless, this large multi-site study with rigorous instrument development and validation methods provides a strong foundation for health outcome assessment and promotion for the ever-increasing population of older PLWHA.

Conclusions

The PROHIV-OLD instrument demonstrated acceptable reliability and validity, suggesting that it can be implemented in clinical research and practice to provide further valuable information on health outcome of older PLWHA in China. Other measurement properties such as responsiveness and interpretability will be further examined.