1 Introduction

Idiopathic pulmonary fibrosis (IPF) is an interstitial lung disease that causes irreversible and progressive lung scarring [1]. Lung transplantation stands as the sole curative treatment for IPF, although anti-fibrotic medication may decelerate disease progression in some patients [1]. Even with anti-fibrotic therapy, patients’ health-related quality of life (HRQoL) is significantly impacted by the burdensome symptoms of breathlessness, cough, and fatigue [2]. Not surprisingly, patients with IPF identify HRQoL as one of their treatment priorities [3]. The St George's Respiratory Questionnaire (SGRQ) is one of the most commonly used patient-reported outcome measures (PROMs) in IPF [4, 5]. However, the SGRQ was initially developed for patients with chronic obstructive pulmonary disease (COPD), and an IPF-specific version (SGRQ-I) was recently developed [6]. Swigris et al. conducted a literature review on the SGRQ in patients with IPF [7]. They found that the SGRQ's psychometric properties were adequate and suggested that it may be a useful measure of HRQoL in patients with IPF. However, they did not assess the SGRQ's properties against the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines. These guidelines provide a methodology for integrating the methodological rigor of studies on measurement properties with the quality of the PROM itself (its psychometric properties) [8]. This enables reviewers to draw better-supported conclusions about the quality of PROMs when selecting evidence-based measures for use in research and clinical practice [9]. Moreover, the previously mentioned review included studies only through 2013 and excluded the SGRQ-I because of the limited number of studies at that time. Thus, the SGRQ and SGRQ-I remain widely used in IPF despite the lack of a comprehensive evaluation of their psychometric properties.
Thus, our study aimed to conduct a systematic review with meta-analysis, wherever possible, of the psychometric properties of both the SGRQ and SGRQ-I among IPF patients using the COSMIN guidelines. The psychometric properties encompass content validity, structural validity, internal consistency, test–retest reliability, criterion validity, construct validity, known-groups validity, responsiveness, and interpretability (floor and ceiling effects, minimal important difference). Our study is intended to provide evidence on the strengths and weaknesses of using these instruments in research and clinical practice.

2 Methods

2.1 Study design

This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [10] and the COSMIN guidelines for conducting a systematic review of PROMs [9]. The protocol for this study is registered in Open Science Framework (https://osf.io/kgtz9/) but has not been published in a peer-reviewed journal.

2.2 Study eligibility criteria

The inclusion criteria for studies were as follows: (1) full-text studies and unpublished doctoral dissertations and master’s theses, (2) the study cohort included a substantial portion (defined as more than 30%, consistent with the literature [11]) of patients with IPF aged 18 years or older, (3) studies that addressed any psychometric property of the PROMs (development or evaluation studies), (4) any study design, and (5) studies conducted in any country. Exclusion criteria included (1) studies that used the PROMs only as an outcome measurement instrument, or validation studies in which the PROM was used to validate other instruments, (2) unpublished work such as abstracts from conference proceedings and technical reports, and (3) studies published in non-English language sources. The exclusion of unpublished studies and those published in languages other than English has been reported to have little effect on overall results [12].

2.3 Data sources

The following six electronic databases were searched: (1) PubMed, (2) Medline (via OVID), (3) CINAHL, (4) PsycInfo, (5) Web of Science, and (6) Scopus. In addition, to identify doctoral dissertations and master’s theses, the ProQuest Dissertations and Theses Database was also searched. Searches for relevant publications were conducted from the inception of each database through June 2022. For all electronic database searches, we employed the general search filter proposed by the COSMIN guidelines [13]. Appendix A Table A.1 shows the search strategy used for all databases searched. This search strategy was adapted according to the requirements of each database. The first and second authors (RM and YPM) conducted database searches and article retrieval with oversight from three senior authors (KMK, TL, and GAK).

2.4 Study selection

All search results were stored using Mendeley reference manager software (version 1.19.6). Duplicate records were removed both electronically using Mendeley and manually. Studies were then selected using a two-stage process. All titles and abstracts retrieved in the first stage were screened against the inclusion/exclusion criteria. Articles not excluded from the first stage were advanced to the second stage for review. Studies meeting all inclusion criteria were considered for data extraction. Two authors (RM and YPM) checked the bibliographies of included studies and previous systematic reviews to identify any additional studies meeting the inclusion criteria. The study selection and data extraction were conducted independently by RM and YPM. Both reviewers held meetings after each stage and reached an agreement to resolve any disagreements. When disagreements persisted, RM and YPM discussed them, and if no consensus was reached, they sought the opinion of one or more adjudicators (KMK, TL, or GAK).

2.5 Data abstraction

The data extracted from the studies included the population characteristics, recruitment period, language, sample size, age, gender, setting, and pulmonary function tests. All extracted data were entered into Microsoft Excel.

2.6 Risk of bias assessment (RoB)

The COSMIN risk of bias checklist was used to evaluate the methodological quality of studies on psychometric properties (Appendix A, Table A.2) [14]. Only checklist items relevant to the psychometric property examined in the included study were assessed, as not all measurement properties were evaluated in every article. The RoB of each study on a measurement property was rated as very good, adequate, doubtful, or inadequate. The “worst score counts” principle was applied, meaning that the lowest rating on any standard determined the overall rating for each study on a measurement property [14]. This overall RoB rating was accounted for when categorizing the quality of the evidence on the PROM measurement property. The risk of bias was assessed independently by RM and YPM. They reviewed the assessments for agreement, and if disagreements remained unresolved, an adjudicator (KMK, TL, or GAK) was consulted.

2.7 Instrument measurement properties evaluation

The result of each study on a measurement property was rated according to the COSMIN updated criteria for good measurement properties as sufficient (+), insufficient (−), or indeterminate (?) [9] (Appendix A, Table A.2). Based on general COSMIN recommendations, if the ratings across studies were consistent, results were either statistically pooled in a meta-analysis (provided at least five effect sizes were available for that outcome) or qualitatively summarized. An overall rating was assigned for each measurement property by consolidating the ratings of the individual studies on that property. An overall sufficient (+) or insufficient (−) rating was assigned if 75% or more of the study results were consistent, while results were considered inconsistent if fewer than 75% of studies showed the same result. If the ratings were inconsistent, our conclusion was primarily guided by the most consistent results, and the evidence was downgraded for inconsistency (±).

2.8 Quantitative summarization

2.8.1 Calculation and pooling of effect sizes

The Fisher r-to-z transformation was applied to approximate the distribution of the correlation coefficients to a normal distribution and to stabilize their variances [15,16,17]. We first transformed F-ratios and unstandardized beta coefficients to r and then to Fisher's z. The transformation was as follows: z = 0.5 × ln((1 + ICC)/(1 − ICC)), where ICC represents the intraclass correlation coefficient, with approximate variance Var(z) = 1/(N − 3), where N is the total sample size. Upon completion of the pooled analysis, results were back-transformed to r for ease of interpretation. The inverse variance heterogeneity (IVhet) model, recognized as more robust than the traditional random-effects model, was employed to pool effect sizes for all outcomes [18]. A minimum of five studies per outcome was required for pooling effect sizes [18]. In addition to pooling results, influence analyses were conducted to assess the impact or sensitivity of each study on the overall results. Furthermore, an outlier analysis was performed by excluding effect sizes whose 95% confidence interval (CI) fell completely outside the pooled 95% CI.
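The transformation and back-transformation described above can be sketched as follows (a minimal illustration; the function names are ours, and the actual analyses were performed in MetaXL and Stata):

```python
import math

def r_to_z(r):
    # Fisher r-to-z: approximately normalizes the sampling distribution
    # of a correlation (or ICC) and stabilizes its variance.
    return 0.5 * math.log((1 + r) / (1 - r))

def z_variance(n):
    # Approximate variance of Fisher's z for a total sample size n.
    return 1.0 / (n - 3)

def z_to_r(z):
    # Back-transform a (pooled) z to the correlation scale;
    # tanh is the exact inverse of 0.5 * ln((1 + r) / (1 - r)).
    return math.tanh(z)
```

For example, a correlation of r = 0.80 from N = 103 patients would enter the pooling as z ≈ 1.10 with variance 1/(103 − 3) = 0.01.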

2.8.2 Stability and validity of outcomes

Heterogeneity was examined using the Q statistic, with a significance level of p ≤ 0.10 indicating statistically significant heterogeneity [19]. Inconsistency was evaluated using I-squared (I²), with < 25%, 25–50%, and > 50% indicating small, medium, and large amounts of between-study inconsistency, respectively [20]. Tau-squared, an absolute measure of between-study heterogeneity, was also computed. Small-study effects (e.g., publication bias) were qualitatively evaluated using Doi plots and quantitatively assessed using the Luis Furuya-Kanamori (LFK) index [21, 22]. LFK indices within ±1, between ±1 and ±2, and beyond ±2 are suggestive of no, minor, and major asymmetry, respectively.
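As an illustration, Cochran's Q and I² can be computed from study-level effects and their variances as in this minimal sketch (our own helper functions, assuming inverse-variance weights):

```python
def cochran_q(effects, variances):
    # Cochran's Q: weighted sum of squared deviations of study effects
    # around the fixed-effect (inverse-variance weighted) mean.
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))

def i_squared(q, k):
    # I^2: percentage of total variability attributable to between-study
    # heterogeneity rather than chance; k is the number of studies.
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q) * 100
```

With hypothetical effects [0, 2] and variances [0.5, 0.5], this gives Q = 4 and I² = 75%, i.e., large between-study inconsistency under the cut-offs above.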

2.9 Qualitative summarization

In addition to quantitative analysis, study results were summarized qualitatively (lowest and highest values) for interpretability.

2.10 Strength of evidence

The strength of the evidence was graded using a modified Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach [8]. The modified GRADE approach includes (1) risk of bias, (2) inconsistency, (3) imprecision, and (4) indirectness (Appendix A, Table A.3). For indirectness, per COSMIN guidelines, the level of evidence can be downgraded by one or two levels for “serious” or “very serious” indirectness if studies are performed in a population other than the specific population of interest. The evidence was not downgraded for indirectness in this review, as no indirectness was deemed serious [8]. The quality of the evidence was graded for each measurement property and each PROM separately (high, moderate, low, and very low evidence).

2.11 Reporting evaluation

We utilized the COSMIN reporting guidelines to evaluate the level of transparent reporting in the included studies on the measurement properties of PROMs [23].

2.12 Summary of findings

The results for each measurement property of the SGRQ and SGRQ-I were presented in the Summary of Findings (SoF) table, along with a rating using symbols (+/−/?) and a grading system (high, moderate, low, very low) to indicate the quality of evidence. Based on the findings in the SoF table, the two instruments under consideration were placed into one of three categories based on COSMIN guidelines [8]. Category A comprises PROMs that may be recommended as the most appropriate for the target construct in patients with IPF, demonstrating adequate content validity. Category B consists of PROMs that show potential to be recommended but require additional validation studies; PROMs not falling into category A or C are classified here. Category C encompasses PROMs that should not be recommended because of strong evidence of insufficient measurement properties. As with study selection, data abstraction, and bias assessment, the protocols for resolving agreements and disagreements were followed by RM and YPM. Effect sizes were calculated using Microsoft Excel 2013, and data were analyzed using MetaXL, version 5.3, and Stata, version 16.

2.13 Modifications to a priori protocol

We deviated from the original protocol by choosing not to email the corresponding authors of included studies to request missing data. This decision was influenced by the large volume of missing information, which made the approach burdensome and time-consuming given the available resources. Nonetheless, we thoroughly identified and documented all reporting issues using the COSMIN reporting checklist.

3 Results

3.1 Study selection

A PRISMA flow diagram displays the study selection process (Fig. 1). We included 24 studies, of which 19 assessed the psychometric properties of the SGRQ [6, 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41] and seven assessed those of the SGRQ-I [6, 28, 42,43,44,45,46]. A list of excluded studies with the reasons for exclusion can be found in Appendix B Table B.1.

Fig. 1
figure 1

PRISMA flowchart displaying study selection process. Diagram adapted from Reference [74]

3.2 Study and population characteristics

Study and patient characteristics are presented in Table 1. Across all studies, the mean age of the participants ranged from 52 to 81 years, the proportion of males ranged from 46.0 to 84.6%, and the sample size ranged from 20 to 1061 patients. Disease severity was heterogeneous because of differing inclusion and exclusion criteria, such as excluding patients with comorbidities or malignancies, patients who had experienced deterioration or new symptom(s) in the week before enrollment, or patients based on their pulmonary function tests (PFTs). Also, in nearly half of the studies, the enrollment period was prior to 2014 (i.e., before the anti-fibrotic era). The version of the PROM used was never fully reported, noting that different versions exist with different recall periods and languages (e.g., American English vs. British English). Six studies reported how they handled missing data [6, 26, 36, 37, 45, 46]. The evaluation of studies according to the COSMIN reporting guidelines is presented in Appendix C, Table C.1.

Table 1 Study characteristics

3.3 Psychometric properties

  • a) Content validity

No study assessed the content validity of either PROM. However, one study briefly mentioned the degree to which the SGRQ content addressed quality-of-life aspects; the median rating was 8.5 on a scale from 1 (worst) to 10 (best) [25]. We did not rate content validity, as the information provided did not constitute a content validity analysis.

  • b) Structural validity

Only one study measured the structural validity of the SGRQ-I using Rasch analysis, a one-parameter logistic model [6]. The item trait interaction was χ2 = 10.4, p = 0.58, indicating a good fit to the Rasch model. We rated structural validity as 'indeterminate' due to the absence of information on unidimensionality, local independence, and monotonicity. The quality of evidence for both SGRQs was judged as 'moderate' (Appendix C, Table C.10 and Table 2).

  • c) Internal consistency

Table 2 Summary of findings

Appendix C, Table C.2, and Table 2 summarize internal consistency for the SGRQ and SGRQ-I. Cronbach's alpha was reported in all studies. Except for one study [36], all studies met the threshold for a positive rating (i.e., ≥ 0.70) for the SGRQ, with alphas ranging from 0.91 to 0.94 for the total score, 0.83 to 0.86 for the impact domain, and 0.84 to 0.86 for the activity domain; the reported alpha for the symptom domain was 0.66. A similar finding was observed for the SGRQ-I, where the symptom domain showed low internal consistency (0.48 and 0.62) [6, 39]. All but two studies [27, 43] reported the internal consistency of the subscales, with Cronbach's alpha reported to be greater than 0.70. However, we qualitatively rated the internal consistency as ‘indeterminate’ because of the lack of evidence for sufficient structural validity, as only one study reported goodness-of-fit indices, and only for the SGRQ-I [6]. The quality of evidence on internal consistency for both PROMs was deemed ‘moderate.’ Inadequate ratings for internal consistency were typically attributable to the internal consistency statistic (most commonly Cronbach’s alpha) being computed for the entire scale rather than for each subscale separately (Appendix C, Table C.2, and Table 2).

  • d) Test–retest reliability (repeatability)

The ICC was used in all studies to assess repeatability, with two studies also using Bland–Altman plots, over periods ranging from 2 to 52 weeks [44, 45]. For the SGRQ-I, the stability of patients between test and retest was not assessed by any method. For the SGRQ, on the other hand, patient stability was assessed using different definitions, including change in forced vital capacity percent predicted (FVC%) (≤ 2%), the Patient Global Impression of Change (PGI-C) (no change), the University of California San Diego Shortness of Breath Questionnaire (UCSD-SOBQ) (< 5 points), and FVC% (≤ 5%) between test and retest. Except for the symptom and impact domains of the SGRQ, all reported ICCs met the required threshold for a positive rating (i.e., ≥ 0.70). The methodological quality of the reliability studies was rated as "adequate" for both PROMs. We rated the test–retest reliability result as 'insufficient' because of the low ICCs for the symptom and impact domains of the SGRQ and the lack of reported subscale ICCs for the SGRQ-I. For the SGRQ, the discordance in the results led us to rate the quality of evidence on reliability as 'moderate'; for the SGRQ-I, it was deemed "high" (Appendix C, Table C.3 and Table 2).

  • e) Criterion validity

Two studies assessed the criterion validity of the SGRQ-I, with the SGRQ used as the criterion [6, 45]. Yorke et al. used the Bland–Altman plot and ICC, and excellent agreement was observed (ICC > 0.95) [6]. Prior et al. used the Bland–Altman plot and correlation, and a high correlation was observed (r > 0.75) [45]. Thus, we rated the methodological quality as "adequate", the criterion validity as 'sufficient', and the quality of evidence as ‘high’ (Table 2).

  • f) Construct validity

Construct (convergent) validity was the most frequently assessed psychometric property. In our systematic review, we categorized each correlation with a comparator instrument as weak (r < 0.4), moderate (0.4 ≤ r ≤ 0.7), or strong (r > 0.7) [47]. Different measures were used to validate the SGRQ and SGRQ-I: PFTs, generic HRQoL questionnaires such as the 12-item and 36-item Short-Form Health Surveys (SF-12 and SF-36), respiratory-specific HRQoL PROMs such as the King’s Brief Interstitial Lung Disease questionnaire (KBILD), and respiratory symptom questionnaires such as the Baseline Dyspnea Index (BDI). We hypothesized that the SGRQ and SGRQ-I would correlate more strongly with respiratory-specific HRQoL and symptom PROMs than with generic PROMs and PFTs. The methodological quality was rated as "adequate" for both PROMs, and construct validity was rated as 'sufficient', as the direction and magnitude of the correlations matched our hypotheses 75% of the time. The quantitative synthesis is summarized in Appendix C, Table C.4, and Table 2.
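The correlation thresholds used in this review can be expressed as a simple helper (a sketch; the function name is ours):

```python
def correlation_strength(r):
    # Categorize the magnitude of a correlation coefficient using the
    # thresholds adopted in this review: weak < 0.4, moderate 0.4-0.7,
    # strong > 0.7. Sign is ignored; only the magnitude is classified.
    magnitude = abs(r)
    if magnitude < 0.4:
        return "weak"
    if magnitude <= 0.7:
        return "moderate"
    return "strong"
```

For instance, a correlation of −0.55 between an SGRQ domain and a comparator would be classed as moderate regardless of its direction.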

Meta-analysis of pooled results revealed the same pattern of correlations as the individual studies (Figs. 2, 3, 4, 5, 6, 7; Appendix C, Table C.5). For PFTs and 6-min walk distance (6MWD), we observed statistically significant heterogeneity as measured by Cochran’s Q, moderate to large inconsistency, and minor to major small-study effects. In contrast, for the BDI, no significant heterogeneity was detected, inconsistency was below moderate, and no small-study effects were found. The inconsistency identified in the meta-analysis led us to rate the quality of evidence as ‘moderate’.

Fig. 2
figure 2

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ and 6MWD. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ St George's Respiratory Questionnaire, S symptom, A activity, I impact, T Total score, 6MWD 6-min walk distance

Fig. 3
figure 3

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ and BDI. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ St George’s Respiratory Questionnaire, S symptom, A activity, I impact, T Total score, BDI Baseline Dyspnea Index

Fig. 4
figure 4

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ and DLco%. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ St George's Respiratory Questionnaire, S symptom, A activity, I impact, T Total score, DLco% the diffusing capacity of the lungs for carbon monoxide percentage predicted

Fig. 5
figure 5

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ and FVC%. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ St George's Respiratory Questionnaire, S symptom, A activity, I impact, T Total score, FVC% forced vital capacity percentage predicted

Fig. 6
figure 6

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ and FEV1% and TLC%. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ St George's Respiratory Questionnaire, T Total score, 6MWD 6-min walk distance, FEV1% Forced expiratory volume percentage predicted, TLC% Total lung capacity percent predicted

Fig. 7
figure 7

Forest plots and Doi plots for the construct validity (correlation coefficient) of SGRQ-I and 6MWD and FVC%. The values presented in the figure represent the absolute correlation coefficients. The direction of each correlation is as shown in Table C.4.4. SGRQ-I IPF version of St George's Respiratory Questionnaire, T Total score, 6MWD 6-min walk distance, FVC% forced vital capacity percentage predicted

Sensitivity analysis and outlier analysis

We identified the study by Abdelaziz et al. [24] as an outlier in the meta-analysis of the correlation between SGRQ domains and 6MWD. Excluding this study had two effects: first, it decreased heterogeneity and inconsistency among the remaining studies; second, it improved the precision of the pooled correlation coefficients, enhancing the accuracy of the correlation estimates. The sensitivity and outlier analysis results are presented in Appendix C, Table C.6. Also, our sensitivity analysis indicated that excluding the study by Lutogniewska et al. [32], in which patients with IPF comprised 35% of the total sample, did not significantly impact the overall findings (Appendix C, Table C.6).

  • g) Known-groups validity

The evidence for known-groups validity was strong. Studies that examined differences in HRQoL as assessed by the SGRQs across levels of disease severity, measured by approaches such as the Composite Physiologic Index (CPI), the Gender-Age-Physiology (GAP) index, oxygen use, and PFTs, showed significant differences between groups. Thus, we rated the methodological quality as “adequate”, the known-groups validity as ‘sufficient’, and the quality of evidence as 'high' (Appendix C, Table C.7 and Table 2).

  • h) Responsiveness

Construct and sub-group approaches were used almost equally. The change in patient condition was assessed using different anchors, including the health transition question (SF2) of the SF-36, PFTs, 6MWD, and computed tomography (CT) scans. Responsiveness was measured over various periods (6 months to 1 year). As we hypothesized, the correlations were similar in direction but lower in magnitude. Thus, we rated the methodological quality as “adequate”, the responsiveness as 'sufficient', and the quality of evidence as ‘high’ (Appendix C, Table C.8 and Table 2).

  • i) Interpretability

Floor and ceiling effects

For the SGRQ, three studies reported information on floor and ceiling effects. Furukawa et al. [26] reported that 9% of patients had scores ≤ 10, Matsuda et al. [33] found minor ceiling effects in the domains, and Nishiyama et al. [40] observed some ceiling effects. For the SGRQ-I, Akihiko et al. [43] found minor floor and ceiling effects: 17.3% of patients had the worst possible score, and 23.1% had the best. Two studies found no floor or ceiling effects for the SGRQ-I total or domain scores [39, 41].

Minimal important difference (MID)

Appendix C, Table C.9 summarizes the MID for the SGRQ and SGRQ-I. A total of six studies provided MID estimates. For the SGRQ, MID estimates for the total score ranged from −4.4 to −8.1 points for improvement and from 3 to 10.9 points for deterioration. For the SGRQ-I, MID total score values ranged from −0.7 to −5.5 points for improvement and from 1.3 to 7.6 points for deterioration. Three studies did not specify whether the reported MID was for improvement or deterioration [27, 35, 45]. The MID was assessed across various timeframes, spanning from 100 days to 12 months, and using diverse methodologies, including both anchor-based and distribution-based approaches. Different anchors were used, such as the SF2 question, PGI-C, PFTs, 6MWD, the Transition Dyspnea Index (TDI), the Medical Research Council (MRC) Dyspnea Scale, the UCSD-SOBQ, and Global Rating of Change Scales (GRCS). Distribution-based MID estimates were based on different multiples of the standard deviation (0.5 SD and 1 SD) and on the standard error of measurement (SEM).
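The distribution-based estimates mentioned above follow standard formulas; a minimal sketch (function names are ours), where the SEM is derived from the baseline standard deviation and a reliability coefficient such as a test–retest ICC:

```python
import math

def mid_half_sd(sd):
    # Distribution-based MID as 0.5 x the baseline standard deviation.
    return 0.5 * sd

def sem(sd, reliability):
    # Standard error of measurement: SD x sqrt(1 - reliability), where
    # reliability is, e.g., a test-retest ICC or Cronbach's alpha.
    return sd * math.sqrt(1 - reliability)
```

For a hypothetical total-score SD of 16 points and reliability of 0.91 (values chosen for illustration, not taken from the included studies), this gives a 0.5 SD MID of 8.0 points and an SEM of 4.8 points.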

4 Discussion

To our knowledge, this is the first systematic review following the COSMIN guidelines to assess and summarize the psychometric properties of the SGRQ and SGRQ-I. Despite being commonly used instruments for measuring HRQoL in IPF, the content and structural validity of both PROMs are still not well established among patients with IPF. In addition, the questionnaires did not demonstrate acceptable reliability (internal consistency and repeatability), and some findings indicated ceiling effects. It should be emphasized, however, that reliability is a characteristic of an instrument as used in a population, not of the instrument alone [48]. In fact, it is imperative to assess all psychometric properties within the target population to ensure that the instrument is fit for purpose, in accordance with FDA guidance [49]. Nevertheless, both PROMs demonstrated acceptable construct validity, responsiveness, and known-groups validity.

A major shortcoming was the lack of content validity evidence for both PROMs in IPF. The SGRQ was initially developed for patients with COPD and asthma [5]. It has been criticized for including items irrelevant to IPF, such as wheezing and episodes of chest discomfort [6]. Assessing the content validity of a PROM developed in a different population is critical to ensure that the instrument's content is relevant and meaningful to the new population. Different populations may have distinct experiences that can influence their understanding and interpretation of PROM items [48]. By systematically assessing the instrument's relevance and appropriateness in the new population, researchers can address potential biases or limitations associated with using an instrument developed in a different population [48]. The lack of studies assessing content validity was also a limitation reported in previous systematic reviews of other PROMs [50,51,52]. Structural validity refers to the extent to which the scores of a measurement instrument adequately reflect the construct being measured [8]. Unfortunately, we did not find any study assessing the structural validity of the SGRQ among patients with IPF. For the SGRQ-I, structural validity was assessed in only one study [6], and we could not determine whether it was sufficient because important information was missing.

Construct validity refers to the extent to which PROM scores correlate with other measures in accordance with theoretically derived hypotheses concerning the construct being assessed [8]. Pulmonary function tests, 6MWD, and PROMs such as the BDI were the most frequently reported outcomes against which SGRQ scores were compared. The SGRQ consistently showed weak correlations with PFTs, which could be because they measure different constructs: PFTs provide information on the mechanical function and capacity of the lungs, while the SGRQ assesses the impact of respiratory symptoms, activity limitations, and social and psychological factors on a patient's overall quality of life. It is worth noting that forced expiratory volume in one second percent predicted (FEV1%) is typically used as a measure of lung function in obstructive lung diseases, such as COPD or asthma [53], whereas IPF is a restrictive lung disease [54]. Therefore, FEV1% may be less relevant or informative in the context of IPF. Differing time frames could also weaken the correlation between PFTs and the SGRQ, as PFTs provide a snapshot of lung function at the time of testing, while the SGRQ evaluates the impact of respiratory symptoms and limitations over a more extended period. Although PFTs and SGRQ scores may not correlate, both assessments serve different purposes in evaluating respiratory diseases. Conversely, the SGRQs are expected to correlate moderately to highly with respiratory-specific questionnaires, such as the BDI. The BDI and SGRQ both measure aspects related to dyspnea (shortness of breath) and its impact on a patient’s life: the BDI focuses on the severity of breathlessness, the magnitude of functional impairment, and the level of the patient’s activity, which overlap with SGRQ content.
The developers of the COSMIN checklist recommend that authors formulate clear hypotheses regarding the strength and magnitude of the correlations before analyses [9]. Except for two studies [25, 42], none specified the expected magnitude of correlations. Also, studies did not demonstrate the reliability and validity of the comparator measurements among IPF patients. Furthermore, some studies did not report the psychometrics of the comparator instruments, while others referred to the psychometrics of the comparator in another population, which is questionable [48]. When the comparator instrument has not been validated among the same population as the new instrument being evaluated, it becomes challenging to determine if any observed differences or similarities between the instruments are due to actual differences in the measured constructs or if they are artifacts of the instruments [48].

In our analysis, Abdelaziz et al. [24] was considered an outlier study. The authors utilized the Arabic version of the SGRQ and found a strong correlation between SGRQ scores and PFTs. However, it is essential to acknowledge certain limitations of their study, such as the absence of a published cross-cultural validation study of the Arabic version of SGRQ, the small sample size, and a relatively low FVC% in their study population compared to the populations included in our meta-analysis. These factors may have influenced the observed correlation between SGRQ scores and FVC%. Future research should incorporate statistical techniques for cross-cultural validation, such as differential item functioning and considering variations in disease severity when assessing construct validity [55].

Internal consistency refers to the degree to which items in a measurement instrument are interrelated [8]. Internal consistency received an ‘indeterminate’ rating because there was insufficient evidence to establish satisfactory structural validity, per COSMIN guidelines [8]. Most researchers utilized Cronbach’s alpha to assess internal consistency, as suggested by COSMIN. The COSMIN guidelines also recommend rating the risk of bias as higher when studies do not report Cronbach’s alpha for each subscale, which was the case for two of the included studies [28, 44]. Consistent with a previous systematic review [7], we found that the symptom domain had low internal consistency, possibly because it includes symptoms unrelated to IPF, weakening the interrelatedness among items. The symptom domain could also be considered formative model-based, in which case measuring internal consistency might not be the proper approach. In a formative model, the observed indicators (e.g., cough and shortness of breath) are considered causal indicators of the latent construct (the symptom domain) [48]; the indicators are seen as distinct elements that contribute to the overall construct [48]. Internal consistency as measured by Cronbach’s alpha may not be suitable for formative models because the indicators are not expected to be highly correlated [56, 57].
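For reference, coefficient alpha can be computed directly from item-level data. The sketch below uses only the standard library; the item scores in the accompanying checks are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's coefficient alpha.

    item_scores: one list per item, each holding that item's scores
    for all respondents in the same order.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    k = len(item_scores)
    sum_item_var = sum(pvariance(item) for item in item_scores)
    total_scores = [sum(per_respondent) for per_respondent in zip(*item_scores)]
    return k / (k - 1) * (1 - sum_item_var / pvariance(total_scores))
```

Note that alpha presupposes a reflective measurement model; for a formative symptom domain the statistic remains computable but is not meaningful, which is the point made above.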

Test–retest reliability is crucial in PROMs because it allows researchers to determine whether the instrument consistently captures the same patient information over time [8]. High test–retest reliability indicates that the PROM is stable and produces consistent results [8]. Test–retest reliability was commonly assessed using the ICC [58], which estimates the proportion of variance in the measurements attributed to true differences between subjects relative to the total variance [59]. Different ICC formulas are available, and the choice of formula can have implications for the reliability assessment [58]. Only two studies specified the model used to calculate the ICC [37, 42]. The two-way mixed effects model is recommended when using ICC formulas to assess absolute agreement in PROMs [58]. We emphasize the importance of documenting the chosen analytical formula or model to ensure transparency and allow other researchers to understand and replicate the reliability assessment. We also found that ICC values changed depending on how patient stability was defined by other measurements. The authors did not explain why they selected the cut-offs used to define patients’ stability according to the anchors chosen, and applying the same cut-off points yielded different results over different time intervals: ICCs measured at shorter intervals were higher than those measured over extended periods. The intensity and frequency of respiratory symptoms can fluctuate on a daily basis for individuals with IPF [60], and this day-to-day variability poses challenges for accurately recalling and appraising symptom severity [60].
Additionally, response shift, that is, a change in an individual's evaluation of their health status over time due to various factors, such as adaptation to illness and changes in priorities, can affect an instrument’s ability to detect change over time [61]. Recent research suggests utilizing tools such as patient diaries or electronic symptom-tracking systems to address these challenges [60, 62]. These tools allow patients to record their symptoms and severity daily, providing a more comprehensive and accurate assessment of symptom variability. By collecting real-time data, these methods can capture day-to-day fluctuations and help overcome recall bias associated with subjective symptom assessments [60, 62].
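To make the model choice concrete, the agreement-type ICC for a single measurement under a two-way model (ICC(A,1) in McGraw and Wong's notation) can be computed from the ANOVA mean squares as sketched below; the test–retest data in the checks are hypothetical.

```python
def icc_a1(data):
    """ICC for absolute agreement, single measurement, two-way model.

    data: list of subjects, each a list of k repeated scores
    (e.g., [test_score, retest_score]). Returns ICC(A,1).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # between subjects
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # between occasions
    mse = sum(
        (data[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))                                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

A systematic shift between occasions (e.g., every retest score one point higher) lowers ICC(A,1) even when the rank order of patients is preserved, which is precisely why the absolute-agreement form, rather than a consistency-type ICC, is recommended for PROMs.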

Ceiling and floor effects are observed when a substantial percentage of participants (≥ 15%) attain the highest or lowest possible score or rating on a specific item or the entire questionnaire [63, 64]. Minor ceiling effects for the SGRQ and minor floor and ceiling effects for the SGRQ-I were observed. One possible reason is that the instruments may not have enough response options or variability to accurately capture the full range of patients’ experiences at each end of the scale [64]. If the response options are limited or the scale does not adequately capture the severity of symptoms or limitations, it can result in a cluster of scores at the upper or lower end of the scale, leading to ceiling or floor effects. Additionally, the design and wording of the questionnaire items themselves can influence the occurrence of ceiling or floor effects. If the items are not sufficiently comprehensive or relevant to the experiences of the target population, it may limit the ability of the instruments to detect subtle differences and result in a restricted range of scores [63]. Overall, addressing these issues through refinement of the instruments and incorporating more comprehensive and relevant items can help minimize ceiling and floor effects and improve the sensitivity of the measures in capturing the full spectrum of patients' quality of life.
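The ≥ 15% rule can be operationalized directly. In this sketch, the 0–100 score range (as used for SGRQ totals) and the example scores in the checks are assumptions for illustration.

```python
def floor_ceiling(scores, min_score=0.0, max_score=100.0, threshold=0.15):
    """Flag floor/ceiling effects when at least `threshold` (default 15%)
    of respondents attain the lowest/highest possible score."""
    n = len(scores)
    floor_prop = sum(s == min_score for s in scores) / n
    ceiling_prop = sum(s == max_score for s in scores) / n
    return {
        "floor_proportion": floor_prop,
        "ceiling_proportion": ceiling_prop,
        "floor_effect": floor_prop >= threshold,
        "ceiling_effect": ceiling_prop >= threshold,
    }
```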

The MID refers to the smallest change in score in the construct that patients perceive as important [65]. The objective of establishing the MID is to create a standardized metric for assessing treatment efficacy and disease progression. We observed different MID values for the SGRQ and SGRQ-I. Multiple factors might have influenced this variability, such as inconsistency in the cut-offs assigned to different anchors, the use of multiple SD-based thresholds, and differing follow-up periods [66]. Even when the same methodology is applied, MID values differ because of their context specificity, influenced by patients’ baseline characteristics [67]. Therefore, the factors affecting MID values render them non-transferrable across different patient groups, which ultimately impedes the clinical relevance of findings. Our review underscores the necessity for careful consideration when reporting and interpreting the MID. Failure to acknowledge these limitations poses the risk of erroneously categorizing patients as non-responders despite improvement, or conversely, as responders when there has not been a meaningful change [68].
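As an illustration of why distribution-based MID estimates vary, two commonly used estimators (half the baseline SD, and one standard error of measurement) are sketched below; the baseline scores and the reliability value in the checks are hypothetical.

```python
from statistics import stdev

def distribution_based_mid(baseline_scores, reliability):
    """Two common distribution-based MID estimates.

    reliability: a reliability coefficient for the instrument,
    e.g., a test-retest ICC.
    """
    sd = stdev(baseline_scores)
    return {
        "half_sd": 0.5 * sd,                       # 0.5 * baseline SD
        "one_sem": sd * (1 - reliability) ** 0.5,  # 1 standard error of measurement
    }
```

Even on identical scores the two estimators rarely agree, and the SEM-based value shifts with whichever reliability coefficient is plugged in, illustrating the context specificity discussed above.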

During our analysis using the COSMIN reporting guideline, a notable issue emerged: the dearth of reporting on the amount of missing PROM data and/or the methods used to handle it. Investigating the factors contributing to missing data in PROMs is essential, as missingness greatly influences our understanding of a PROM's characteristics and subsequent analyses. Factors such as response burden, item irrelevance, comprehension difficulties, and questionnaire design problems can all contribute to missing data in PROMs [69]. Accurately reporting the extent of missing data is equally important, as anticipated missing data can affect the selection of PRO instruments and the use of PROs as endpoints in clinical trials [70, 71]. Our analysis also revealed a significant issue with the lack of detailed reporting on the version and mode of administration of PROMs, which hampers a thorough evaluation of their psychometric properties. The mode of administration, such as interviewer-administered or self-administered, can introduce variations in response patterns and influence participant responses [72]. In addition, the SGRQs have versions in multiple languages with varying recall periods, potentially impacting their psychometric properties. To ensure accurate understanding and interpretation of results, it is crucial to address this reporting gap and to emphasize the importance of comprehensive information on the version and mode of administration in research studies utilizing the SGRQs.

Future research on the psychometric properties of the SGRQ and SGRQ-I should address the previously mentioned gaps and limitations. Firstly, there is a need for further research on the content validity of both PROMs among patients with IPF. The SGRQ was originally developed for COPD and asthma patients, and its relevance to IPF needs to be established through independent studies assessing content validity in this population. Secondly, more studies are required to assess the structural validity of the SGRQ and SGRQ-I in patients with IPF. Additionally, future research should aim to improve the reliability of the questionnaires, particularly in terms of internal consistency and test–retest reliability. Studies should ensure that the internal consistency analysis is performed separately for each subscale, and alternative measures may be considered for assessing the quality of the symptom domain in formative models. Furthermore, there is a need to explore the construct validity of the SGRQ and SGRQ-I using appropriate comparator instruments validated in the IPF population. Clear hypotheses regarding the expected strength and direction of correlations should be formulated before conducting analyses to guide the interpretation of findings; this would help establish the relationships between the PROMs and other measures in a manner consistent with the theoretical concepts being measured. Lastly, future studies should investigate the floor and ceiling effects of the SGRQ and SGRQ-I in IPF; understanding the presence and extent of these effects is important for evaluating the instruments’ ability to capture the full range of HRQoL in this population. Addressing these gaps would strengthen the evidence base for the SGRQ and SGRQ-I in IPF research and clinical practice, facilitating more accurate and meaningful assessments of HRQoL in this patient population.
Thus, given the foregoing, we cannot recommend either the SGRQ or the SGRQ-I over the other in IPF patients at this time. Both instruments have demonstrated strengths and weaknesses across domains of measurement validity and reliability, and further research and clinical validation are necessary to determine the most suitable instrument for assessing health-related quality of life in IPF patients.

Our systematic review has multiple strengths. First, the review's comprehensiveness is evident in the exhaustive search across a wide range of databases. Second, the protocol for the review was pre-registered. Third, two reviewers independently conducted screening, data extraction, and quality assessment for 100% of the studies. Fourth, the review rigorously assessed the quality of studies and the psychometric properties of the included instruments following the COSMIN 2018 criteria, enhancing the credibility of the assessment. Finally, we conducted meta-analyses on construct validity for the total score and subdomains against various measures, providing a more comprehensive summary of the evidence on the measurement properties. However, there are several potential limitations to consider. First, despite our comprehensive search, some eligible studies may have been missed. Furthermore, because our conclusions were derived from aggregate data, there exists the possibility of an ecological fallacy such as Simpson's paradox, “a type of ecological fallacy in probability and statistics where a trend appears in several groups of data but vanishes or reverses when the groups are combined” [73]. Also, we observed that some measures received an inadequate score for their psychometric properties despite being close to the COSMIN-established cutoff for a “sufficient” rating (e.g., a coefficient alpha of 0.69 rather than 0.70 or above). As such, we believe that the current COSMIN criteria may underestimate the quality of an instrument's psychometric properties. The same applies to the assessment of the methodological quality of studies, which employs the “worst case counts” rule and therefore downgrades the quality of a study even when there is only a single concern about study quality.
Also, we acknowledge that the diverse versions of the SGRQs, including reported and unreported variations in recall periods, modes of administration, and language, could have influenced the heterogeneity observed in our meta-analysis and the certainty of our conclusions. Furthermore, there was a slight deviation from the COSMIN criteria in that we assessed all the psychometric properties despite the absence of at least one study on content validity. Finally, while some may consider that this review needs updating, given that the last search was conducted in June 2022, there is no firm consensus on when a review should be updated; moreover, the time and effort devoted to conducting the current review were substantial, and it is highly unlikely that an update would change the overall direction of the findings.
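The ecological fallacy noted among the limitations can be illustrated numerically. In the hypothetical data below, the correlation between a score and a comparator measure is strongly negative within each of two severity groups, yet strongly positive when the groups are pooled.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x)
                  * sum((b - my) ** 2 for b in y)) ** 0.5

# Hypothetical data from two severity groups: the within-group trend
# is negative, but the severer group has higher values on both axes.
mild_x, mild_y = [1, 2, 3], [10, 9, 8]
severe_x, severe_y = [6, 7, 8], [20, 19, 18]
pooled_x, pooled_y = mild_x + severe_x, mild_y + severe_y
```

Here `pearson_r(mild_x, mild_y)` and `pearson_r(severe_x, severe_y)` are both negative while `pearson_r(pooled_x, pooled_y)` is strongly positive, which is the trend reversal described by Simpson's paradox.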

5 Conclusions

In summary, it is crucial to acknowledge the limitations of employing the SGRQs in IPF, given the lack of evidence and research supporting their content validity and structural validity in this population, considering that they were not originally designed for patients with IPF. However, the SGRQs demonstrated acceptable construct validity and responsiveness, supporting their usefulness as PROMs in IPF. According to COSMIN guidelines, our findings indicate that the SGRQs show potential for recommendation as suitable PROMs for IPF but require further validation studies before a conclusive recommendation can be made.