Introduction

With an estimated 425 million individuals affected worldwide, clinically important obstructive sleep apnoea (OSA) poses a global public health problem [1]. Characterised by upper airway collapse, exaggerated negative intrathoracic pressure, oxidative stress, and systemic inflammation, OSA is associated with significant cardiovascular and metabolic complications, including hypertension, stroke, heart failure, and diabetes [2,3,4,5,6,7].

Despite the high prevalence and associated sequelae, most individuals with OSA remain undiagnosed, posing a significant risk to the individual patient and health care systems as complications develop [1, 8,9,10]. Barriers to the diagnosis and treatment of OSA are multifaceted and include geographical variation and inequity in the availability of sleep services and access to polysomnography (PSG), often limited by cost and long waiting times [11].

To support risk stratification and appropriate referrals, a simple and reliable screening tool may help triage at-risk individuals for referral to specialist services and appropriate management [12,13,14]. Clinical prediction formulae have been developed but are limited by their complexity and the need for a computer or mathematical calculation [15]. In contrast, OSA screening questionnaires are less complicated and may be a viable alternative to clinical prediction formulae in specific settings.

To date, four systematic reviews have explored the accuracy of OSA screening tools in adults [12, 16,17,18]. One of the first systematic reviews and meta-analyses in this area identified four screening questionnaires; however, owing to heterogeneity in the questionnaires, OSA definitions, and thresholds, these were not meta-analysed [16]. Ramachandran [17] reported that clinical prediction models outperformed the eight questionnaires studied for predicting OSA in pre-operative cohorts. Abrishami [12] focused on a ‘sleep disorder’ cohort and a cohort ‘without a history of sleep disorders’, concluding that questionnaires were useful for early detection of OSA, especially in the surgical population. Despite finding it difficult to draw a definite conclusion about questionnaire accuracy, the authors recommended the STOP and STOP-Bang questionnaires for screening in a surgical population [12]. More recently, Chiu [18] compared the diagnostic accuracy of the Berlin, STOP-Bang, STOP, and Epworth Sleepiness Scale questionnaires. In line with Abrishami [12], they reported that the STOP-Bang had the highest sensitivity in both the sleep clinic and surgical populations.

Since the publication of these systematic reviews, new OSA screening questionnaires have emerged, further validation studies conducted, and different clinical settings and patient cohorts considered. As test performance often varies across clinical cohorts, it is recommended that tools are evaluated in clinically relevant cohorts [19]. Hence, the objective of this systematic review and meta-analysis was to evaluate the accuracy and clinical utility of existing questionnaires, when used alone, as screening tools for the identification of OSA in adults in different clinical cohorts.

Methods

The protocol was registered at the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42018104018) and conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) guidelines [20].

Types of studies

We included observational studies that met the following eligibility criteria:

Inclusion criteria: (1) prospective studies measuring the diagnostic value of screening questionnaires for OSA; (2) studies in adults (≥ 18 years of age); (3) studies in which the accuracy of the questionnaire was validated by level one or two PSG; (4) OSA defined as an apnoea-hypopnoea index (AHI) or respiratory disturbance index (RDI) ≥ 5; (5) data allowing construction of 2 × 2 contingency tables; (6) publication in English, Spanish, or Portuguese.

Exclusion criteria: (1) studies measuring the diagnostic value of clinical scales, scores, and prediction equations as screening tools for OSA; (2) conference proceedings, reviews, or case reports; (3) insufficient data for analysis after several attempts to contact the authors; (4) studies in children (< 18 years of age); (5) studies in which level three or four portable monitoring was used as the reference standard; (6) studies conducted in in-patient settings; (7) publication in a language other than English, Spanish, or Portuguese.

Index test: the test under evaluation was only OSA screening questionnaires (self-reported or clinician completed).

Reference standard: the reference standard was a level one or two PSG.

Target condition: the target condition was OSA, defined by AHI or RDI thresholds as follows:

  • AHI/RDI ≥ 5—diagnostic cut-off for OSA

  • AHI/RDI ≥ 15—diagnostic cut-off for moderate to severe OSA

  • AHI/RDI ≥ 30—diagnostic cut-off for severe OSA

Search methods for identification of studies

Comprehensive literature searches of CINAHL PLUS, Scopus, PubMed, Web of Science, and the Latin American and Caribbean Health Sciences Literature (LILACS) database were conducted from inception to 18 December 2020. Detailed individual search strategies (Online Resource 1 & 2), with appropriate truncation and word combinations, were developed for each database. Additional records were identified from grey literature sources comprising EThOS, OpenGrey, Google Scholar, ProQuest, and the New York Grey Literature Report. The reference lists of the articles selected for analysis and of related review articles were searched manually for references that might have been missed by the electronic database searches.

Data collection and analysis

Study selection

Two reviewers (LB, EB) independently screened the titles and abstracts of the electronic search results to identify studies eligible for inclusion. Records classified as ‘excluded’ by both reviewers were removed. The full text of any study about which there was disagreement or uncertainty was assessed independently against the selection criteria, with disagreements resolved through discussion and consultation with a third reviewer (IS or NR). Duplicates were identified and excluded before the selection process was recorded in sufficient detail to complete the PRISMA flow diagram and the tables describing the characteristics of the excluded studies (Online Resource 3) [20].

Data extraction and management

Two reviewers (LB, EB) independently conducted data extraction on all studies included and extracted the data required to reconstruct the 2 × 2 contingency tables, including true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values. Where these values were not documented, we extrapolated the values from equations when data allowed. A data collection form tailored to the research question and fulfilling the data entry requirements of MetaDTA (Diagnostic Test Accuracy Meta-Analysis v1.43) was utilised [21].

HP and JR extracted the study characteristics and demographic data for all included studies, and LB and EB entered the data into Review Manager 5.3 [22].

No studies with inconclusive results were identified.

Assessment of methodological quality

The quality of studies included was appraised independently by the reviewers (LB, EB) utilising the Quality Assessment for Diagnostic Accuracy Studies tool (QUADAS-2) with disagreements resolved through consultation with a third reviewer (IS or NR) [23].

Statistical analysis and data synthesis

Statistical analysis was performed according to “Chapter 10” of the Cochrane Handbook for Systematic Review of Diagnostic Test Accuracy [24].

Questionnaire screening was considered positive for OSA if the questionnaire score was above the defined threshold specified in the primary study and negative if the questionnaire score was below the defined threshold. The TP, FP, TN, and FN results were produced by cross-classifying the questionnaire results with those of the PSG results. These were based on the ability of screening questionnaires to classify and detect OSA correctly.
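As a minimal sketch of this cross-classification (the `contingency_table` helper and the data are invented for illustration; they are not part of the review's tooling):

```python
# Hypothetical sketch: cross-classifying questionnaire results against PSG
# to produce the 2 x 2 contingency table counts. Data are invented.

def contingency_table(questionnaire_positive, psg_positive):
    """Return (TP, FP, TN, FN) from paired boolean screening/PSG results."""
    tp = fp = tn = fn = 0
    for screen, disease in zip(questionnaire_positive, psg_positive):
        if screen and disease:
            tp += 1          # screen positive, OSA on PSG
        elif screen and not disease:
            fp += 1          # screen positive, no OSA on PSG
        elif not screen and not disease:
            tn += 1          # screen negative, no OSA on PSG
        else:
            fn += 1          # screen negative, OSA on PSG
    return tp, fp, tn, fn

# Six hypothetical patients: questionnaire score above threshold vs AHI >= 5
screen = [True, True, False, True, False, False]
psg = [True, False, False, True, True, False]
print(contingency_table(screen, psg))  # -> (2, 1, 2, 1)
```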

The sensitivity and specificity of individual studies were calculated using 2 × 2 contingency tables and presented as forest plots. The meta-analysis was conducted using MetaDTA version 1.43, which models sensitivity and specificity by fitting the random effects bivariate binomial model of Chu and Cole [25, 26]. The summary receiver operating characteristic (SROC) plot was drawn using the hierarchical SROC parameters, which are estimated from the bivariate model parameters using the equivalence equations of Harbord [27]. Following guidance from the Cochrane Handbook for Systematic Review of Diagnostic Test Accuracy, we did not pool the positive and negative predictive values due to the prevalence of OSA varying across studies [24].
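The per-study summary statistics can be sketched as follows. This is illustrative only: the helper name and input counts are invented, and the DOR interval here uses a standard single-study log-normal approximation, not the bivariate pooling model of Chu and Cole:

```python
import math

def accuracy_stats(tp, fp, tn, fn):
    """Sensitivity, specificity, and diagnostic odds ratio for one study,
    with an approximate 95% CI for the DOR (log-normal approximation)."""
    sens = tp / (tp + fn)                 # true positive rate
    spec = tn / (tn + fp)                 # true negative rate
    dor = (tp * tn) / (fp * fn)           # (sens/(1-sens)) / ((1-spec)/spec)
    se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / tn + 1 / fn)
    ci = (math.exp(math.log(dor) - 1.96 * se_log_dor),
          math.exp(math.log(dor) + 1.96 * se_log_dor))
    return sens, spec, dor, ci

# Invented counts for a single hypothetical study
sens, spec, dor, ci = accuracy_stats(tp=85, fp=40, tn=30, fn=15)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, DOR={dor:.2f}")
# -> sensitivity=0.85, specificity=0.43, DOR=4.25
```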

As per the Cochrane DTA handbook, we investigated heterogeneity by plotting the observed study results and SROC curve in the ROC space alongside the 95% confidence region [24].

We conducted a meta-regression to investigate differences in sensitivity and specificity between questionnaires, including the type of questionnaire as a covariate. Meta-regression was conducted in R version 4.0.1 using the lme4 package [28].

To assess the robustness of the meta-analysis, sensitivity analyses were conducted by excluding studies based on their QUADAS-2 assessment [23]. Studies rated as high risk in any QUADAS-2 domain, or as unclear in four domains, were excluded. The included studies applied different AASM (American Academy of Sleep Medicine) scoring criteria and desaturation (and arousal) thresholds. We therefore conducted additional sensitivity analyses, analysing together the studies that applied the ≥ 3% desaturation scoring criteria and, separately, those that applied the ≥ 4% criteria (summarised in Table 1).

Table 1 Study characteristics

We neither explored reporting bias, nor assessed publication bias due to the uncertainty about the determinants of publication bias for diagnostic accuracy studies, and the inadequacy of tests for detecting funnel plot asymmetry [74].

Results

Search results and study characteristics

Search results are summarised in Fig. 1.

Fig. 1
figure 1

Flowchart of search results

Of 45 studies, 29 were included for meta-analysis in the sleep clinic population (n = 10,951), 7 were included for meta-analysis in the surgical population (n = 2275), and 2 were included in the resistant hypertension population (n = 541). The remaining 7 studies were excluded from the meta-analysis due to heterogeneity of included populations. Study characteristics and demographic data of the included studies are summarised in Tables 1 and 2 [29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73]. Overall, 10 clinical settings were identified, of which the sleep clinic, surgical, and resistant hypertension cohorts had sufficient studies for inclusion in the meta-analysis.

Table 2 Demographic data

OSA obstructive sleep apnoea, AHI apnoea-hypopnoea index, RDI respiratory disturbance index, Lab laboratory, PSG polysomnography, AASM American Academy of Sleep Medicine.

SD standard deviation, kg kilogramme, m metre, cm centimetre, NC neck circumference, WC waist circumference, AHI apnoea-hypopnoea index, n/a not applicable.

Methodological quality of included studies

Results of the QUADAS-2 assessment are summarised in Fig. 2 and Online Resource 4 [29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73].

Fig. 2
figure 2

Risk of bias summary using the QUADAS-2 tool

In the patient selection domain, 3 studies were rated as high risk of bias due to the case–control study design. For both the index test and reference standard domains, 18 studies were rated as unclear risk of bias due to inadequate information related to blinding; it was unclear if the index test and reference standard findings were interpreted without the knowledge of the other. Thirty-four studies were rated as unclear risk of bias in the flow and timing domain due to lack of reporting on the time interval between the index test and the reference standard. Applicability was rated as low risk in all 45 studies.

Sleep clinic population

In the sleep clinic population (N = 10,951) (Fig. 3), the Berlin (score cut-off ≥ 2) (Online Resource 5), STOP (score cut-off ≥ 2), and STOP-Bang (score cut-off ≥ 3) (Online Resource 6) questionnaires were included in the meta-analysis [58, 75]. The ASA checklist, SA-SDQ, and STOP-Bang (cut-off ≥ 5) questionnaires were excluded due to insufficient studies.

Fig. 3
figure 3

Questionnaire studies in sleep clinic population

Predictive parameters of the Berlin questionnaire (score cut-off ≥ 2)

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 84%, 64%, and 50% respectively. The pooled sensitivity of the Berlin questionnaire to predict all OSA, moderate–severe, and severe OSA was 85% (95% confidence interval (CI): 79%, 89%), 84% (95% CI: 79%, 89%), and 89% (95% CI: 80%, 94%) respectively. Pooled sensitivity remained consistent across OSA severity. Pooled specificity was 43% (95% CI: 30%, 58%), 30% (95% CI: 20%, 41%), and 33% (95% CI: 21%, 46%) respectively. The corresponding diagnostic odds ratios (DORs) were 4.3 (95% CI: 0.7, 7.8), 2.3 (95% CI: 1.3, 3.3), and 3.9 (95% CI: 2.1, 5.7) (Fig. 4, Table 3).

Fig. 4
figure 4

Forest plots for Berlin questionnaire in sleep clinic population (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Table 3 Summary statistics for Berlin, STOP, and STOP-Bang questionnaires in the sleep clinic population

Predictive parameters of the STOP questionnaire (score cut-off ≥ 2)

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 67%, 58%, and 46% respectively. The pooled sensitivity of the STOP questionnaire to predict all OSA, moderate–severe, and severe OSA was 90% (95% CI: 82%, 95%), 90% (95% CI: 75%, 97%), and 95% (95% CI: 88%, 98%) respectively. The pooled specificity was 31% (95% CI: 15%, 53%), 29% (95% CI: 10%, 61%), and 21% (95% CI: 10%, 39%) respectively. The corresponding DORs were 4.2 (95% CI: 0.8, 7.6), 3.8 (95% CI: 1.7, 5.9), and 4.7 (95% CI: 2.6, 6.8) respectively (Fig. 5, Table 3). Greater uncertainty and variability in specificity were evident in the width of the confidence intervals and the scatter of individual study estimates.

Fig. 5
figure 5

Forest plots for STOP questionnaire in sleep clinic population (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Predictive parameters of the STOP-Bang questionnaire (score cut-off ≥ 3)

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 80%, 59%, and 39%, respectively. The pooled sensitivity of the STOP-Bang questionnaire to predict all OSA, moderate–severe, and severe OSA was 92% (95% CI: 87%, 95%), 95% (95% CI: 92%, 96%), and 96% (95% CI: 93%, 98%) respectively. The pooled specificity was 35% (95% CI: 25%, 46%), 27% (95% CI: 18%, 34%), and 28% (95% CI: 20%, 38%) respectively. The corresponding DORs were 6.0 (95% CI: 4.4, 7.6), 6.4 (95% CI: 3.3, 9.5), and 9.2 (95% CI: 5.9, 12.4) respectively (Fig. 6, Table 3). Greater uncertainty and variability in specificity were evident in the width of the confidence intervals and the scatter of individual study estimates, particularly for AHI ≥ 5.

Fig. 6
figure 6

Forest plots for STOP-Bang questionnaire in sleep clinic population (generated using the software Review Manager 5.3, The Cochrane Collaboration)

SROC plots were used to display the results of individual questionnaires in the ROC space, plotting each questionnaire as a single sensitivity–specificity summary point [24]. When the SROC curves for all three questionnaires were plotted on the same axes, the confidence regions of the Berlin, STOP, and STOP-Bang questionnaires for all OSA (AHI ≥ 5) (Fig. 7) and severe OSA (AHI ≥ 30) (Fig. 9) overlapped, suggesting no statistically significant difference in accuracy among the three questionnaires.

Fig. 7
figure 7

Summary ROC for Berlin, STOP, and STOP-Bang questionnaires AHI ≥ 5 (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Figure 8 shows no overlap of the confidence regions for the Berlin and STOP-Bang questionnaires, suggesting a possible difference in sensitivity between the two questionnaires. A meta-regression model assuming equal variances for logit sensitivity and logit specificity suggested that the expected sensitivity or specificity differed between the two tests (chi-square = 14.1, 2df, p = 0.0008) (Fig. 9).

Fig. 8
figure 8

Summary ROC for Berlin, STOP, and STOP-Bang questionnaires AHI ≥ 15 (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Fig. 9
figure 9

Summary ROC for Berlin, STOP, and STOP-Bang questionnaires AHI ≥ 30 (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Surgical population

In the surgical population (n = 2710) (Fig. 10), we identified the Berlin, STOP, and STOP-Bang questionnaires for inclusion in the meta-analysis. The ASA checklist and OSA50 questionnaires were excluded from meta-analysis due to an insufficient number of studies. Nunes included two surgical cohorts, abdominal and coronary artery bypass grafting, which were entered as separate cohorts [63].

Fig. 10
figure 10

Questionnaire studies in surgical population

Predictive parameters of the Berlin questionnaire (score cut-off ≥ 2)

Two studies were included in the meta-analysis of the Berlin questionnaire for moderate to severe OSA (AHI ≥ 15) (Fig. 11). Due to insufficient data, we were unable to conduct a meta-analysis for all OSA (AHI ≥ 5) or severe OSA (AHI ≥ 30).

Fig. 11
figure 11

Forest plot for Berlin questionnaire in surgical population for AHI ≥ 15 (generated using the software Review Manager 5.3, The Cochrane Collaboration)

The prevalence of moderate to severe OSA or AHI of ≥ 15 was 42%. The pooled sensitivity of the Berlin questionnaire to predict moderate to severe OSA (AHI ≥ 15) was 76% (95% CI: 66%, 84%), and the pooled specificity was 47% (95% CI: 32%, 62%). The DOR was 2.9 (95% CI: 0.2, 5.5) (Table 4).

Table 4 Pooled predictive parameters of Berlin and STOP-Bang questionnaires in surgical population for AHI ≥ 15

Predictive parameters of the STOP questionnaire (score cut-off ≥ 2)

Two studies were eligible for inclusion in the STOP questionnaire meta-analysis for moderate to severe OSA (AHI ≥ 15). However, due to insufficient studies and large heterogeneity around the specificity, the STOP questionnaire was excluded from the meta-analysis (Fig. 12).

Fig. 12
figure 12

Forest plot for STOP questionnaire in surgical population for AHI ≥ 15 (generated using the software Review Manager 5.3, The Cochrane Collaboration)

Predictive parameters of the STOP-Bang questionnaire (score cut-off ≥ 3)

We included 6 studies in the meta-analysis of the STOP-Bang questionnaire for moderate to severe OSA (AHI ≥ 15) (Fig. 13).

Fig. 13
figure 13

Forest plots for STOP-Bang questionnaire in surgical population (generated using the software Review Manager 5.3, The Cochrane Collaboration)

The prevalence of AHI ≥ 5 (all OSA), AHI ≥ 15 (moderate to severe), and AHI ≥ 30 (severe) OSA was 72%, 33%, and 21%, respectively. The pooled sensitivity of the STOP-Bang questionnaire to predict all OSA, moderate–severe, and severe OSA was 85% (95% CI: 81%, 88%), 90% (95% CI: 87%, 93%), and 96% (95% CI: 92%, 98%) respectively. The pooled specificity was 40% (95% CI: 30%, 50%), 27% (95% CI: 19%, 37%), and 26% (95% CI: 21%, 46%) respectively. The corresponding DORs were 3.6 (95% CI: 2.3, 4.8), 3.4 (95% CI: 1.9, 4.9), and 8.4 (95% CI: 2.7, 14.2), respectively (Table 4). Compared to the Berlin and STOP questionnaires, individual study estimates of sensitivity appeared more homogeneous for the STOP-Bang questionnaire (Figs. 11, 12, and 13).

Predictive performance of STOP-Bang questionnaires at various questionnaire scores

In the surgical population, two of six studies reported data at multiple cut-off points for the STOP-Bang questionnaire for moderate to severe OSA (AHI ≥ 15) [62, 63]. Increasing the threshold from ≥ 4 to ≥ 7 increased specificity from 31% (95% CI: 20%, 40%) to 96% (95% CI: 89%, 99%), with specificity greatest at cut-off values of ≥ 6 and ≥ 7 (Table 5). However, the gain in specificity came at the expense of reduced sensitivity.

Table 5 STOP-Bang questionnaire at different questionnaire cut-offs for moderate–severe OSA (AHI ≥ 15) in the surgical population
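The sensitivity–specificity trade-off across cut-off points can be illustrated with a toy threshold sweep; the scores and PSG labels below are invented and do not come from the included studies:

```python
# Hypothetical STOP-Bang scores (0-8) with invented PSG outcomes (AHI >= 15).
scores = [1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8]
has_osa = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

for cutoff in range(3, 8):
    tp = sum(1 for s, d in zip(scores, has_osa) if s >= cutoff and d)
    fn = sum(1 for s, d in zip(scores, has_osa) if s < cutoff and d)
    tn = sum(1 for s, d in zip(scores, has_osa) if s < cutoff and not d)
    fp = sum(1 for s, d in zip(scores, has_osa) if s >= cutoff and not d)
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    # As the cut-off rises, sensitivity can only fall and specificity rise
    print(f"cut-off >= {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```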

Resistant hypertension population

We included 2 studies (n = 517) in the meta-analysis of the Berlin questionnaire (cut-off ≥ 2) for all OSA (AHI ≥ 5) [65, 66]. Due to insufficient study data, we were unable to conduct a meta-analysis for moderate–severe (AHI ≥ 15) or severe OSA (AHI ≥ 30).

The prevalence of all OSA or an AHI of ≥ 5 was 80%. The Berlin questionnaire’s pooled sensitivity to predict all OSA or AHI of ≥ 5 was 80% (95% CI: 60%, 92%), and the pooled specificity was 36% (95% CI: 21%, 55%). The DOR was 2.2 (95% CI: 0.7, 3.8).

Other cohorts

Asthma, community clinic, highway bus driver, neurology clinic, primary care, respiratory clinic, and snoring clinic cohorts were identified but were excluded from the meta-analysis because each comprised only one study (Online Resource 7) [67,68,69,70,71,72,73].

Sensitivity analyses

Risk of bias

No studies were evaluated as high risk in the surgical and resistant hypertension populations; therefore, no sensitivity analyses were conducted.

In the sleep clinic population, sensitivity analyses were conducted for the Berlin (Online Resource 8), STOP-Bang (Online Resource 9), and STOP questionnaires for AHI ≥ 5, AHI ≥ 15, and AHI ≥ 30 (Online Resource 10), excluding studies identified as high risk in any QUADAS-2 domain, as unclear in four domains, or as outliers.

We excluded one study each for the STOP questionnaire at AHI ≥ 5 [49], AHI ≥ 15 [44], and AHI ≥ 30 [44]. For the STOP-Bang questionnaire, we excluded five studies at AHI ≥ 5 [29,30,31, 34, 46] and four studies each at AHI ≥ 15 [30, 35, 38, 44] and AHI ≥ 30 [30, 35, 38, 44]. For the Berlin questionnaire, we excluded four studies each at AHI ≥ 5 [45, 46, 53, 55] and AHI ≥ 15 [45, 46, 53, 55], and three studies at AHI ≥ 30 [44, 45, 55].

Across all three questionnaires, exclusion of studies was associated with stable or slightly increased sensitivity. In contrast, sensitivity analysis was associated with reduced specificity (Online Resources 8–10). The STOP-Bang questionnaire remained the most effective questionnaire, with the highest sensitivity compared to the Berlin and STOP questionnaires. Specificity among all three questionnaires remained low.

Desaturation and arousal criteria

Due to an insufficient number of studies, no sensitivity analysis was conducted in the resistant hypertension population.

In the surgical population, the Berlin and STOP questionnaire studies all utilised the ≥ 3% desaturation scoring criteria; therefore, no sensitivity analyses were conducted for these questionnaires. For the STOP-Bang questionnaire, studies applied either the ≥ 3% or the ≥ 4% desaturation criteria. When we applied the ≥ 3% desaturation criteria to the STOP-Bang questionnaire, we excluded one study for AHI ≥ 5 [60], two studies for AHI ≥ 15 [60, 64], and one study for AHI ≥ 30 [60]; across the three AHI thresholds, sensitivity remained stable or decreased slightly. When we instead applied the ≥ 4% desaturation criteria, we excluded four studies for AHI ≥ 15 [59, 61,62,63]; this was associated with a slight reduction in sensitivity and an increase in specificity at AHI ≥ 15 (Online Resource 11).

We conducted a sensitivity analysis in the sleep clinic population for the Berlin, STOP, and STOP-Bang questionnaires, applying both the ≥ 3% and ≥ 4% desaturation criteria respectively. Studies were excluded on the basis of high risk of bias, scoring criteria not specified, and desaturation criteria (≥ 3% or ≥ 4%) (Online Resource 12).

Across all three questionnaires in the sleep clinic population, exclusion of studies was associated with stable sensitivity and reduced specificity, particularly when applying the ≥ 4% desaturation criterion (Online Resources 13, 14, 15). Overall, the STOP-Bang questionnaire remained the most effective questionnaire with the highest sensitivity compared to the Berlin and STOP questionnaires. Specificity among all three questionnaires remained low.

Discussion

This systematic review and meta-analysis investigated the accuracy and clinical utility of questionnaires as screening tools for OSA in adults across different clinical cohorts.

Consistent with previous studies, our findings showed that the STOP-Bang questionnaire (score cut-off ≥ 3) had the highest sensitivity for detecting OSA and the highest diagnostic odds ratio in both the sleep clinic and surgical populations [12, 18, 76]. However, the STOP-Bang questionnaire was limited by consistently low specificity across all AHI thresholds, resulting in high false positive rates. The Berlin questionnaire (score cut-off ≥ 2) appeared to be the least useful, demonstrating low sensitivity and low specificity across all three cohorts [12, 18, 77]. Although there was no comparison with other questionnaires in the resistant hypertension cohort, findings were comparable with those in the sleep clinic and surgical cohorts.

OSA screening questionnaires are intended to provide the information required to identify patients most likely to benefit from downstream management decisions, such as onward referral for objective sleep testing and possible treatment following a positive diagnostic test. The potential utility of OSA screening questionnaires for risk stratification has been demonstrated in several cohorts. Not only has OSA been associated with peri-operative complications and consequent longer hospital stays, but it has also been linked to poor clinical outcomes, including higher rates of post-CABG atrial fibrillation [78,79,80]. In the context of the ongoing coronavirus disease 2019 (Covid-19) pandemic, a recent study reported worse clinical outcomes in patients with Covid-19 classified by the Berlin questionnaire as high risk of OSA, compared to those classified as low risk [81]. The study also highlighted the challenges of objective assessment of OSA with PSG during the Covid-19 pandemic, emphasising the need for alternative approaches, such as validated screening questionnaires. In this context, we would encourage the assessment and validation of OSA screening questionnaires, in particular the STOP-Bang, as risk stratification tools in appropriate clinical settings, with the aim of improving outcomes for patients.

Although sensitivity and specificity provide the information needed to discriminate between the available screening questionnaires, their clinical value and application are demonstrated by the positive and negative predictive values, which depend on the prevalence of disease in the given clinical population. Although we were unable to pool the predictive values of individual questionnaires because prevalence varied across studies, the point estimates of the positive predictive value (PPV) and negative predictive value (NPV) for the STOP-Bang questionnaire in both the sleep clinic and surgical populations (Online Resource 16) showed that NPV increases with OSA severity. The combination of high sensitivity and high NPV of the STOP-Bang questionnaire is therefore useful to help clinicians exclude patients at low risk of clinically significant OSA.
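Why predictive values cannot be pooled across cohorts can be shown with a hypothetical sketch: holding sensitivity and specificity fixed, PPV and NPV shift as prevalence changes (the numbers below are illustrative, not estimates from the review):

```python
def predictive_values(sens, spec, prevalence):
    """PPV and NPV from sensitivity, specificity, and prevalence (Bayes' rule)."""
    ppv = (sens * prevalence) / (
        sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = (spec * (1 - prevalence)) / (
        spec * (1 - prevalence) + (1 - sens) * prevalence)
    return ppv, npv

# Same hypothetical test characteristics, three prevalence scenarios
for prev in (0.3, 0.6, 0.9):
    ppv, npv = predictive_values(sens=0.92, spec=0.35, prevalence=prev)
    print(f"prevalence={prev:.0%}: PPV={ppv:.2f}, NPV={npv:.2f}")
```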

At the same time, the low specificity of the STOP-Bang questionnaire (and therefore its relative inability to correctly identify patients without OSA) leads to a high rate of false positive findings; this may have emotional and cognitive implications for individual patients with added consequences for clinical services, not least cost [80, 82].

This systematic review’s main strength lies in our comprehensive literature search with stringent eligibility criteria, designed to identify all relevant studies reporting the accuracy and clinical utility of existing OSA screening questionnaires validated against the gold standard, PSG. Our inclusion of the LILACS database expanded the search to studies from Latin America and the Caribbean. Of previous reports, the review by Ramachandran [17] was limited to a search of two databases and English-language publications only, and omitted grey literature sources from its search strategy. Additionally, it was unclear whether Ross [16] and Abrishami [12] included any grey literature sources in their searches.

Two independent reviewers completed data extraction, and we used the QUADAS-2 tool to rigorously assess all included studies for risk of bias. To evaluate the robustness of the meta-analysis, we conducted sensitivity analyses to investigate the potential influence on our findings of studies at high, or unclear, risk of bias. Although our study did not explore source differences from an ethnicity or geographical perspective, we conducted a further sensitivity analysis to evaluate the impact of varying scoring criteria on our findings. The use of different AASM scoring criteria and desaturation (and arousal) thresholds across studies created a source of variability [83,84,85]. Although the definition of apnoeas has remained stable, there has been much controversy about the definition of hypopnoeas, specifically regarding flow reduction, oxygen desaturation, and the presence or absence of arousal [86]. Varying definitions of hypopnoea not only affect prevalence estimates but are also likely to underestimate OSA in patients who may benefit from treatment [86]. Guilleminault et al. (2009) showed that using the 30% flow reduction and 4% desaturation criteria without arousal would have missed 40% of patients who were identified using the arousal-based criteria and who responded to CPAP therapy with a reduction in AHI and symptomatic improvement [87].

Against this background, our review is based on a larger number of studies than prior analyses [12, 16, 17]. Although the review by Chiu [18] encompassed a larger dataset, that report carried a greater risk of bias due to the inclusion of retrospective studies and of studies that used either PSG or portable monitoring as the reference standard.

This review considered all existing OSA screening questionnaires for inclusion. In contrast, Chiu [18] pre-selected four questionnaires, including the ESS, which was developed not as a screening questionnaire but as a measure of daytime sleepiness.

Similar to Abrishami [12] and Chiu [18], our review focused on questionnaires only, in contrast to Ross [16] and Ramachandran [17], who also included portable monitoring and clinical prediction tools, respectively.

There are a number of limitations to this work. Our findings are influenced by the limitations of the included studies. In several studies, the true risk of bias was unclear in multiple QUADAS-2 domains due to underreporting in the index test, reference standard, and flow and timing domains. Similarly, it was often unclear whether the results of the index test and the reference standard were interpreted independently. Very few studies provided adequate information to determine whether the time interval between the index test and the reference standard was appropriate.

Our decision to exclude seven additional clinical cohorts may be considered a limitation; however, in the context of unclear, and possibly substantial, differences among these studies in the patient spectrum and disease prevalence, we felt it appropriate not to include these in the meta-analysis. Because the accuracy of screening tools varies according to the spectrum of disease, this further reiterates the need for validation studies in similar clinical cohorts.

There was a high degree of heterogeneity among included studies, with the possibility of selection bias, especially in the sleep clinic population. Consequently, reported sensitivity estimates are likely to be higher than those in lower-risk populations, making it difficult to extrapolate the true utility of the questionnaires in clinical practice.

In conclusion, our review investigated the accuracy and clinical utility of existing OSA screening questionnaires in different clinical cohorts. While the STOP-Bang questionnaire had high sensitivity for detecting OSA in both the sleep clinic and surgical cohorts, it lacked adequate specificity. This review highlights the issue of low specificity across OSA screening questionnaires; research is required to explore the reasons for low specificity and strategies for improvement, ideally without reducing sensitivity. The validation of screening questionnaires in sleep clinic populations is limited by possible selection and spectrum bias, reiterating the need for diagnostic validation studies in clinically similar cohorts. Additionally, further research is needed in resistant hypertension and other at-risk populations that we could not include in the meta-analysis. Improvements in the conduct and reporting of diagnostic validation studies are needed to ensure quality and a low risk of bias.

Finally, to enable the extrapolation of the true accuracy and clinical utility of screening questionnaires, validation studies of high methodological quality in comparable, clinically relevant cohorts are required.