Background

Blood oxygen saturation levels require monitoring for health reasons in a wide range of circumstances. Low blood oxygen saturation, if identified to be hypoxemia, requires medical intervention and has been linked to an increased risk of death [1]. The gold standard measure of blood oxygen saturation levels (SaO2) requires a sample of arterial blood and measurement using CO-oximetry. Pulse oximetry, measuring SpO2 as a proxy for SaO2 using a non-invasive and simple device, is frequently used to detect low blood oxygen levels. Pulse oximetry has been widely used during the COVID-19 pandemic, including in non-clinical settings, to detect hypoxemia and inform decisions to escalate care [2].

The current WHO COVID-19 management guideline recommends the ‘use of pulse oximetry monitoring at home as part of a package of care’ for symptomatic people with COVID-19 [3]. Many countries have specific guidance or services for home pulse oximetry in line with this recommendation [2, 4], such as the NHS England COVID Oximetry@home service [2]. The reporting of possible bias in pulse oximetry measurement, including due to skin pigmentation, raised a growing concern about the accuracy of oxygen self-monitoring [5]. Pulse oximetry works by beaming light through skin into the blood and inferring an SpO2 reading from the amount of light absorbed. Higher levels of skin pigmentation could, in theory, affect how light is absorbed, thus possibly affecting the accuracy of pulse oximetry readings. Measurement inaccuracy could have serious clinical implications including the delay of urgent medical care [6]. A recent US study analysed retrospective cohort data from more than 10,000 people, comparing where a diagnosis of occult hypoxemia (an SaO2 of less than 88%) was missed by pulse oximetry [7]. Results showed people described as Black had ‘nearly three times the frequency of occult hypoxemia that was not detected by pulse oximetry’ as those described as White [7]. In November 2021, the UK Health Secretary ordered a review into racial bias in medical equipment, including pulse oximeters.

It is an important time to consider the current evidence base for the impact of skin pigmentation on the accuracy of pulse oximetry compared with the gold standard measure of SaO2. The only current relevant systematic review, published in 1995, included three studies that explicitly considered the impact of skin pigmentation on pulse oximetry accuracy [8]. The review suggested that pulse oximeters may overestimate blood oxygen saturation in people with dark skin [8]. The recent rapid review by the NHS Race and Health Observatory came to similar conclusions but used a non-systematic review process, i.e., no comprehensive search, risk of bias assessment or meta-analysis [6]. Our objective was to conduct a rigorous systematic review of research on the influence of skin pigmentation on the accuracy of oxygen saturation measurement by pulse oximetry (SpO2) compared with SaO2 measured by standard CO-oximetry.

Methods

Search strategy and selection criteria

We report this review in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [9]. The methods used were described in the registered protocol (https://osf.io/gm7ty).

We included any methods-comparison study that compared SpO2 values in any population, in any care setting, measured using any type of commercially available pulse oximeter, with SaO2 measured by standard CO-oximetry [10]; and investigated the accuracy of pulse oximetry based on either the level of skin pigmentation or ethnic group (Additional file 1: Table S1).

We excluded studies that used (1) prototype pulse oximetry devices, (2) pulse oximeters that require highly skilled specialists to operate (such as intra-partum pulse oximetry devices), and (3) pulse oximeters used for measuring venous blood oxygen saturation. We also excluded studies that reported diagnostic test accuracy measures and those with ineligible comparators, including reference pulse oximetry and ineligible reference values of oxygen saturation, e.g. arterial oxygen pressure (PaO2), calculated SaO2 and fractional saturation (%O2Hb or FO2Hb) [10, 11].

Following the British Standards Institution 2019 standards for pulse oximetry [10], we included data on the overall accuracy (accuracy root-mean-square, Arms), mean bias, precision (standard deviation of mean bias, SD) and/or the limits of agreement for the SpO2 and SaO2 comparison, with mean bias as the review’s primary outcome (Additional file 1: Table S1). The Arms combines mean bias and precision in a single measure [10]. Although the British Standards Institution standards give Arms primacy over the other outcomes, it has no intuitive relevance to clinical decision-making. For example, an Arms value of 4% means that about 68% of pulse oximetry readings would be within ± 4% of the gold standard CO-oximetry reading. To aid clinical relevance and interpretation, we therefore chose mean bias as the primary outcome. The mean difference between ‘true’ blood oxygen saturation levels and pulse oximetry readings indicates more clearly how bias could affect clinical decisions that refer to threshold values (e.g. admission to hospital with a pulse oximetry reading of 92% or lower).
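The relationship between mean bias, precision and Arms can be illustrated with a minimal Python sketch. It assumes Arms is the root-mean-square of the paired SpO2 minus SaO2 differences and that the SD is computed with a divisor of n, in which case Arms squared equals mean bias squared plus SD squared. The function name and the paired readings are hypothetical and are not taken from any included study.

```python
import math

def accuracy_summary(spo2, sao2):
    """Return (mean bias, SD of differences, Arms) for paired readings in % saturation."""
    if len(spo2) != len(sao2) or not spo2:
        raise ValueError("need equally sized, non-empty lists of paired readings")
    diffs = [p - a for p, a in zip(spo2, sao2)]
    n = len(diffs)
    mean_bias = sum(diffs) / n
    # Population SD (divide by n) so that Arms^2 = bias^2 + SD^2 holds exactly.
    sd = math.sqrt(sum((d - mean_bias) ** 2 for d in diffs) / n)
    # Root-mean-square of the raw differences.
    arms = math.sqrt(sum(d ** 2 for d in diffs) / n)
    return mean_bias, sd, arms

# Hypothetical paired readings, invented for illustration only.
spo2 = [97, 95, 93, 90, 88]
sao2 = [96, 93, 93, 88, 88]
mean_bias, sd, arms = accuracy_summary(spo2, sao2)
print(f"mean bias={mean_bias:.2f}, SD={sd:.2f}, Arms={arms:.2f}")
```

Because Arms folds bias and spread into one number, two devices with the same Arms can differ in the direction and size of their mean bias, which is why mean bias is easier to relate to a decision threshold.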

We identified English language reports of relevant studies through searching (1) Ovid MEDLINE, Ovid Embase and EBSCO CINAHL Plus between the inception of databases and 5 August 2021, updated to 14 December 2021, using the same search strategies (Additional file 2: Box S1); (2) the ClinicalTrials.gov and World Health Organization International Clinical Trials Registry Platform for ongoing studies in August 2021; and (3) the reference lists of retrieved included studies, relevant systematic reviews, and guideline reports. We also contacted authors of key abstracts to request further information about their studies.

Two reviewers (CS and MG, or JH, OH) independently assessed titles and abstracts of the search results for relevance and the full texts of all potentially eligible studies for inclusion, with disagreements resolved through discussion or involving a third reviewer (GN) where necessary.

Data analysis

One reviewer (CS, OH or JH) extracted data from included studies for the items in Additional file 3: Box S2 and assessed the risk of bias of the included studies using an adapted QUADAS-2 (Additional file 4: Box S3) [12]; another reviewer (JH, MG, OH or GN) checked all extractions and assessments. We resolved any disagreements through discussion. Where necessary, we contacted study authors to clarify methods and data, and transformed data into the format needed for analyses, e.g. from reported 95% limits of agreement to standard deviation (SD) [13].
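As an illustration of the kind of transformation mentioned above, the following Python sketch recovers the SD and mean bias from reported 95% limits of agreement. It assumes the limits were computed as mean bias ± 1.96 × SD; the exact method in [13] may differ, and the function names and input values are hypothetical.

```python
def sd_from_limits_of_agreement(lower, upper, z=1.96):
    """Recover the SD of the differences from 95% limits of agreement."""
    if upper <= lower:
        raise ValueError("upper limit must exceed lower limit")
    # The limits span 2 * z standard deviations.
    return (upper - lower) / (2 * z)

def bias_from_limits_of_agreement(lower, upper):
    """The mean bias is the midpoint of symmetric limits of agreement."""
    return (lower + upper) / 2

# Hypothetical reported limits, in percentage points of saturation.
lower, upper = -1.83, 4.05
sd = sd_from_limits_of_agreement(lower, upper)
bias = bias_from_limits_of_agreement(lower, upper)
print(f"bias={bias:.2f}, SD={sd:.2f}")
```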

We pre-specified separate analyses of studies reporting level of skin pigmentation and studies reporting ethnicity. When pooling data for mean bias and its SD across studies, we used the correlated hierarchical effects model with small-sample corrections under the robust variance estimation (RVE) framework. This approach enabled us to include data from single-measure design studies together with multiple dependent effect size estimates from repeated-measures design studies in the meta-analysis, even when the dependence structure is unknown [14, 15]. We used τ², I², the Q statistic and the related χ² test to assess heterogeneity in meta-analysis. There is no established approach to pooling data for Arms and 95% limits of agreement across studies directly. We therefore used the pooled mean bias and the pooled SDs produced by the related meta-analyses, following the British Standards Institution methods to calculate the Arms [10] and Bland and Altman’s methods to calculate the population 95% limits of agreement [16]. Using R (version 4.1.2), we performed RVE meta-analyses and produced forest plots as described in Additional file 5: Box S4. When meta-analysis was not appropriate, we synthesised relevant evidence following the Synthesis Without Meta-analysis in systematic reviews (SWiM) guidance [17].
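The step from pooled mean bias and pooled SD to Arms and limits of agreement can be sketched in Python as follows. This is an illustration under stated assumptions, not the review's R code: it assumes Arms is the square root of bias squared plus SD squared and uses the simple Bland and Altman form, mean bias ± 1.96 × SD, without any between-study variance term. The input values are the approximate figures quoted in the text for high skin pigmentation, used only as an example.

```python
import math

def arms_from_bias_and_sd(mean_bias, sd):
    """Combine mean bias and SD into a single root-mean-square accuracy value."""
    return math.sqrt(mean_bias ** 2 + sd ** 2)

def limits_of_agreement(mean_bias, sd, z=1.96):
    """Simple 95% limits of agreement: mean bias +/- z * SD."""
    return mean_bias - z * sd, mean_bias + z * sd

# Illustrative inputs in percentage points of saturation.
pooled_bias = 1.11
pooled_sd = 1.50

arms = arms_from_bias_and_sd(pooled_bias, pooled_sd)        # about 1.87
lower, upper = limits_of_agreement(pooled_bias, pooled_sd)  # about -1.83 and 4.05
print(f"Arms={arms:.2f}, 95% LoA={lower:.2f} to {upper:.2f}")
```

Under these assumptions the Arms stays below 2% even though the upper limit of agreement is about 4 percentage points, which shows how a summary accuracy value can mask a wide spread of individual readings.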

One reviewer (CS) assessed the certainty of evidence on mean bias using the GRADE approach developed for the test accuracy topic, checked by another reviewer (GN) [18, 19]. Using this approach, the certainty of mean bias findings could be assessed as high, moderate, low or very low. In interpreting review findings, we used the British Standards Institution-recommended thresholds described in Additional file 1: Table S1 to judge the accuracy of pulse oximetry [10]. With mean bias as the primary outcome, any pooled mean bias of > 0% would indicate overestimation with pulse oximetry and a risk of missing hypoxemia, whilst a mean bias of < 0% (indicating underestimation) risks over-treatment. Because pulse oximeter devices commonly display integer percentages, we rounded pooled estimates to integers when interpreting the related findings, for example rounding mean bias values within ± 0.50 to 0%.
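The interpretation rule described above can be written as a short Python sketch. The function name and labels are hypothetical. Python's built-in round sends ties to the nearest even integer, so ± 0.50 rounds to 0 as stated in the text; how other exact ties (such as 1.50) were handled is not specified in the review and is an assumption here.

```python
def interpret_mean_bias(pooled_bias):
    """Round a pooled mean bias (percentage points) to an integer and label its direction."""
    rounded = round(pooled_bias)  # ties go to the even integer, so +/-0.50 -> 0
    if rounded > 0:
        direction = "overestimation"
    elif rounded < 0:
        direction = "underestimation"
    else:
        direction = "no systematic bias at integer resolution"
    return rounded, direction

# Pooled mean bias values quoted in the Results text.
for value in (1.11, -0.35, 1.52, 0.31):
    print(value, interpret_mean_bias(value))
```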

We analysed data on pulse oximeters of different brands/manufacturers separately where possible. We undertook pre-planned sensitivity analyses through (1) excluding studies where all participants had similar skin pigmentation or the same ethnicity, (2) excluding studies with no data available for meta-analysis without transformation, and (3) excluding studies at high overall risk of bias. We undertook post hoc sensitivity analysis by excluding studies that used descriptors of ethnicity to indicate levels of skin pigmentation. We assessed publication bias following a qualitative approach given funnel plots or Egger’s tests were not considered appropriate for this review [20].

Results

Study selection and characteristics

We assessed titles and abstracts of 9920 records identified from electronic databases, 152 from trial registries, and 14 records identified by screening the reference lists of relevant publications. Of these records, we identified 33 publications of 32 studies—published between 1985 and 2021—as eligible for inclusion (Fig. 1) [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53]. We identified one ongoing study from electronic searches [54]. We received raw or study-level summary data for two studies directly from study authors [29, 41].

Fig. 1 The study selection flowchart. This flowchart shows the number of records and studies at each stage of the study selection process

Table 1 summarises included studies, with more details in Additional file 6: Table S2. The 32 studies (6505 participants) reported SpO2-SaO2 comparison evaluations of 54 different pulse oximeters (26 manufacturers) against standard SaO2 (Additional file 7: Table S3). Of the 32 studies, 16 (50%) reported the ranges of SaO2 over which the accuracy of pulse oximeters was evaluated: the minimum values of these ranges had a median of 76% whilst the maximum values had a median of 100%. Of the 16 studies, four had SaO2 ranges that were in line with the recommended range of 70 to 100%; eight had narrower ranges such as 80 or 90 to 100%; and four had wider ranges such as 50 or 60 to 100%.

Table 1 Summary characteristics of the included studies

Assessment results of risk of bias and applicability

Using QUADAS-2, we considered 14/32 studies (43.75%) to be at unclear risk of bias for all four domains or high risk of bias for at least one domain, and the remaining 18 (56.25%) to be at low risk of bias for at least one of the four domains (Fig. 2).

Fig. 2 Risk of bias assessment results. The left section of this figure shows risk of bias judgements for each domain of the QUADAS-2 tool for each study and the right section shows applicability judgements for each concern domain of the QUADAS-2 tool for each study. Please see Additional file 4: Box S3 for all signalling questions used in the QUADAS-2 assessment and further considerations

Key issues that led to downgrading for risk of bias were as follows: (1) for the patient selection domain, where specific sub-populations were inappropriately excluded from a study, or where selection criteria were unclear or not stated (19 studies); (2) for index test and reference standard domains, where there was no blinding information for either pulse oximetry SpO2 measurements (20 studies) or CO-oximeter SaO2 readings (30 studies); and (3) for the flow and timing domain, where the time intervals between SpO2 readings and the arterial blood sampling for SaO2 measurement were too long or participants were excluded from the analysis without rationale (2 studies).

We judged the applicability concern as high for one study, moderate for 13 studies, and low in terms of all three applicability considerations for the remaining 18 studies. Applicability concerns largely resulted from the lack of detail about the pulse oximeters being evaluated, CO-oximeter devices used, and/or arterial blood sampling procedures, meaning the study would be hard to reproduce.

Pulse oximetry accuracy by levels of skin pigmentation

Fifteen of the 32 studies (1800 participants) reported accuracy by level of skin pigmentation [22, 24, 25, 27,28,29,30,31,32,33,34, 41,42,43,44, 53]. Eight of these studies (1297 participants) had available data and were included in the meta-analyses [22, 24, 25, 27, 29, 30, 41, 42, 53]. Additional file 8: Table S4 presents the mapping of originally reported terms of skin pigmentation into ‘low’, ‘medium’ or ‘high’ pigmentation categories. The remaining seven studies (503 participants) were excluded from meta-analysis due to lack of mean bias data by levels of skin pigmentation (Additional file 9: Table S5). Table 2 presents pooled accuracy data. Further details and GRADE assessment results are in Additional files 10, 11, 12 and 13: Figures S1-S3 and Table S6.

Table 2 Result summaries of meta-analysis for levels of skin pigmentation and ethnic groups

Hospital-based pulse oximetry probably overestimates oxygen saturation for people with high levels of skin pigmentation compared with standard SaO2 (8 studies, 24 comparisons, 3270 SpO2-SaO2 pairs from 221 participants): pooled mean bias 1.11% (95% CI 0.29 to 1.93%), moderate-certainty evidence. This means that, on average, pulse oximetry probably overestimates blood oxygen saturation by approximately 1%, but the overestimation may be as low as 0.29% or as high as 1.93%. The evidence for people with medium skin pigmentation is uncertain (very low certainty evidence). The evidence for people with low levels of skin pigmentation does not suggest clinically important systematic bias (pooled mean bias −0.35%, 95% CI −1.36 to 0.67%), but the finding is of low certainty. For all the levels of skin pigmentation, the Arms values are around 2% or lower (95% CI non-estimable), and the pooled SD values are around 1.50% on average (Table 2). This means that, for people with any level of skin pigmentation, about 68% of their pulse oximetry readings would be within ± 2% of the CO-oximetry readings, with one SD indicating a variation of ± 1.50% around the mean bias. We tested the sensitivity of the findings: Arms and SD values were generally consistent but there was increased uncertainty for mean bias findings. Additional file 14: Figure S4 presents evidence for different types of pulse oximeter: overall, most devices slightly overestimated oxygen saturation in people with high levels of skin pigmentation, with imprecision around estimates.

Pulse oximetry accuracy by ethnicity

Twenty-two of the 32 studies (4910 participants) described participants by ethnicity rather than level of skin pigmentation [21, 23, 24, 26, 28, 29, 31, 35,36,37,38,39,40, 45,46,47,48,49,50,51,52,53]. We included 14 studies (3510 participants) in meta-analyses [21, 23, 24, 29, 35,36,37, 39, 40, 49,50,51,52,53]; the remaining eight (1400 participants) did not contribute to meta-analysis (Additional file 15: Table S7). Pooled data are shown in Table 2 (further data are reported in Additional files 16, 17 and 18: Figures S5-S7, and Additional file 13: Table S6). Oxygen saturation measured for people described in study reports as Black or African American may be overestimated using hospital pulse oximetry compared with standard SaO2 readings: mean bias 1.52% (95% CI 0.95 to 2.09%), low-certainty evidence. The 95% confidence interval of this estimate ranges between an overestimation of 1 and 2%. The evidence for people described in studies as Asian, Hispanic or of mixed ethnicity does not indicate a clinically important systematic bias (mean bias 0.31%, 95% CI 0.09 to 0.54%), but it is of low certainty. The evidence is uncertain for groups described in papers as White/Caucasian, meaning further research is likely to alter findings (very low certainty evidence). The Arms values are around 2% or lower (95% CI non-estimable) for all these subgroups, and the pooled SD values are around 1.50% on average (Table 2). We tested the sensitivity of the findings: Arms and SD values were generally consistent but there was increased uncertainty for mean bias findings. Additional file 19: Figure S8 presents evidence for each type of pulse oximeter evaluated: overall, most devices overestimated oxygen saturation in people described as Black or African American.

Discussion

Summary of findings

This review suggests that for people with high levels of skin pigmentation and people described in studies as Black or African American, oxygen saturation may be overestimated by pulse oximetry in hospital compared with gold standard SaO2. For people with other levels of skin pigmentation, oxygen saturation is less likely to be overestimated, but the evidence is uncertain. These results are for clinician-measured oximetry in controlled clinical environments and do not necessarily reflect the measurement bias of home pulse oximetry by patients or carers. The low certainty for much of the data presented means that further research could overturn these conclusions. For all the subgroups of populations evaluated, whilst the degree of mean bias is small or negligible over the ranges of SaO2 reported (median minimum value of 76% and maximum value of 100%), pulse oximetry readings appear unacceptably imprecise (pooled SDs > the recommended criterion of 1%) [10, 55]. Nevertheless, when the extents of measurement bias and precision are considered jointly in Arms, pulse oximetry measurements for all the subgroups appear acceptably accurate (with Arms < the internationally recommended threshold of 4% [10, 55], or even the more conservative threshold of 3% in the US FDA guidance) [56].

Evidence in context

Our findings have several implications. Even though our estimates suggest that the internationally recommended thresholds were met in terms of measurement bias [10, 55], the relatively small amount of mean bias identified could affect clinical decision-making at threshold values for diagnosis of hypoxemia. Overestimation could lead to clinically important hypoxemia remaining undetected and untreated. Underestimated SpO2 readings could also be harmful, resulting in unnecessary treatment with oxygen (and the risk of hyperoxemia) and wider impacts such as delayed hospital discharge. Two recent diagnostic studies provide evidence on clinical implications resulting from the bias in pulse oximetry for blood oxygen saturation levels [7, 57]. In these studies, people described as Black had a higher risk of ‘occult hypoxemia that was not detected by pulse oximetry’ compared with those described as White [7]. This may suggest that even small amounts of mean bias, when at the margins of diagnostic thresholds, could have an impact on diagnostic accuracy. These impacts could be explored further via evidence synthesis of diagnostic accuracy (classification) studies to assess the clinical implications of measurement bias in relation to clinical decision-making thresholds. The amount of bias identified for people from ethnic groups such as Asian, Hispanic or mixed ethnicity appears negligible, although the certainty of the evidence is low. In terms of COVID-19 management, the 2021 WHO living guidance recommends using pulse oximetry monitoring at home as part of a care package for symptomatic people in community settings but does not note the potential impact of level of skin pigmentation [3]. Our findings indicate that sub-population specific recommendations would be needed in future updates.
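The argument that a small mean bias matters most near a decision threshold can be illustrated with a hypothetical calculation in Python. The sketch assumes that a single reading equals the true SaO2 plus the mean bias plus normally distributed error with the pooled SD, and it ignores the integer display of real devices. The threshold of 92%, the true SaO2 of 91% and the function name are illustrative assumptions, not results of this review.

```python
import math

def prob_reading_above(threshold, true_sao2, mean_bias, sd):
    """P(SpO2 reading > threshold) if reading ~ Normal(true_sao2 + mean_bias, sd)."""
    z = (threshold - (true_sao2 + mean_bias)) / sd
    # Standard normal survival function via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

threshold = 92.0  # hypothetical decision threshold (% saturation)
true_sao2 = 91.0  # hypothetical true arterial saturation, just below the threshold

p_unbiased = prob_reading_above(threshold, true_sao2, mean_bias=0.0, sd=1.5)
p_biased = prob_reading_above(threshold, true_sao2, mean_bias=1.11, sd=1.5)
print(f"P(reading above threshold): unbiased={p_unbiased:.2f}, biased={p_biased:.2f}")
```

Under these assumptions, the chance that a patient with a true saturation of 91% reads above 92% rises from roughly a quarter to roughly a half when a mean bias of about 1% is added, although the real size of the effect depends on the actual error distribution.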

It is interesting to note that, despite the clinically important mean bias and unacceptably large imprecision identified, the calculated Arms values are generally around 2% or less over the ranges of SaO2 reported; that is, the Arms values are far below the Arms threshold of 4% required by the current international and UK standards [10, 55]. The current standards do not cite the evidence sources used to underpin such requirements, but the specified values of mean bias (SD for precision) (2% (± 1%)) are consistent with the outdated 1995 Jensen review results [8]. These suggested values are even larger than the average values of our estimates (1% (± 1.5%)) in people with darker skin. Given this, currently recommended thresholds may need re-evaluation, and use of the more conservative criterion of 3% applied by the US FDA guidance may have merit [56].

Findings also support calls for better calibration of the algorithms used in oximeter device software to address possible measurement bias. Manufacturers should ensure, and demonstrate, that their pulse oximeters are accurate for all levels of skin pigmentation. The results of this review offer some insights into the possible amount of bias to consider. Recalibration may, however, be complex, and future work could consider a more immediate approach in which clinical pathways recognise the potential impact of small overestimations in people with darker skin.

The evidence identified has limitations in its completeness and applicability. Firstly, pulse oximetry is widely used in clinical practice and was promoted for home use during the COVID-19 pandemic [2]. Many factors could theoretically affect pulse oximetry accuracy in the real world, such as the type of pulse oximeter probe, comorbidities, movement, age of the patient and the range of SaO2 levels [8]. However, most included studies in this review were based in hospital settings and had limited information on whether the pulse oximeters evaluated were appropriate for home self-monitoring. This review only addresses skin pigmentation and ethnicity. Therefore, little is known about pulse oximetry undertaken by untrained people at home, where other factors such as movement need to be considered. Secondly, pulse oximeters have been developed and upgraded since the 1970s. The included studies were published between 1985 and 2021 and some of the older studies may have used discontinued devices. Nevertheless, the overestimation of oxygen saturation for darker skin appears consistent in general across most devices evaluated. To keep the evidence in this review complete, we included study data for all pulse oximeter devices evaluated.

Strengths and limitations of this review

Before this review, our scoping exercise using a simple search of Ovid Medline with ‘pulse oximetry’ terms identified one systematic review in this area published by Jensen and colleagues in 1995 [8]. It evaluated the overall accuracy of pulse oximetry and explored possible factors that affected the accuracy. It included only one study with data on the impact of skin pigmentation, and findings were inconclusive. The comparators used for pulse oximetry measures in the Jensen review are reference measures of SaO2 such as PaO2, calculated SaO2 and %O2Hb that are now considered incorrect or outdated. We also identified a recent rapid review by the NHS Race and Health Observatory that had an unclear methodology [6]. In this rapid review, a summary of narrative findings suggested the overestimation of blood oxygen saturation levels in people with darker skin. Of the nine studies identified in this rapid review, seven had appropriate SpO2–SaO2 comparison data but the other two used inappropriate designs for the question being addressed.

Following prespecified methods to minimise the risk of bias in the review process, this review has important strengths. Our search was comprehensive and identified more studies than previous reviews. We used the gold standard CO-oximetry as the comparator for pulse oximetry, and accuracy outcomes as recommended in the British Standards Institution standards for pulse oximetry. We developed a correlated hierarchical effects model and used the novel RVE approaches to meta-analyse not only independent data (of 11 studies) but also data from studies (n = 21) with repeated-measures design [15]. This approach deals with correlations of multiple effect size estimates within a repeated-measures design study [14, 15].

This review has some limitations. Firstly, some included studies compared SpO2-SaO2 bias data between different subgroups of skin pigmentation or ethnicity and presented only the results of significance tests, rather than SpO2 and SaO2 data per se at each subgroup level. At least two studies used a diagnostic accuracy design and only presented proportions of participants with specific ranges of SpO2 in relation to specific SaO2 values, again rather than SpO2 and SaO2 data [7, 57]. We contacted authors of these studies to request relevant data and received data for two studies [29, 41]. If more data had been received, the review results could have changed.

Secondly, we are aware of the difference between the concepts of race and ethnicity. For simplicity, we chose to use the term ‘ethnicity’ throughout this review, given that race and ethnicity are context/country-specific concepts and there is no globally accepted classification approach to distinguishing them [58]. If we had treated race and ethnicity data separately, the evidence base would change; however, we would not expect the overall conclusion to change. We also acknowledge the limitation of using scales like the Fitzpatrick scale to measure levels of skin pigmentation [59]. Such scales are criticised as being too blunt, an issue that affects the findings of this review and should be considered in future research.

Thirdly, we did not consider the differences between specific pulse oximeter devices, the differences between children and adults and their health conditions, or the differences between skin pigmentation measurement methods. Regarding the pulse oximeters evaluated, there may be differences between devices for use by health professionals in hospitals and those for home self-monitoring. Because of these differences, meta-analyses in this review demonstrated between-study heterogeneity (Table 2). However, we found that, across the devices evaluated and types of participants, included studies were largely consistent in suggesting that pulse oximetry overestimates oxygen saturation. We therefore chose to pool study data, without undertaking further subgroup analyses for these differences.

Fourthly, we only searched for English language peer-reviewed publications, without considering preprints. However, there are probably no major differences between summary treatment effects in English-language restricted meta-analyses and other language-inclusive meta-analyses [60], and the exclusion of non-English language publications from systematic reviews has been found to have no impact on overall findings [61]. We considered possible publication bias in assessing the certainty of evidence using the GRADE approach.

Finally, no available approach to risk of bias and GRADE assessment is specific to the topic of this review. We were only able to use the relevant approaches developed for the test accuracy topic, and the GRADE approach used was only applicable to assess the certainty of evidence for mean bias, rather than precision, Arms and limits of agreement.

Conclusions

Pulse oximetry may overestimate blood oxygen saturation levels for people with dark skin in hospital settings compared with gold standard SaO2 measures. The evidence for the measurement bias identified for other levels of skin pigmentation or ethnicities is more uncertain. Whilst the extent of measurement bias and overall accuracy meet current international thresholds, the variation of pulse oximetry measurements appears unacceptably wide. Such a small overestimation may be crucial for some patients, particularly at the thresholds that inform clinical decision-making.