Introduction

Technological advances have made non-invasive medical devices (e.g., pulse oximetry, heart rate monitors, artificial intelligence-based diagnostics) irreplaceable aspects of modern patient care. However, evidence shows many of these devices are susceptible to skin tone bias, which can worsen disparities in outcomes1. Modalities relying on transcutaneous measurements may produce bias due to skin tone variation and device validation on non-diverse populations, limiting device performance and overall generalizability1,2.

Recent evidence shows current racial bias in pulse oximetry. A retrospective analysis revealed a nearly threefold higher frequency of undetected hypoxemia in Black patients (17%; 95% CI 12.2–23.3) compared with white patients (6.2%; 95% CI 5.4–7.1)3. Although results are mixed4,5, several studies show overestimation of arterial oxygen saturation by 0.17–10% in darkly pigmented subjects, especially at lower SpO2 values6,7,8. Furthermore, darker skin tone is associated with a larger bias of undetected hypoxemia9.

Artificial Intelligence (AI) image classification is an emerging non-invasive tool that aims to improve diagnostic accuracy in medicine10, including classification of skin lesions10,11,12. However, a growing body of literature recognizes bias in the performance of these models, which perform worse on individuals with darker skin tones13. Daneshjou et al. showed that three state-of-the-art algorithms had performance decreases on darker skin types (FST V, VI) compared to lighter skin types (FST I, II) (ModelDerm: 0.55 vs 0.64; DeepDerm: 0.50 vs 0.61; HAM10000: 0.57 vs 0.72)14. Groh et al. further demonstrated that models are most accurate for skin types they were trained on, although some studies report no differences in model performance by skin tone15. The effect of skin tone on model performance is largely underreported: only 10% (7/70) of deep learning algorithms include information about skin tone16 and few report performance by skin tone categories17. Further, there is no gold standard for skin tone labeling, and commonly used practices like estimated Fitzpatrick Skin Type are limited by uncertainty18 and lack of inclusiveness. (ref. 11; ref. 12; ref. 10; ref. 19)

To improve generalizability of assessments and algorithm fairness, it is critical that patient skin tone variation be taken into account in validation studies of noninvasive technologies. There are initiatives to modify device monitoring regulation criteria, such as those released by the FDA in November 202320. In this review, we will (1) briefly review the background and methods for skin tone measurement within health care and then (2) provide detailed study considerations for measuring skin tone in prospective trials.

Results

Part I. Review of skin tone assessment

Defining skin tone

Color is the perception of light-based characteristics, such as hue, lightness, and saturation. Physiologically, the inherent color of the skin, defined as “skin tone”, is the result of light-absorbing compounds called chromophores. The most abundant chromophores in humans are melanin (pheomelanin and eumelanin), carotene, oxygenated hemoglobin, and reduced hemoglobin21. In general, the two major contributors to skin tone are melanin, which produces a brown tint, and hemoglobin, which creates red and purple-blue hues22. Common methods for discriminating skin tone for the purpose of validating noninvasive assessment tools are administered visual scales and color measurement tools23 (Table 1). Skin tone can also be extracted from camera images through a variety of techniques24. For the purposes of this paper, automated skin color extraction, modeling, and labeling will not be discussed23,25,26.

Administered visual scales

Administered visual scales, such as Von Luschan, Monk, and Pantone, utilize numbered colored tiles that are matched to a person’s skin tone (Table 1). Fitzpatrick Skin Type (FST) was originally developed to assess tanning and burning propensity; however, many use FST as a proxy for skin tone27 despite evidence showing that FST is poorly correlated with objective measurements of skin color27,28. Although widely available and inexpensive to administer, visual scales can be limited by subjectivity. Furthermore, visual scales can be affected by the complexity of human color perception, which is influenced by light, anatomic site, the context of the object, or a person’s unique experiences with similar objects29,30,31,32.

Table 1 Descriptions of common skin tone measurement tools. This is a nonexclusive list

Color measurement tools

Color measurements are objective measurements achieved through reflectance spectrophotometry (Konica Minolta CM700D, Variable Spectro) and colorimetry (Delfin SkinColorCatch) (Table 1). In recent work, color measurement tools have been adopted in response to the limitations of visual scales, providing objective measurements that increase precision in color quantification33,34,35. Color measurements offer greater color precision, but the tools are expensive and sensitive to environmental influences25,26.

Cameras and color spaces

Several color models provide a framework to systematically or mathematically describe color output. One of the most common, the RGB model (red, green, blue), was developed to mimic the primary colors perceived by the eye32. It encodes color in an additive fashion, where a combination of all three colors at full intensity results in white. Other color models include HSL (hue, saturation, lightness), CIELAB (lightness, green-red gradient, blue-yellow gradient), and CIECAM02 (brightness, lightness, chroma, saturation, hue, and “colorfulness”)32,36. Most color reproduction in printed work uses a CMYK (cyan, magenta, yellow, black) color space. Color spaces are specific organizations of colors that are mapped to values in color models in a standardized way. The standardized RGB color space (sRGB) is the most commonly used space for representing digital images on displays26,37.
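To illustrate how values in one color space map to another, the following is a minimal sketch of a conversion from 8-bit sRGB to CIELAB. It assumes the standard sRGB transfer function and a D65 reference white, which are common defaults but are not specified in this review; the function name `srgb_to_lab` is hypothetical, and a validated color-management library would be preferable in an actual study.

```python
def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB values (0-255) to CIELAB, assuming a D65 white point."""

    def linearize(c):
        # Undo the sRGB transfer function ("gamma") to get linear light.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB -> CIE XYZ (D65 primaries).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white (assumption; other illuminants give other values).
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        # Piecewise cube-root function from the CIELAB definition.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16      # L*: lightness
    a_star = 500 * (fx - fy)       # a*: green-red axis
    b_star = 200 * (fy - fz)       # b*: blue-yellow axis
    return lightness, a_star, b_star
```

Under these assumptions, pure white (255, 255, 255) maps to approximately L* = 100 with a* and b* near zero, and pure black maps to L* = 0.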

Part II. Considerations for study design

Body part assessed

A summary of recommendations for skin tone measurement can be found in Table 2. Unaltered skin tone represents a combination of genetic factors and environmental influence based on constitutive (baseline skin color) and facultative (skin color altered by sun exposure) grouping. Constitutive skin color is best characterized in sun-protected areas more likely to represent unaltered baseline pigmentation38. However, one’s perceived skin tone may also be influenced by exogenous factors including artificial tanner, makeup, or tattoo pigmentation. Depending on the technology being validated, the inclusion of at least one constitutive skin site may be important given its decreased variability across seasons and increased correlation with skin phototype39. The upper volar arm has been proposed as a reliable measurement site for constitutive skin color given its low seasonal variability and ease of access40. Otherwise, body part selection may be predetermined based on the application (e.g., using the finger/earlobe in pulse oximetry).

Table 2 Considerations for skin tone measurement in prospective research studies

Underlying conditions that can affect skin tone

Several conditions can affect the relative concentration and distribution of chromophores and alter skin tone assessment. Therefore, study designs incorporating skin tone measurement should consider pigmentary disorders (e.g., vitiligo or melasma) and medical conditions (e.g., anemia and jaundice) that influence skin tone. Perfusion-related changes in skin (e.g., flushing, blanching) can also affect skin tone assessment34. To minimize these effects, it is recommended that skin measurements occur in a pressure-free state and at rest, and that as much information as feasible be collected about factors that influence skin tone at the time of measurement.

Ambient lighting

The impact of lighting on the perception of color is critical and commonly overlooked in study design. Ambient lighting varies in properties such as brightness and color temperature. It can influence color perception based on time of day and location and may skew skin tone perception, making skin appear lighter or darker41. Ambient lighting should be both sufficient and standardized to increase the accuracy and precision of skin tone assessment. To avoid the variability of daylight conditions, artificial lighting with a color temperature similar to natural light (5000–6500 K) could be helpful42. A controlled illumination source, combined with an ambient light-blocking feature, can significantly enhance light isolation and improve the signal-to-noise ratio43.

Location

Considerations for skin tone assessment depend in part on the location of the patient population under study.

For example, an outpatient clinic may be a single location where ambient lighting and temperature may be more easily standardized. Patients are often mobile, making it more feasible to incorporate skin tone measurements on less accessible sun-protected body parts (e.g., lower back). Such measurements may be difficult when collecting remote photos from a patient’s home, where lighting may not be standardized and the number of body parts available for measurement may be limited. Additionally, longitudinal study designs may need to account for patients’ variable sun exposure. To address this, previous studies have included non-sun-exposed body parts, advised participants to avoid sun exposure, and/or required daily use of sunscreen44.

In contrast, an inpatient population presents complications when attempting to achieve a more fixed environment for data collection and synchronization of measurements. A more fixed environment for data collection can potentially be improved by understanding the unique workflow of standard care and adjusting each patient’s room to mimic a standardized environment. Although dependent on study design, study procedures may need to take into consideration patient health status, iatrogenic complications, patient mobility, and other external factors that could potentially hinder temporal aspects that are essential for the completion of measurements. For instance, in the context of pulse oximetry, short timepoints for data collection may be needed to minimize potential discrepancies between arterial blood gas (ABG)-pulse oximetry measurements and skin tone readings45.

Dataset balance—skin tone and race

Clinical research of all kinds must incorporate racial and ethnic diversity to ensure results are generalizable, especially when evaluating devices for clinical use46,47. Underrepresentation of minority groups may lead to a higher risk of adverse reactions or reduced efficacy. In fact, the NIH has issued a policy and guidelines requiring all phase III clinical trials to ensure analysis by sex/gender, race, and/or ethnicity48. However, multiple studies have demonstrated large variations in skin tone within racial and ethnic subgroups49, and skin tone may directly influence bias of technologies beyond race. This highlights the potential need to balance datasets specifically by skin tone in addition to race. Ensuring dataset balance by skin tone may pose several challenges. Since skin tone varies within a person and across time, one may consider balancing only by constitutive body site or averaging skin tone from multiple body sites. Both skin tone and racial/ethnic dataset variation will enhance the generalizability of research results. However, investigators may consider balancing by either skin tone or race with a minimum threshold of other parameters based on their research question. The Food and Drug Administration has begun soliciting input on methods to integrate skin tone measures into clinical device studies, but a standard has not been established50.
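As a rough illustration of checking dataset balance, the sketch below tallies participants per skin tone category and flags categories that fall below a minimum share of the dataset. The function name `group_balance`, the category labels, and the `min_fraction` threshold are all hypothetical; no specific threshold is recommended in this review, and the appropriate value would depend on the research question.

```python
from collections import Counter


def group_balance(labels, min_fraction=0.1):
    """Summarize how participants are distributed across skin tone categories.

    labels: one category label per participant (e.g., a visual scale bin).
    min_fraction: hypothetical minimum share of the dataset per category.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    counts = Counter(labels)
    summary = {}
    for group, n in counts.items():
        fraction = n / total
        summary[group] = {
            "n": n,
            "fraction": fraction,
            # Flag categories that fall short of the chosen minimum share.
            "below_minimum": fraction < min_fraction,
        }
    return summary
```

The same tally could be run separately for race/ethnicity and for skin tone to see whether balancing on one leaves the other underrepresented.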

Considerations for administered visual scales

Visual scales are low-cost tools that can distinguish skin tones with relatively high reliability, require minimal training, are widely available, and can be utilized in various forms of analyses (retrospective, prospective, and post-hoc). Limitations of these scales are that they are influenced by user perception (e.g., color blindness) and environmental conditions (e.g., ambient lighting). There are also several visual scales available which can make comparison of skin tone data across studies difficult.

Few studies have assessed the relative utility of visual scales. FST was designed as a questionnaire to determine tanning and burning propensity, but when used as a proxy for skin tone, it is only weakly associated with a visual color scale (p < 0.0001)51. There have been attempts to create RGB-defined FST visual scales, although these have not been widely adopted52. The Von Luschan scale (VLS) has been shown to be highly correlated with narrow band spectrophotometry, with one study showing the correlation of VLS and Melanin + erythema index to be 0.90 (p < 0.001)53. The scales with more levels (Pantone, Taylor Hyperpigmentation) offer greater granularity and shade range for skin tone assessment, but may be more challenging to reliably administer. For the Pantone scale, one study investigating vascularized allotransplantation matching found inter-rater skin tone assessment to be fair (k = 0.454) and intra-rater reliability to be substantial (k = 0.725)54. A newer, 10-point Monk scale shows high reliability for crowdsourced annotators (ICC 0.86–0.94), but has not been tested in a medical context55. Price may also be a consideration when choosing a scale. While the Monk scale is freely available and free to use and the FST questionnaire is easily accessible online, the Pantone scale is only available for purchase.
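Agreement statistics such as those reported above can be computed directly from paired ratings. The following is a minimal sketch of an unweighted Cohen's kappa for two raters assigning visual scale categories to the same participants. It is an assumption that this variant matches the k values in the cited studies, which may have used weighted or multi-rater statistics; the function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same participants."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of participants given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)
```

Because unweighted kappa treats every disagreement equally, a one-step difference on an ordinal scale counts the same as a large one; a weighted kappa may be more appropriate for scales with many levels.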

Considerations for color measurement tools

Colorimetric and spectrophotometric devices are used in a wide range of study designs to assess skin tone by quantifying melanin, erythema, and overall skin pigmentation. Common reflectance colorimetric and spectrophotometric devices are composed of an illuminator, standard observer, and a tristimulus measurement system. When the instrument is pressed gently to the skin, its illuminator applies a fixed light source to the surface, and specific wavelengths are then isolated to obtain color details without the influence of outside lighting conditions. Colorimetric and spectrophotometric devices are easily operated, non-invasive tools for obtaining objective skin tone measurements. In varying clinical settings, handheld colorimetric and spectrophotometric devices (e.g., Delfin SkinColorCatch, Konica Minolta, and Variable Spectro) may be easier to use to assess body parts while also maintaining patient comfort. Although the devices demonstrate moderate to high interobserver reliability, they are potentially high-cost, and most have not had large-scale published validation. A few studies have attempted to compare the utility of objective color measurement tools, but were limited in scope33,56,57. Consequently, no particular type of colorimetric/spectrophotometric device has been proven to be superior33.

Considerations for cameras

Cameras are widely used and are carried by almost everyone a patient interacts with in medical practice. There are also large datasets of images that have been acquired to train AI algorithms and other applications11,58. However, it can be challenging to extract skin tone information from a photograph alone. Camera type, export compression level, and lighting are critical to consider59. One of the most important modifiable factors is the white balance, which affects the relationship between red, green, and blue pixel values60. The use of cross-polarizing filters can help reduce specular reflection61 and improve skin tone evaluation, especially in darker-toned individuals43. Reference color charts or color calibration cards (e.g., X-Rite, Douglas, DSC Labs, QPcard, Macbeth) can be used within the frame of the image or before/after in identical conditions to improve reproducibility across devices, but will not be available for retrospectively captured images59,60. After acquisition, proper image processing is necessary to maintain color accuracy across media. Although popular image formats like JPEG increase computational efficiency, image compression can introduce artifacts and discard image parameter information. Therefore, RAW image format may be helpful for maintaining color consistency, although large file size may limit its utility, and many photography devices (e.g., phones) cannot acquire in RAW format32.
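To make the role of white balance and a reference card concrete, the sketch below derives per-channel gains from pixels sampled on a neutral gray patch and applies them to other pixels in the same image. This is one simple approach among several and is not taken from the cited studies. It assumes the patch is truly neutral and that pixel values are approximately linear; for gamma-encoded images such as typical JPEGs, values would ideally be linearized first. The function names `gray_patch_gains` and `apply_gains` are hypothetical.

```python
def gray_patch_gains(patch_pixels):
    """Per-channel gains that make the mean of a gray reference patch neutral.

    patch_pixels: list of (R, G, B) tuples sampled from the gray patch.
    """
    if not patch_pixels:
        raise ValueError("patch_pixels must not be empty")
    n = len(patch_pixels)
    means = [sum(p[ch] for p in patch_pixels) / n for ch in range(3)]
    if min(means) <= 0:
        raise ValueError("each channel of the patch must have a positive mean")
    # Keep overall brightness by targeting the average of the channel means.
    target = sum(means) / 3
    return tuple(target / m for m in means)


def apply_gains(pixels, gains, max_value=255):
    """Scale each channel by its gain, rounding and clipping to max_value."""
    return [
        tuple(min(max_value, round(v * g)) for v, g in zip(p, gains))
        for p in pixels
    ]
```

For example, a patch that reads (200, 180, 160) on average yields gains of 0.9, 1.0, and 1.125, so the corrected patch becomes (180, 180, 180); the same gains would then be applied to the skin region of that image.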

Scale/device reliability

The process of evaluating device bias against skin tone measurement is nascent. The portability and low cost of visual scales will need to be balanced against the potentially greater accuracy of color measurement technologies, which also provide continuous measurements rather than categorical bins. Having at least two raters use visual scales and conducting triplicate readings for color measurement tools will increase color precision. When possible, comparison between multiple color measurement instruments will be valuable to the specific study and field.
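As a small illustration of how triplicate readings might be summarized, the sketch below computes the per-channel mean and sample standard deviation of repeated readings taken at one body site. It assumes each reading is a three-value tuple such as CIELAB (L*, a*, b*), which is an assumption about device output rather than something specified in this review; the function name `summarize_readings` is hypothetical.

```python
from statistics import mean, stdev


def summarize_readings(readings):
    """Per-channel mean and sample SD of repeated color readings at one site.

    readings: list of (L*, a*, b*) tuples, e.g., three readings per site.
    """
    if len(readings) < 2:
        raise ValueError("at least two readings are needed to estimate spread")
    channels = list(zip(*readings))  # regroup values by channel
    means = tuple(mean(ch) for ch in channels)
    spreads = tuple(stdev(ch) for ch in channels)
    return means, spreads
```

A large standard deviation across repeated readings may suggest that probe pressure, placement, or lighting varied between measurements and that the site should be re-measured.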

Discussion

Consideration of skin tone in device validation studies across medicine is important to reduce bias against patients with darker skin tones that exists in pulse oximetry, AI diagnosis, and many other areas of medicine. These biases will worsen existing healthcare disparities unless they are addressed and measured directly. In many cases, race and ethnicity play specific roles in equity-focused medicine, and biased outcomes arise due to these socio-demographic factors62. A considerable amount of research is focused on race and ethnicity, but for technologies that rely on light for measurement, bias may be specifically related to skin tone. This highlights the need for increased awareness of limitations of current medical devices associated with systematic error and pronounced inaccuracies among patients with a darker skin tone.

In this review, we highlight common tools used for skin tone measurement and discuss pertinent study design considerations for accurate skin tone assessment. There is no current gold standard tool, as each possesses relative pros and cons, and validation is largely absent. Visual scales are more readily available for prospective or post-hoc analysis, but may be influenced by user perception, while color measurement tools offer objective, sensitive measurements but can be expensive with variable reliability. In addition to tool choice, investigators should consider how patient-level factors may affect the validity of skin tone measurement, including selection of a body site, consideration for medical conditions affecting skin tone, and minimization of perfusion-related color changes. Furthermore, creation of a standardized environment with consistent lighting and camera settings will promote improved color consistency. This paper is a narrative review and therefore results are limited by the non-systematic approach. Further, the study is not powered to directly compare the utility of skin tone assessment modalities or quantify the potential effect of study design parameters on skin tone accuracy.

When prospectively evaluating devices that may be influenced by skin tone, incorporation of skin tone measurement will play an important role in considering these potential biases. The current review offers researchers a tool to aid in development of skin tone assessment protocols. We encourage researchers to continue to focus on validating devices against a diverse and representative dataset, and when possible, to make public the skin tone measurements for future use and calibration.

In conclusion, increasing evidence shows bias and increased error in noninvasive tools across medicine in patients with darker skin tones. We provide guidance and considerations for conducting skin tone assessments using administered scales (e.g., Fitzpatrick, Pantone, Monk) and color measurement tools (colorimeters, spectrophotometers), and we encourage device validation to include at least one color measurement tool. As investigators increasingly consider skin tone as a variable in future work, we will be able to reduce skin tone biases in medical devices.