Key Points

Wearable technologies are widely used in healthcare and wellness settings, offering continuous monitoring of various biometric parameters such as heart rate, physical activity, sleep patterns and more.

Many studies have investigated the accuracy of consumer wearables against established criterion standards. However, the results are varied due to differences in methodologies, devices, and targeted biometric outcomes, complicating the effort to develop a unified understanding of wearable accuracy.

We estimate that approximately 11% of the 310 consumer wearables that have been released to date have been validated for at least one biometric outcome; only approximately 3.5% of the potential device and outcome combinations have been validated.

This research uncovers significant variability in device accuracy and measured outcomes, underscoring inconsistencies. It emphasizes the need for a standardized, rigorous, and adaptable validation protocol in the rapidly evolving field of wearable technologies.

By offering a ‘living’ review, this study aims to maintain an up-to-date representation of the wearable technology validation landscape, facilitating informed decision-making for researchers, clinicians and consumers navigating this domain.

1 Introduction

Consumer wearable devices—such as watches, wristbands, pendants, glasses, armbands and other accessories—are rapidly permeating various aspects of daily life, becoming ubiquitous tools for monitoring, assessing and enhancing human behaviour and health [1]. These devices encapsulate a wide array of sensors and software, facilitating the continuous collection of individualised data including physical activity, heart rate, sleep patterns and even mood states [2]. The increasing pervasiveness of wearable technology is demonstrated by its global market size, which is expected to reach US$186.14 billion by 2030, expanding at a compound annual growth rate of 14.6% from 2023 to 2030 [3]. The American College of Sports Medicine (ACSM) also recently identified wearable technologies as the ‘#1 fitness trend’ for 2024 on the basis of a survey of > 4500 health and fitness professionals [4].

Wearables are now heralding a new epoch in several fields of research too; they are generally unobtrusive, cost-effective and comfortable to wear and yield a high level of acceptability among users [5,6,7]. This is reflected in the proliferation of research incorporating wearable technologies for remote data capture. For instance, the Datenspende study by the Robert Koch Institute deployed wearables to tackle the coronavirus disease 2019 (COVID-19) pandemic through anonymous data donations [8], while Perez et al. demonstrated the capacity of the Apple Watch to detect atrial fibrillation [9], sparking discussions on the potential and limitations of these devices among healthcare providers, researchers and media. Kimura et al. investigated associations between steps and sleep patterns, collected with a wristband, and markers of Alzheimer’s disease in older adults [10], while Shilaih et al. utilised wristbands to correlate heart rate and wrist temperature with menstrual cycle phases [11]. Wearables have also started to be incorporated in clinical trials: at the time of writing, there are 58 active trials on clinicaltrials.gov using an Apple Watch, 323 using a Fitbit and 71 using a Garmin device [12].

This research demonstrates how, by allowing for long-term data collection in naturalistic environments, wearables facilitate ecological momentary assessments, and can generate insights into individual health patterns [13, 14]. But despite their potential, the relentless pace of technological advancements and the multifaceted nature of these devices raise pertinent questions about their validity and reliability [15, 16]. For researchers, accurate data are crucial for robust study methodologies, especially when these devices are used to infer health status or predict disease outcomes [16, 17]. For healthcare providers, the validity of data can directly affect clinical decisions, patient care and monitoring, and could ultimately impact health outcomes [18]. For individuals, accurate self-monitoring could potentially help shape health-related behaviours and lifestyle modifications, fostering a sense of empowerment and facilitating personal health management [19].

Numerous factors can impact the validity of data collected by wearable devices. These include variability in the algorithms used by different devices to estimate metrics such as heart rate, sleep or physical activity [20]; user-specific factors such as age, body size or skin tone [21]; and differences in device placement and wear time [22]. Environmental conditions, such as temperature and humidity, can also affect sensor performance [23]. Perhaps most importantly, the dynamic nature of the wearable technology industry—with new devices, software updates and algorithms being continually introduced—necessitates ongoing validation studies [24,25,26,27]. As such, ensuring the accuracy and validity of wearable devices is an ongoing endeavour, one that has been taken on in a number of research initiatives. For instance, the International Federation of Sports Medicine (FIMS) has advocated for a global standard for sport and fitness wearables amidst the rising concerns for quality assurance related to the products [28], while the INTERLIVE network is a joint European initiative focused on developing best-practice recommendations for evaluating the validity of consumer wearables to measure direct and derived metrics [25,26,27].

However, the pace of academic research struggles to keep up with the more agile commercial ecosystem: primary research studies are slowed by the need to secure funding, develop validation protocols, recruit and test participants and navigate the peer review process [16], while systematic reviews and meta-analyses are often out of date by the time they are published [28]. Commercial entities typically release new hardware annually and push software updates multiple times a year, vastly outpacing the academic validation and synthesis cycle [16]. This disparity underlines the need to make the research in this field ‘living’ and continually updated, leveraging methodologies such as living systematic reviews to maintain pace with the rapidly evolving technological landscape [29]. The vision of a ‘living’ body of research in the wearable technology field champions the principles of real-time evidence synthesis, capable of dynamically incorporating new findings as they emerge [30].

The aim of this study is to conduct a systematic review of systematic reviews evaluating the accuracy (i.e. validity and/or reliability) of consumer wearable technologies. This review focuses on consumer wearables rather than research wearables, as these devices have the largest user base and undergo regular updates in both software and hardware. The review is not constrained by setting, reflecting the widespread use of wearables among the general population, athletes and clinical populations. This review will be ‘living’, meaning it will be updated as new systematic reviews are published. Our objectives are as follows: (1) to evaluate what biometric outcomes consumer wearable technologies can measure, including metrics such as heart rate, physical activity, sleep and stress; (2) to synthesise the accuracy (including validity and reliability) of consumer wearable technologies for the outcomes in (1); and (3) to evaluate the methodological quality of the existing systematic reviews on this topic. This approach will allow us to identify potential gaps in the literature and provide recommendations for future research. By doing so, this review will bridge the gap between the rapid pace of wearable technology development and the slower tempo of academic research, ensuring that decision-making in this field is informed by the best available evidence.

2 Methods

2.1 Design

This umbrella review has been written according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines [31]. The protocol was registered on PROSPERO (CRD42023402703) on 18 March 2023.

2.2 Search Strategy

A literature search was conducted on 5 June 2024 across the following databases: MEDLINE via PubMed, Embase, CINAHL and SPORTDiscus via EBSCO. We sought to identify systematic reviews evaluating the accuracy of consumer wearable devices compared with criterion measures. No filters were applied during the search. Authors C.D. and M.B., with the support of our institution’s librarian, devised and tailored the search strategies for each database. The strategy for MEDLINE was developed initially and later adapted for other databases using database-specific terminologies. The full search strategies are detailed in Supplemental File 1.

2.3 Study Selection Strategy

After the literature search, all identified studies were imported into Endnote 21 and subsequently exported to the Covidence systematic review software. Two authors (CD and MB) independently screened all articles through title/abstract screening, full-text screening and data extraction phases. Disagreements between CD and MB were resolved through discussion, and if consensus could not be reached, the supervising author (RA) provided the decisive opinion.

2.4 Eligibility Criteria

Included studies had to satisfy the following criteria: (1) they must be a systematic review and/or meta-analysis, (2) they must include studies that evaluated the validation of consumer wearable devices, (3) the research synthesis in the review must focus on the validity or accuracy of consumer-grade wearables specifically and (4) device validation must be against an accepted reference standard. Consumer wearables were defined as any body-worn device available for public purchase, including those previously available but now discontinued. The primary exclusion criteria were: (1) validation of research-grade or non-consumer wearable devices and (2) validation of wearables against other wearables (i.e. convergent validity only). No restrictions were imposed on study populations, device wearing locations or biometric types, as long as the aforementioned criteria were met.

2.5 Risk of Bias

The methodological quality of each included systematic review was assessed by two independent investigators using the Risk of Bias Assessment Tool by Drucker et al. [32]. This tool identifies six bias domains: protocol pre-registration, evidence selection, bias assessment in the studied reviews, competing interests, ‘spin’ (i.e. misleading reporting, interpretation or extrapolation of results) and interpretation of findings. Reviews scoring 6/6 were deemed to be of high quality, providing an accurate and comprehensive summary of the results of the available studies that address the question of interest. Reviews scoring 4–5/6 were considered moderate quality, indicating more than one weakness but no critical flaws. Reviews scoring 2–3/6 were classified as low quality, indicating the presence of one critical flaw with or without non-critical weaknesses. Finally, reviews scoring less than 2/6 were deemed to be of critically low quality, indicating more than one critical flaw with or without non-critical weaknesses.
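
To make the scoring rubric above concrete, the following minimal sketch maps a review's total score across the six bias domains to the quality categories used in this review (an illustration written for this synthesis; the function name and structure are ours, not part of the Drucker et al. tool):

```python
def quality_category(score: int) -> str:
    """Map a total Drucker et al. risk-of-bias score (number of the six
    bias domains satisfied, 0-6) to the quality categories used here."""
    if not 0 <= score <= 6:
        raise ValueError("score must be between 0 and 6")
    if score == 6:
        return "high"            # accurate, comprehensive summary
    if score >= 4:
        return "moderate"        # weaknesses, but no critical flaws
    if score >= 2:
        return "low"             # one critical flaw
    return "critically low"      # more than one critical flaw
```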

2.6 Data Extraction

Data extraction from eligible studies was undertaken by authors CD and MB. Extracted data comprised: (1) publication/author details [digital object identifier (DOI), title, authors, corresponding author’s contact, publication year, country of the first author and funding sources]; (2) review protocol [review criteria, target population, targeted outcome, research setting (free-living or controlled), sample size, sex distribution, age and health status]; (3) device-specific information (criterion measure, device brand and data acquisition protocol); and (4) results (summary of validity and authors’ conclusions). Where additional details were required, corresponding authors were approached via email.

2.7 Data Synthesis

To synthesise the results, we extracted and prioritised specific statistics from the included systematic reviews, focusing on metrics that allowed for meaningful comparison across studies. The primary statistics extracted included mean absolute percentage error (MAPE), pooled absolute bias, intraclass correlation coefficients (ICCs) and mean absolute differences. Where available, confidence intervals (CI) were also extracted to provide context to the reported metrics. In cases where multiple reviews covered the same biometric outcome, we prioritised data from reviews on the basis of their methodological quality, as assessed by the Risk of Bias Assessment Tool outlined above. Reviews deemed to be of high quality were given precedence in the synthesis, followed by those of moderate and then low quality. This hierarchical approach ensured that the most reliable and robust data informed our conclusions.
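
For readers unfamiliar with these metrics, the sketch below shows how MAPE and a Bland–Altman bias with 95% limits of agreement are typically computed from paired device and criterion readings. This is a minimal illustration (the data and variable names are invented), not code used by any of the included reviews.

```python
import numpy as np

def mape(device: np.ndarray, criterion: np.ndarray) -> float:
    """Mean absolute percentage error of device readings vs the criterion."""
    return float(np.mean(np.abs((device - criterion) / criterion)) * 100)

def bland_altman(device: np.ndarray, criterion: np.ndarray):
    """Bias (mean difference) and 95% limits of agreement (bias +/- 1.96 SD)."""
    diff = device - criterion
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Invented example: heart rate (bpm) from a wearable vs an ECG criterion
device = np.array([78.0, 121.0, 95.0, 143.0, 88.0])
criterion = np.array([80.0, 125.0, 97.0, 150.0, 90.0])
print(mape(device, criterion))           # error as a percentage of criterion
print(bland_altman(device, criterion))   # negative bias = underestimation
```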

For biometric outcomes with varied reporting metrics, we standardised the data presentation to enhance clarity and comparability. For instance, where multiple forms of error metrics were reported (e.g. bias and ICC), we opted for the statistic most frequently used across reviews to maintain consistency. We cross-checked the primary research studies included in each systematic review to identify and remove duplicates, ensuring that each original study was only counted once in our synthesis. In instances where numerical data were not provided (e.g. descriptive summaries of sleep metrics), we included these qualitative assessments but noted the absence of specific statistical measures.

2.8 Living Review Implementation

Given the dynamism in the wearable technology sector and the associated research [16], we plan to continually update this synthesis. This will be achieved through a regular search update, whereby we will perform systematic searches every 6 months across all predetermined databases. Any new systematic reviews identified will be screened for eligibility and, if suitable, will be incorporated into the living review. Upon identification of new eligible systematic reviews, the same extraction method described above will be employed. Data from new reviews will be synthesised with the existing evidence. We will upload review updates to OSF.io (https://osf.io/fqvms/?view_only=e49e7c42dfd3475db5cf2da3b15e4b3f). Every update will result in a new version of the living review. Each version will have a unique identifier and a changelog to outline the differences from the previous version. All versions of the review will be made accessible to readers, allowing them to trace the evolution of evidence over time. The need for the review to remain ‘living’ will be assessed annually. Factors considered will include the pace of emerging evidence, the evolution of wearable technologies and feedback from the community. If there is a reduction in the pace of new evidence or if the topic reaches a point of saturation, the review might transition from a ‘living’ status to a traditional static review.
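
As a purely illustrative sketch of the versioning scheme described above (the field names and example values are hypothetical, not a prescribed OSF format), each update could be recorded as a structured entry such as:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewVersion:
    """One release of the living review, as described in this section."""
    identifier: str          # unique version identifier, e.g. "v2.0"
    search_date: str         # date of the 6-monthly search update
    reviews_added: int       # new eligible systematic reviews incorporated
    changelog: list[str] = field(default_factory=list)

# Hypothetical example of a future update entry
update = ReviewVersion(
    identifier="v2.0",
    search_date="2024-12-05",
    reviews_added=3,
    changelog=["Added 3 reviews on sleep validity",
               "Re-ran risk-of-bias assessment for new entries"],
)
```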

3 Results

3.1 Literature Search

From an initial pool of 904 studies identified for the review, 92 duplicates were removed, leaving 812 studies to be screened by title and abstract. Of these, 771 did not meet the inclusion criteria and were excluded. This resulted in 41 studies that were assessed for full-text eligibility. Upon full-text assessment, 17 of these studies were further excluded: 11 did not discuss commercially available wearables as distinct entities, 4 had no available full text (for instance, conference submissions), and 2 utilised a dataset duplicated in another review. Consequently, a total of 24 systematic reviews (8 of which performed some kind of meta-analysis) were included in this umbrella review [15, 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55]. The process of study selection is depicted in the PRISMA flow diagram (Fig. 1).

Fig. 1
figure 1

Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) flow diagram

3.2 Characteristics of the Systematic Reviews Included in the Umbrella Review

The 24 reviews collectively included 653 studies. The eligible systematic reviews (16 out of 24) and meta-analyses (8 out of 24) were published between 2013 and 2024. Because many primary studies appeared in more than one review, we extracted the following information directly from the primary research: (1) publication/author details (DOI, title, authors, publication year), (2) population group (target population, age range, number of males/females), (3) biometric outcome studied (e.g. heart rate), (4) wearable device used (including device manufacturer and name), (5) criterion measure used for comparison, and (6) statistical analysis.

After removing duplicated studies across the reviews, 391 primary research studies of 888,033 participants were identified. However, only 249 of these were validation studies of consumer-grade wearable devices, collectively involving 430,465 participants (243,068 males and 181,064 females; 80 studies did not provide sufficient detail to determine the sex distribution of the participants).

The characteristics of the 24 reviews, including the population groups, number of participants, biometric outcomes of interest and criterion measures used, are presented in Table 1. The full characteristics of the dataset, including the narrative synthesis of the results and authors’ conclusions, are available in Supplemental File 2. Article metadata for the individual studies included in the 24 reviews is also available in Supplemental File 2.

Table 1 Characteristics of included systematic reviews

3.3 Overview of Consumer Wearable Devices and Validation Status

To determine the number of consumer wearable devices that have been validated to date, we first compiled a list of devices that have been on the market since 2003 (the year the Garmin Forerunner 101 was released). This list includes 310 consumer wearable devices released by the following manufacturers: Amazfit (21), Apple (14), Coros (8), Fitbit (32), Fossil (11), Garmin (72), Google (2), Huawei (24), Jawbone (6), Mi (10), Misfit (2), Oura (3), Polar (24), Redmi (3), Samsung (24), Skagen (2), Suunto (13), TicWatch (15), Whoop (4) and Withings (20).

Of these 310 devices, 34 (11%) were validated for at least one biometric outcome in at least one of the primary research studies included in the reviews identified via our search. The most commonly validated device manufacturers were Fitbit (31 studies), Apple (12 studies) and Polar (6 studies).

However, most consumer wearables can measure a multitude of biometric outcomes, with some of the most common being step count, heart rate, sleep, physical activity and energy expenditure. Assuming that each wearable in our list of 310 devices can measure these five biometric outcomes, the potential number of discrete validation studies required to have a complete picture of consumer wearable validity is 1550 (i.e. 310 devices × 5 outcomes). The actual number of validity studies that have been conducted on these wearables for the specified biometric outcomes is 54, representing only 3.5% of the total number of potential validations.
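
The coverage estimates above reduce to simple arithmetic, reproduced here for transparency using the figures reported in this section:

```python
devices = 310                      # consumer wearables released since 2003
validated_devices = 34             # validated for >= 1 biometric outcome
outcomes_per_device = 5            # steps, heart rate, sleep, activity, energy
validations_conducted = 54         # device-outcome validations identified

potential = devices * outcomes_per_device          # 1550
print(f"{validated_devices / devices:.1%}")        # ~11% of devices
print(f"{validations_conducted / potential:.1%}")  # ~3.5% of potential
```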

The full details of the devices that have been validated, linked to the specific studies in which they were evaluated, are available in Supplemental File 2.

3.4 Credibility of Evidence of Measurement Using Wearable Devices

The results of the risk-of-bias (RoB) assessment are presented in Table 2. Only two reviews (8%, both of which included meta-analyses) showed a high quality of evaluation; 10 (42%) evaluations were of moderate quality, and a further 12 (50%) were deemed to be of low quality. No reviews were deemed to be of critically low quality. In the following sections, we mainly describe the results of evaluations that performed a meta-analysis (where available) of the highest available quality (i.e. sequentially, on the basis of the availability of high, moderate and finally low quality).

Table 2 Results of the risk of bias assessment

3.5 Summary of Evaluations

The 24 systematic reviews evaluated 11 biometric outcomes across three broad domains, i.e. cardiovascular, physical activity and sleep; 15 reviews evaluated a single biometric outcome, with 9 reviews evaluating more than one. The biometric outcomes evaluated included: heart rate (six reviews [15, 36, 40, 45, 47, 55]), heart rate variability (two reviews [34, 39]), cardiac arrhythmia (six reviews [33, 41, 44, 47, 49, 51]), aerobic capacity (one review [50]), blood oxygen saturation (one review [54]), step counting (eight reviews [15, 36,37,38, 40, 43, 45, 46]), wheelchair push counts (one review [35]), physical activity duration (four reviews [37, 38, 40, 43]), energy expenditure (eight reviews [15, 36,37,38, 40, 43, 48, 52]) and sleep (four reviews [37, 38, 42, 53]). The specific devices included in the studies identified by each review and the biometric outcome(s) they were validated against are presented in Table 3 and illustrated in Fig. 2.

Table 3 Biometric outcomes and wearable devices evaluated in each review
Fig. 2
figure 2

Biometric outcomes and wearable devices evaluated in each review

3.6 Validity of Wearables for Cardiovascular Health Biometrics

3.6.1 Heart Rate

Six systematic reviews (including two meta-analyses) assessed the validity of wearables for heart rate measurements [15, 36, 40, 45, 47, 55]. These reviews included 165 non-duplicate studies of 5816 participants (~ 50% female; Supplemental File 3). Only one review was deemed to be of high quality, with two reviews deemed to be of moderate quality and three of low quality. However, the high-quality review by Chevance et al. focused specifically on the accuracy of one device manufacturer—Fitbit—in comparison with the reference standard, so the findings of the moderate-quality reviews are also summarised below. The reference standards used for the measurement of heart rate included electrocardiograms (ECG), Polar chest straps and pulse oximetry.

In the 52 studies included in the review by Chevance et al. [36], the pooled estimate of the meta-analysis showed a mean bias of − 3.39 bpm, indicating an underestimation of Fitbit devices for measuring heart rate compared with criterion measures. Subgroup analyses by population characteristics, activity intensities and types and device models were consistent with this; however, the subgroup meta-analyses for different intensities and types of activities revealed a greater underestimation of heart rate for cycling than for daily living activities, treadmill activities and overground walking [36]. Measurement accuracy was also better for treadmill activities than for overground walking, and better at moderate-to-vigorous intensities than at light intensities [36].

Neither of the reviews deemed to be of moderate quality performed a meta-analysis [15, 45]. In a narrative synthesis of 29 studies, Fuller et al. found that wrist-worn wearables typically exhibit measurement errors of approximately ± 3% regardless of the device (including those manufactured by Apple, Fitbit and Garmin) or criterion used [15]. The review by Irwin and Gary focused specifically on the Fitbit Charge 2 devices; five of the eight studies included in this review reported mean absolute percent error values of < 10% [45].

3.6.2 Heart Rate Variability

Two systematic reviews, one deemed to be of moderate quality and one of low quality, assessed the validity of wearables for measuring heart rate variability [34, 39]. Collectively, these reviews included 22 non-duplicate studies of 714 participants (~ 32% female; Supplemental File 3). Neither undertook a meta-analysis. Both reviews identified 2-lead, 3-lead, 5-lead and 12-lead ECG recordings as suitable criterion measures for the measurement of heart rate variability.

The review by Board et al. was deemed to be of moderate quality and included 13 studies, solely of Polar devices, finding “near perfect validity” (ICC 0.98–1.00) for temporal and spectral power heart rate variability (HRV) measures computed from inter-beat interval data when measurements were taken at rest [34]. In contrast, the review deemed to be of low quality by Georgiou et al. [39] evaluated a range of devices during both rest and exercise across 18 studies. Again, agreement between the indices of HRV was very good to excellent (ICCs ranging from 0.85 to 0.99) when measurements were taken at rest; however, agreement decreased to ICCs of around 0.85 as the level of exercise and/or motion increased.
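
The ICCs reported in these reviews index absolute agreement between device and criterion. As a point of reference, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-measurement ICC(2,1), computed from its ANOVA mean squares (our own illustration with invented data; the included reviews do not publish their computation code):

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.
    x has shape (n_subjects, k_raters), e.g. columns = device, criterion."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ssr = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between-subjects
    ssc = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between-raters
    sse = np.sum((x - grand) ** 2) - ssr - ssc        # residual
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented example: inter-beat intervals (ms), columns = device, criterion
data = np.array([[812.0, 810.0],
                 [905.0, 902.0],
                 [760.0, 765.0],
                 [880.0, 878.0]])
print(icc_2_1(data))  # close to 1.0 when agreement is near perfect
```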

3.6.3 Arrhythmia Detection

Six systematic reviews (two meta-analyses), two deemed to be of moderate quality and four of low quality, assessed the validity of wearables for measuring cardiac arrhythmia (including atrial fibrillation) [33, 41, 44, 47, 49, 51]. Collectively, these reviews included 85 non-duplicate studies of 877,127 participants (~ 42% female; Supplemental File 3). The reference standard for arrhythmia detection in these reviews included a 12-lead ECG, a Holter monitor, an ECG patch, telemetry, or an internet-enabled mobile ECG.

The meta-analysis by Nazarian et al. [51] measured diagnostic accuracy using smartwatches in 424,371 participants across 18 studies. The Apple Watch was used in seven studies, Samsung smartwatches were used in five studies and the remaining studies used a Huawei, Huami or Empatica smartwatch. Wearables demonstrated a pooled sensitivity of 100% (95% CI 99–100%) and a pooled specificity of 95% (95% CI 93–97%) for detecting cardiac arrhythmias in the sample populations. The pooled accuracy for arrhythmia detection was 97% (95% CI 96–99%).
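
The pooled sensitivity and specificity reported by Nazarian et al. derive, at the study level, from standard confusion-matrix counts. The sketch below shows the computation for a single hypothetical study; the counts are invented for illustration and are not taken from the review.

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical single-study counts for arrhythmia detection
sensitivity, specificity = sens_spec(tp=98, fn=2, tn=940, fp=60)
print(f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
# sensitivity 98%, specificity 94%
```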

The other moderate quality systematic review by Lopez Perales et al. [49] evaluated several kinds of mobile health applications for detecting atrial fibrillation. Of the 11 included studies that evaluated a wearable device (including smartwatches and smart bands), diagnostic accuracy varied depending on the different algorithms utilised, the populations studied and the testing conditions. Smartwatches showed a sensitivity of 67.7–100% and a specificity of 67.6–98%, while smart bands showed a sensitivity of 75.4–97% and a specificity of 94–100% [49].

3.6.4 Aerobic Capacity

One high-quality systematic review and meta-analysis assessed the validity of wearables for measuring aerobic capacity (or VO2max) [50], the criterion measure for which was a graded exercise test to exhaustion with direct or indirect calorimetry. This review included 14 studies of 403 participants (45% female).

The results of the meta-analysis by Molina-Garcia et al. [50] showed that wearables using a resting test significantly overestimated VO2max (bias = 2.17 ml kg−1 min−1; 95% CI 0.28–4.07). Conversely, wearables estimating VO2max through exercise tests showed a bias close to nil compared with the reference standard (bias =  − 0.09 ml kg−1 min−1; 95% CI − 1.66 to 1.48). The limits of agreements in the resting test spanned from − 13.07 to 17.41 ml kg−1 min−1 (i.e. ± 15.24; 95% CI − 22.18 to 26.53), while limits were narrower in exercise testing conditions, spanning from − 9.92 to 9.74 ml kg−1 min−1 (i.e. ± 9.83; 95% CI − 16.79 to 16.61).
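
These limits of agreement follow the standard Bland–Altman construction, and the figures reported by Molina-Garcia et al. are internally consistent with it:

```latex
\mathrm{LoA} = \mathrm{bias} \pm 1.96\, s_{\mathrm{diff}}
```

Applying this to the reported values: for the resting test, 2.17 ± 15.24 gives (− 13.07, 17.41) ml kg−1 min−1, and for the exercise test, − 0.09 ± 9.83 gives (− 9.92, 9.74) ml kg−1 min−1, matching the intervals above.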

3.6.5 Blood Oxygen Saturation

One low-quality systematic review and meta-analysis assessed the validity of wearables for measuring blood oxygen saturation [54]. This review did not stipulate a ‘ground truth’ criterion measure; however, across the five included studies of 973 participants, the criterion measure was typically a pulse oximeter. None of the included studies reported the sex distribution in their respective cohorts. Each of the five included studies only evaluated the accuracy of the Apple Watch (Series 6) against their chosen criterion.

The results showed that the Apple Watch Series 6 generally had a moderate to strong correlation with conventional pulse oximeters in measuring blood oxygen saturation, with Pearson correlation coefficients ranging from 0.76 to 0.89 in various studies. The mean absolute differences in oxygen saturation (SpO2) measurements between the Apple Watch and conventional oximeters ranged up to 2.0%. However, limits of agreement varied, with some studies showing ranges of − 5.8% to + 5.9%.

3.7 Validity of Wearables for Measuring Physical Activity

3.7.1 Step Counting

Eight systematic reviews (including one meta-analysis) assessed the validity of wearables for counting steps [15, 36,37,38, 40, 43, 45, 46]. Collectively, these reviews included 184 non-duplicate studies of 6197 participants (~ 53% female; Supplemental File 3). Only one review was deemed to be of high quality, with three reviews deemed to be of moderate quality and four of low quality. As previously mentioned, the high-quality review by Chevance et al. [36] focused specifically on the accuracy of Fitbit devices, so the findings of the moderate quality reviews are also summarised below. The criterion measure for counting steps included manual counting (either in controlled settings or through video recording) or using accelerometers and pedometers (in free-living settings).

Chevance et al. [36] found in their analysis of 15 studies on Fitbit devices that most either reported underestimations or were inconclusive; the mean bias, when excluding inferior quality studies, was − 3.11 steps per minute (range − 13 to 7).

None of the reviews deemed to be of moderate quality performed a meta-analysis [15, 38, 45]. The review by Irwin and Gary focused specifically on the Fitbit Charge 2 device; they observed mean absolute percent error values of 12% [45]. Feehan et al. [38] also focused on Fitbit devices, which tended to underestimate counts by − 9% (median error − 3%). Fuller et al. [15] found that wrist-worn wearables typically underestimated step count (mean − 9%, median − 2%), with Withings and Misfit wearables consistently underestimating step count, and Apple and Samsung devices demonstrating less measurement variability than other brands.

3.7.2 Wheelchair Push Counts

One systematic review by Byrne et al. [35] that was deemed to be of low quality investigated the accuracy of wheelchair push counts measured by various fitness watches. This review included seven studies involving 131 wheelchair users (both able-bodied and those with disabilities; 40% female) aged over 18 years. The criterion measure used was direct observation in controlled laboratory settings. This method involved directly counting the wheelchair pushes and rotations to validate the accuracy of the wearable devices’ measurements.

The devices evaluated included various generations of the Apple Watch (Series 1 through 4), Garmin VivoFit, Fitbit Flex, Fitbit Flex 2, Jawbone UP24, and the Activ8 activity monitor. Among these, the Apple Watch Series 4 demonstrated the highest accuracy, with a mean absolute percentage error (MAPE) of 9.20%, compared with the Apple Watch Series 1 with a MAPE of 20.62%. The calibrated Apple Watch had a MAPE of 13.9%, whereas the uncalibrated version showed a higher MAPE of 22.8%. The Fitbit Flex 2 had the highest MAPE at 148.4%.

3.7.3 Physical Activity Intensity

Four systematic reviews (none of which included a meta-analysis), one deemed to be of moderate quality and three of low quality, assessed the validity of wearables for measuring physical activity intensity [37, 38, 40, 43]. However, the moderate-quality review by Feehan et al. [38] focused specifically on the accuracy of one device manufacturer—Fitbit—in comparison with the reference standard, so the findings of the low-quality reviews are also summarised below.

Collectively, these reviews included 126 non-duplicate studies of 4087 participants (~ 55% female; Supplemental File 3). Their validation, primarily against the reference standard of research-grade accelerometers (e.g. ActiGraph), revealed mixed results. Specifically, correlations with accelerometers varied considerably and were influenced by elements such as activity intensity and duration. For Fitbit devices, the measurement error for sedentary time was an underestimation of less than 10%, but over 80% of comparisons showed a measurement error greater than 10% for time spent in light to vigorous activity. In some cases, Fitbit devices tended to overcount minutes of moderate-to-vigorous physical activity, with mean absolute differences reaching up to 89.8 min per day. Other devices (e.g. Polar) showed varied performance, with some models offering strong correlations for light to vigorous physical activity intensities but generally presenting poor agreement with reference devices, with mean absolute percentage errors ranging from 29 to 80% [37].

3.7.4 Energy Expenditure

Eight systematic reviews (three meta-analyses), one deemed to be of high quality, four of moderate quality and three of low quality, assessed the validity of wearables for measuring energy expenditure [15, 36,37,38, 40, 43, 48, 52]. Collectively, these reviews included 218 non-duplicate studies of 7734 participants (~ 54% female; Supplemental File 3). The reference standard for energy expenditure in these reviews included the doubly labelled water method, indirect and direct calorimetry in controlled (laboratory) environments and accelerometry in free-living settings. Again, the single high-quality review by Chevance et al. focused specifically on the accuracy of Fitbit devices [36], but so too did the moderate quality reviews by Feehan et al. [38] and Leung et al. [48].

Collectively, these reviews showed that wearables (Fitbit devices specifically) tended to underestimate energy expenditure by approximately 3 kcal per min (limits of agreement − 13 to 7 kcal per min) [36] or by − 3% [38].

In the systematic review and meta-analysis of 64 studies by O’Driscoll et al. [52], findings for a more diverse collection of devices showed that, again, wearables tended to underestimate energy expenditure [effect size (ES): − 0.23, 95% CI − 0.44 to − 0.03; n = 104; p = 0.03], with error in primary research studies ranging from − 21.27 to 14.76%. However, sensitivity analysis revealed that, upon the removal of specific devices from the analysis, the comparison with the criterion measure became non-significant (i.e. select wearables were highly accurate). Study heterogeneity was significant, though, and the authors urged caution when interpreting their results: “while it is initially encouraging that the effect size for many devices was not significantly different from criterion, the 95% CI observed in many cases indicates the potential for these devices to produce erroneous estimates of mean energy expenditure and as such we would be hesitant to consider any device sufficiently accurate.”
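
Pooled effect sizes such as O’Driscoll et al.’s are typically obtained from a random-effects model. The following is a minimal DerSimonian–Laird sketch, with invented inputs, showing how study-level effects and variances yield a pooled estimate, 95% CI and between-study variance; it is our illustration, not the authors’ code.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Pool study-level effect sizes with a DerSimonian-Laird
    random-effects model; returns (pooled, 95% CI, tau^2)."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances                       # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
    q = np.sum(w * (effects - fixed) ** 2)    # Cochran's Q statistic
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)             # between-study variance
    w_star = 1.0 / (variances + tau2)         # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2

# Invented study-level effects (device minus criterion) and variances
print(dersimonian_laird([-0.4, -0.1, -0.25], [0.02, 0.03, 0.025]))
```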

3.8 Validity of Wearables for Measuring Sleep

Four systematic reviews (including one meta-analysis), two deemed to be of moderate quality [38, 42] and two of low quality [37, 53], assessed the validity of wearables for measuring various aspects of sleep. Collectively, these reviews included 95 non-duplicate studies of 2985 participants (~ 54% female; Supplemental File 3). The criterion measure used in these reviews for sleep was polysomnography.

In their review of four studies involving 180 participants that evaluated the accuracy with which wearables could measure total sleep time (TST) or wakefulness, Evenson et al. [37] noted that consumer wearables tended to overestimate TST while concurrently underestimating wakefulness after sleep onset (mean absolute difference of approximately 22.0 min/day; ICC 0.85). Haghayegh et al. [42] derived their conclusions from 22 studies of 438 participants, comprising both regular sleepers and individuals with diagnosed sleep disorders. The collective output from these studies was that wearables predominantly overestimated TST [42]: eight studies within this review reported significant overestimations, and two reported non-significant overestimations [42]. In contrast, these devices showed a consistent trend of underestimating wakefulness after sleep onset; five studies underscored significant underestimations, with another suggesting a non-significant underestimation. However, there were instances (based on three studies) where the readings between wearables and polysomnography (PSG) for wake time after sleep onset did not exhibit substantial disparities.

Haghayegh et al. [42] also reported on sleep efficiency, with one study indicating that wearables significantly overestimated it against PSG. Sleep onset latency, the duration taken to transition from full wakefulness to sleep, was uniformly underestimated by wearables; their review found three studies supporting this observation, one of which reported a significant difference. Feehan et al. [38] specifically evaluated Fitbit devices under controlled settings. Their conclusions, derived from three studies, reiterated the patterns seen elsewhere: Fitbits typically overestimated TST and sleep efficiency, often by more than 10%. Their exploration of sleep onset latency and wakefulness after sleep onset underscored vast inconsistencies, with measurement errors ranging between 12 and 180% [38].

3.9 Meta-analysis

A meta-analysis of the individual studies was not undertaken as part of this umbrella review due to the heterogeneity in protocols, criterion measures, devices (including firmware versions) and outcomes evaluated. Specifically, of the 249 studies that validated at least one consumer wearable device for assessing at least one biometric outcome, only 166 used an accepted gold standard criterion. Of the studies that used an accepted gold standard, only 75 used appropriate statistical analysis [27] to determine device accuracy. Consequently, in all cases, for each biometric outcome there were fewer than five studies evaluating the same device for the same biometric outcome that used an accepted reference standard and appropriate statistical analysis. This precluded meta-analysis.

Going forward, subject to a sufficiently homogeneous body of literature emerging for specific devices and outcomes following recommended protocols [25,26,27], we will endeavour to undertake a meta-analysis of individual research studies as part of our ‘living’ research synthesis.

4 Discussion

This umbrella review aimed to synthesise evidence from systematic reviews evaluating the validity of consumer wearable technologies for measuring biometric outcomes such as heart rate, aerobic capacity, energy expenditure, sleep and surrogate measures of physical activity. A total of 24 systematic reviews were deemed eligible for inclusion. After removal of studies duplicated across reviews, 391 unique studies remained, collectively including 888,033 participants (approximately 42% female).

For biometrics related to the cardiovascular system, our review revealed a subfield of research where accuracy varied on the basis of the specific measure assessed and the context in which the wearables were utilised. Heart rate measurements, for example, were associated with errors of approximately − 3.39 bpm [36] or ± 3% [15], contingent upon user characteristics (e.g. skin tone), exercise intensities, types of activities and device models [15, 36]. In contrast, the available research evaluating heart rate variability showed a high level of accuracy at rest, with strong agreement with criterion measures, but this deteriorated during periods of physical exercise or motion [34, 39]. For arrhythmia detection, research primarily on devices manufactured by Apple and Samsung demonstrated good sensitivity and specificity [51], although it is important to note that the participants included in these studies were generally derived from clinical populations with a prior diagnosis of cardiac arrhythmias such as atrial fibrillation. For aerobic capacity, a powerful measure of overall mortality risk [56, 57], the review conducted by the INTERLIVE consortium noted a tendency for wearables to overestimate VO2max during resting tests but found them to be more accurate during exercise tests [50]. Finally, when measuring blood oxygen saturation, the mean absolute differences in SpO2 measurements taken from a wearable device (specifically, the Apple Watch Series 6) and the criterion of pulse oximetry were up to 2.0% [54].

In the realm of physical activity, there was a trend for wearables to underestimate step counts, with disparities observed across various brands and models [15, 36]. Regarding the estimation of time spent in different physical activity intensity zones, there was marked inconsistency, with high measurement errors influenced by the type and intensity of activities [37]. Energy expenditure assessments by wearables seem to veer towards underestimation, with a significant range of error margins and device-specific accuracy nuances [36, 38, 48, 52].

Finally, in relation to sleep, wearable devices showed a consistent trend of overestimating total sleep time (TST) and sleep efficiency. For example, wearables typically overestimated TST by more than 10%, and sleep efficiency similarly showed overestimations in multiple studies. Additionally, wearables tended to underestimate sleep onset latency and wakefulness after sleep onset, with errors ranging from 12 to 180% [37, 38, 42]. These findings indicate significant disparities when benchmarked against polysomnography standards, highlighting the need for further refinement and validation of sleep tracking features in consumer wearables.

Taken at face value, our results would suggest that consumer wearables appear moderately proficient in capturing various health outcomes such as heart rate, heart rate variability, aerobic capacity and others. However, this does not capture the nuance of the current state of wearable technology research. To explore the current research landscape, we collated a list of 310 consumer wearables spanning various manufacturers. Of these devices, only 34 (11%) have been validated for at least one biometric outcome in the primary research studies included in our review. Given that most consumer wearables can measure multiple biometric outcomes, the potential number of discrete validation studies required is substantial. Specifically, if each device were validated for five outcomes (step count, heart rate, sleep, physical activity and energy expenditure), this list would necessitate 1550 individual validation studies. However, only 54 validation studies (3.5% of the potential total) have been conducted to date. This gap between the number of wearable devices on the market and those that have undergone validation is widely acknowledged among researchers [16], and it underscores the challenge of conducting validation research that keeps pace with the extremely agile commercial ecosystem. The field has attempted to stay current with a multitude of devices, which often have annual release cycles and use diverse methodologies, leading to significant heterogeneity in research outcomes [16].

Thus, our findings do not conclusively indicate that wearables consistently underestimate heart rate, overestimate sleep time or fail to accurately measure energy expenditure (for example). Instead, this umbrella review reveals the intricate variability across devices, outcomes, user contexts and reference standards, making a definitive assessment of wearables’ accuracy challenging. The pervasive heterogeneity in research methodologies and findings limits the practical applicability of these technologies, highlighting the urgent need for standardised validation protocols that are both rigorous and adaptable to the rapidly evolving wearable technology landscape. This need for standardisation is evident from the fact that only 71% of the primary research studies included in the reviews utilised an accepted gold standard criterion, and only 40% of these studies followed best practice statistical analysis [27] to determine device accuracy. Developing robust frameworks and fostering collaborative partnerships with industry are essential to enhance the reliability, consistency, and scientific integrity of wearable technology assessments [16]. Therefore, the question of wearables’ accuracy remains inherently indeterminate, influenced by device-specific, outcome-centric, user-related and contextual variables, making a decisive answer elusive based on the currently available literature.

One of the cardinal challenges encountered in reviewing the literature in this field is the swift and continuous evolution of the commercial landscape for wearable technologies. Given this rapid advancement, the research captured within our review inevitably serves as a historical snapshot, reflecting the accuracy and validity of devices as they existed approximately 2 years prior to the date the search was conducted. This temporal disconnect is exemplified by the chronology of our reference materials—only one of the included reviews was published in 2024 [53], with most being published in 2022 [36, 40, 44, 45, 47, 48, 50]—and the most recent primary study included therein was published in August 2022 [58]. Illustrative of the ongoing output of the commercial engine, consider that, as at the time of writing (June 2024), the market has seen the release of three new iterations of Apple Watch alone since the most recent primary research study included in our results. A universal observation across the wearables reviewed in this research synthesis is their transient market presence; each device analysed has since either been retired or superseded by a more recent model. Despite a semblance of hardware continuity in newer models, the frequent deployment of updated firmware and algorithms can profoundly impact device performance and measurement accuracy. This illustrates a fundamental tension: the measured and methodical pace of rigorous academic inquiry versus the agile, ever-fluxing dynamism inherent in the commercial technology ecosystem. Consequently, our findings, while insightful, may not fully mirror the current state of wearable technology capabilities.

This underscores the importance of making this a ‘living’ research synthesis that evolves concurrently with ongoing technological advancements and refinements. In the realm of wearable technology validation, there exists a kind of ‘validation economy’ marked by diverse, simultaneous efforts aimed at assessing and ensuring device accuracy and reliability. At one end of the spectrum, the popular media landscape is populated by vloggers and influencers who wield substantial reach and influence. Their ‘reviews’—often reactionary, anecdotal and based on single-subject analyses—command vast audiences, albeit with intrinsic methodological limitations. For instance, a review by Marques Brownlee amassed 3 million views within the first 3 weeks of the latest Apple Watch release, reflecting the potent sway of such platforms despite their often informal and experiential evaluation methods [59]. In parallel, more structured and formalised efforts are underway within academic and professional circles to cultivate rigorous validation frameworks. Initiatives such as INTERLIVE spearhead these endeavours by advocating for standardised protocols and best practices in evaluating wearable technologies [25,26,27]. Through its collaborative network, INTERLIVE strives to formalise and standardise many aspects of validity assessment for consumer-grade wearables, towards the development of recommendations and guidelines that bolster the utility and reliability of wearable-derived data in capturing physical activity indicators. In addition, organisations such as FIMS have instituted quality assurance standards to scrutinise and verify the marketing claims of wearable device companies [24]. Their approach envisions a proactive validation process, wherein manufacturers engage in a pre-market validation exercise, submitting their devices for rigorous evaluation against established research benchmarks or appropriate proxies [24]. Meanwhile, vast data repositories are being cultivated through national initiatives such as the All of Us program from the National Institutes of Health (NIH) [60] and the UK Biobank [61]. These databanks harvest extensive, longitudinal health data from wearable devices, fostering a richer, more nuanced understanding of various health conditions and the role of wearable technologies in managing them. Private enterprises, too, are making significant strides, specialising in harnessing and analysing wearable-derived data to bolster both individual and corporate wellness endeavours [62,63,64,65].

In this context, we hope that this living umbrella review will serve to synthesise and harmonise the disparate threads of research and evaluation emanating from various quarters. While its primary focus is to provide researchers with a consolidated and continuously updated synthesis of the latest evidence, we recognise that our readership will likely extend beyond the traditional academic audience. This includes healthcare professionals, policy makers, technology developers and an informed public interested in the accuracy and reliability of wearable technologies. Given the broad and diverse interest in wearable technology, our dissemination strategy aims to maximise the impact of our findings across multiple stakeholder groups. Our public engagement activities will include dissemination of these findings via personal and institutional social media, on our YouTube channel [66] and in specific undergraduate and postgraduate modules in digital health and medical devices. We will involve patient and public representatives in creating a plain language summary of findings to be distributed to the general population, informing policy makers across different countries with written communications [16]. By adopting this approach, our review seeks to become a resource and ally for a multitude of stakeholders. For the vloggers and influencers navigating the vibrant but volatile landscape of new device releases and software updates, it will offer a repository of up-to-date research findings. For formal validation bodies and research consortiums such as INTERLIVE and FIMS, it will provide a cohesive and continuous synthesis of global research efforts, bolstering their frameworks and recommendations with a broader perspective and the latest insights. End-users, too, stand to gain from this living umbrella review. With wearables becoming more ingrained in society [67], and as their user base expands and diversifies, the review will be a valuable resource to help users determine the accuracy and reliability of various devices. It is anticipated that the value of the review will grow as the pace of research accelerates and wearable devices permeate deeper into everyday health management and lifestyle practices.

Yet, despite its strengths and potential value, this review is not without limitations. First, as previously mentioned, an intrinsic challenge lies in the temporal incongruence between the fast-paced evolution of consumer wearable technologies and the more deliberate and methodical pace of academic research and publishing. Due to the procedural necessities of academic rigour—protocol development, study implementation, data analysis and the subsequent publication processes—our synthesis inherently lags behind the latest commercial innovations and releases [16]. Second, our review operates primarily as a tertiary research synthesis, grounding its insights in secondary research—systematic reviews of primary studies. This methodology, while offering a powerful and practical way of collating and synthesising large amounts of data, introduces vulnerability to potential biases or errors that might pervade the secondary research layers—conclusions drawn herein are reflections of the interpretations, methodologies and potential biases of the underlying reviews and their authors. Indeed, our effort to mitigate these limitations manifested in our risk of bias assessment, which showed that the 24 systematic reviews deemed eligible for inclusion rarely met all of the established criteria. This underscores the need for cautious interpretation and application of our findings. Ultimately, this umbrella review should first be considered a directory to research in the field and, second, a high-level overview of results. The variability that pervades various dimensions of the incorporated studies—including their chosen protocols, selected devices, criterion measures, demographic considerations and statistical methodologies—and the nested structure of conclusions necessitate a cautious approach in extracting coherent and reliable signals amidst the noise of disparate findings [68].

5 Conclusion

In conclusion, this umbrella review illuminates the varied landscape in consumer wearable technology research, showcasing the potential and complexity inherent in their validation and utilisation. Our findings also highlight the need to formalise and standardise device- and biometric-specific validation protocols, and the potential benefit of an agile research model according to which new devices can be evaluated. Furthermore, fostering collaborative synergies between formal certification bodies, academic research consortia, popular media influencers and industry could augment the depth, reach and inclusivity of wearable technology evaluations, enabling a richer, multifaceted dialogue that resonates with a broad spectrum of stakeholders. Opportunities also lie in the expansion of validity assessments into diverse and unexplored terrains of wearable utility, such as stress, training readiness and ‘body battery’ scores; wearable technology companies such as Fitbit, Garmin, Oura and Whoop have each created variants of these ‘bespoke biometrics’ which collate multiple biometric signals to give a user an idea of their current health status; however, none have undergone formal validation. Crucially, as wearable technologies burgeon, penetrating various facets of health and lifestyle, a continued commitment to ethical considerations, data privacy and user autonomy remains imperative. The pursuit of enhancing accuracy and reliability should unfold alongside endeavours to nurture an ecosystem of ethical technology use, marked by transparency, user empowerment and a conscientious alignment with overarching health and societal objectives.