Validity of Estimating the Maximal Oxygen Consumption by Consumer Wearables: A Systematic Review with Meta-analysis and Expert Statement of the INTERLIVE Network

Background Technological advances have recently made possible the estimation of maximal oxygen consumption (VO2max) by consumer wearables. However, the validity of such estimations has not been systematically summarized using meta-analytic methods and there are no standards guiding the validation protocols. Objective The aim was to (1) quantitatively summarize previous studies investigating the validity of the VO2max estimated by consumer wearables and (2) provide best-practice recommendations for future validation studies. Methods First, we conducted a systematic review and meta-analysis of studies validating the estimation of VO2max by wearables. Second, based on the state of knowledge (derived from the systematic review) combined with the expert discussion between the members of the Towards Intelligent Health and Well-Being Network of Physical Activity Assessment (INTERLIVE) consortium, we provided a set of best-practice recommendations for validation protocols. Results Fourteen validation studies were included in the systematic review and meta-analysis. Meta-analysis results revealed that wearables using resting condition information in their algorithms significantly overestimated VO2max (bias 2.17 ml·kg−1·min−1; limits of agreement − 13.07 to 17.41 ml·kg−1·min−1), while devices using exercise-based information in their algorithms showed a lower systematic and random error (bias − 0.09 ml·kg−1·min−1; limits of agreement − 9.92 to 9.74 ml·kg−1·min−1). The INTERLIVE consortium proposed six key domains to be considered for validating wearable devices estimating VO2max, concerning the following: the target population, reference standard, index measure, testing conditions, data processing, and statistical analysis. Conclusions Our meta-analysis suggests that the estimations of VO2max by wearables that use exercise-based algorithms provide higher accuracy than those based on resting conditions. 
The exercise-based estimation seems to be optimal for measuring VO2max at the population level, yet the estimation error at the individual level is large, and, therefore, for sport/clinical purposes these methods still need improvement. The INTERLIVE network hereby provides best-practice recommendations to be used in future protocols to move towards a more accurate, transparent and comparable validation of VO2max derived from wearables. PROSPERO ID CRD42021246192. Supplementary Information The online version contains supplementary material available at 10.1007/s40279-021-01639-y.


Key Points
Wearables using exercise-based algorithms provide higher accuracy in the estimation of maximal oxygen consumption (VO2max) than those based on resting conditions.
Wearables using exercise-based estimation seem to be optimal for measuring VO2max at the population level, yet the estimation error at the individual level still needs further improvement.
In this article, the Towards Intelligent Health and Well-Being Network of Physical Activity Assessment (INTERLIVE) network provides best-practice recommendations to be used in future protocols to move towards a more accurate, transparent and comparable validation of VO2max derived from wearables.

Introduction
The use and development of wearable technology monitoring fitness and activity have grown exponentially over the last few years. In 2020, 396 million wearable units were shipped worldwide, and this is forecast to increase to 631.7 million units by 2024 [1]. Wearable devices give users the opportunity to monitor health-related metrics, such as daily steps, heart rate (HR), energy expenditure, or cardiorespiratory fitness, thereby promoting physical activity and optimizing health and sports performance [2,3]. Furthermore, the omnipresence of wearables enhances digital phenotyping at a population level, which offers valuable information about physical activity and fitness levels from around the world that can be used to guide global health promotion actions [2,4]. The most accepted measure of cardiorespiratory fitness is maximal oxygen consumption (VO2max), which has been shown to be a powerful marker of health and has recently been proposed as a clinical vital sign by the American Heart Association [5]. Furthermore, VO2max is widely known as a key indicator of endurance performance and, therefore, its measurement is of vital importance for sports performance in general [6]. The current guidelines for accurate testing of VO2max require measurement of gas exchange by indirect calorimetry, usually in a laboratory during an exercise test to exhaustion [7]. These tests require expensive equipment (e.g., gas analyzer) and trained technicians to collect and interpret the data, which makes VO2max assessments less feasible for risk prediction in clinical practice and unaffordable for most recreational athletes and for the general population. Indirect estimation of VO2max by submaximal field tests overcomes some of these disadvantages and offers acceptable estimations of VO2max [8,9]. However, the abovementioned digital era of consumer wearable devices opens new horizons for fitness monitoring without the need for laboratory or field testing.
In view of the enormous potential of these devices, wearable companies are making significant investments in research and development to provide valid fitness and activity measures, such as VO2max [10,11]. Previous systematic reviews have already assessed how well wearable devices estimate most of the health measures such as step count [12,13], HR [14,15], and energy expenditure [14,16]; however, to the best of our knowledge, no systematic review or meta-analysis focusing on the validity of the estimated VO2max is available. Furthermore, the current science behind the validation protocols of wearable devices suffers major limitations, mainly due to a lack of consensus and guidelines ensuring good practices [17,18]. This is precisely one of the main goals of the Towards Intelligent Health and Well-Being Network of Physical Activity Assessment (INTERLIVE) consortium, which is to develop best-practice protocols for the validation of consumer wearable fitness and activity measures. The INTERLIVE consortium has already published guidelines adapted to the nature of specific fitness/physical activity measures such as step count [19] and HR [20]. However, to date there are no specific standards guiding both manufacturers and the scientific community in the validation of estimating VO2max by consumer wearables.
Therefore, in this article, INTERLIVE had two main objectives: (1) to systematically summarize previous studies investigating the validity of VO2max as estimated by consumer wearable devices based on a meta-analysis, and (2) to provide best-practice validation recommendations based on the systematic review of the literature together with an evidence-informed INTERLIVE consortium discussion.

The INTERLIVE Network
INTERLIVE (https://www.interlive.org/) is a consortium composed of six universities (University of Lisbon [Portugal], German Sport University [Germany], University of Southern Denmark [Denmark], Norwegian School of Sport Sciences [Norway], University College Dublin [Ireland], and University of Granada [Spain]) and one technology company, Huawei Technologies (Finland). The consortium was founded in 2019 and strives towards developing best-practice protocols for evaluating the validity of consumer wearables with regard to the measurement of exercise/activity metrics. Moreover, INTERLIVE aims to increase awareness of the advantages and limitations of different validation methods and to introduce novel health and performance-related metrics, fostering a widespread use of physical activity indicators.

Expert Validation Process
The consortium followed the same process as was used previously [19,20]. First, we conducted a systematic review of the scientific literature on the studies validating VO2max estimated by consumer wearables against a reference standard (criterion measure). Second, the information obtained from the systematic review, together with previous related statements [17-21], was critically discussed within the consortium to provide guidelines and recommendations on how to conduct optimal validation protocols. Third, a set of key domains for best-practice recommendations was proposed based on the evidence-informed expert opinion of the INTERLIVE members.

Systematic Review and Meta-Analysis Process
This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses diagnostic test accuracy guideline. The protocol was registered in advance in the PROSPERO database (ID: CRD42021246192).

Data Sources and Search Strategy
PubMed, Web of Science, and Scopus databases were searched up to January 14, 2021. Members from the INTERLIVE network defined the search strategy, which can be found for replication in Supplementary Material 1 (see the electronic supplementary material). Additionally, a hand-search using the same search strategy was performed in Google Scholar to identify additional studies.

Inclusion and Exclusion Criteria
We considered studies meeting the following criteria: (1) any kind of population, (2) VO2max estimated through consumer wearable devices and measured with the reference standard (a graded exercise test to exhaustion with direct or indirect [gas analysis] calorimetry using a mode of test that involves large muscle groups), and (3) criterion validity studies.
We excluded studies following these criteria: (1) nonconsumer wearable devices (e.g., research-based accelerometers), (2) not original articles (e.g., reviews or editorials) and grey literature (e.g., meeting abstracts), and (3) articles validating new algorithms in the estimation of VO2max that are not yet incorporated in any commercial brand.

Study Selection
Two authors (PM-G and HLN) independently performed the title, abstract, and full-text screening of potential articles, and any discrepancy was resolved in a consensus meeting with a third author (MS). This systematic review process was performed using the Covidence software (www.covidence.org; Veritas Health Innovation).

Data Extraction
For each included article we extracted the following information: (1) author's name and publication year, (2) target population (e.g., healthy adults), sample size, and age range, (3) protocol used for the VO2max assessment via reference standard (e.g., indirect calorimetry), (4) gas analyzer brand used, (5) wearable device used, (6) protocol followed for the estimation of VO2max via wearable devices, and (7) statistical analysis used to test the validity of wearable VO2max against the reference standard. Two independent authors (PM-G and HLN) performed the data extraction, and any discrepancies were discussed until consensus was reached.

Risk of Bias
The Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) checklist was adapted and used to assess the risk of bias of included studies. The COSMIN checklist contains standards for evaluating the methodological quality of studies validating health measurement instruments [22], and it encompasses four domains: (1) participants included, (2) index measure (i.e., wearable device), (3) reference standard (i.e., indirect calorimetry), and (4) statistical analysis. Each domain contains several items with three possible answers ("yes," "unclear," and "no") according to the fulfillment of the criterion and, therefore, the presence or absence of bias (Supplementary Material 2; see the electronic supplementary material). According to the Risk of Bias 2 (RoB 2) criteria proposed by Cochrane [23], an article having at least one "no" or more than two "unclear" items was categorized as having "high risk" of bias; having one "unclear" item was categorized as "some concerns" in the risk of bias; and having all items answered as "yes" was categorized as "low risk" of bias. Two independent researchers (PM-G and AG) accomplished this process, and disagreements were discussed to reach a consensus including a third author (FBO).
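The categorization rule described above can be written out as a small function. This is only an illustrative sketch: the function name and return labels are hypothetical, and the case of exactly two "unclear" items is not covered by the rule as stated in the text, so the sketch reports it as unspecified rather than guessing.

```python
from collections import Counter

VALID_ANSWERS = {"yes", "unclear", "no"}


def classify_risk_of_bias(answers):
    """Map item-level answers to an overall rating, following the stated rule."""
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("at least one item answer is required")
    unknown = set(normalized) - VALID_ANSWERS
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")

    counts = Counter(normalized)
    # At least one "no" or more than two "unclear" items: high risk.
    if counts["no"] >= 1 or counts["unclear"] > 2:
        return "high risk"
    # Exactly one "unclear" item: some concerns.
    if counts["unclear"] == 1:
        return "some concerns"
    # All items answered "yes": low risk.
    if counts["unclear"] == 0:
        return "low risk"
    # Exactly two "unclear" items: not defined by the rule as written (assumption).
    return "not specified by the stated rule"


print(classify_risk_of_bias(["yes", "yes", "unclear", "yes"]))
```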

Meta-Analysis
We identified two main methodologies to estimate VO2max through wearable devices: (1) resting methodologies, which evaluate users lying in a supine position and/or standing still, and (2) exercise-based methodologies, which evaluate users while performing physical activity. Therefore, we performed and reported the meta-analysis separately for these two methods: the resting and exercise tests. The bias of the estimation of VO2max by the wearables (i.e., the mean difference between the wearable and the reference standard) and the standard errors of this bias in all included studies were used to calculate the pooled bias and its 95% confidence interval (CI) for both the resting and exercise test. A negative bias represents an underestimation of the wearable VO2max relative to the reference VO2max, while a positive value represents an overestimation. The Higgins I² statistic and P value were used to test the heterogeneity of included studies, which was classified as not important (0-40%), moderate (30-50%), substantial (50-75%), or considerable (75-100%) [24]. Due to the presence of considerable heterogeneity in both meta-analyses (Higgins I² = 77% and 88% in resting and exercise test, respectively), we used a random-effects model of the inverse variance method. Klepin et al. [25] averaged the gas exchange data every 15 and 60 s, and we selected the 15 s time averaging according to previous recommendations [26]. Two studies examined the wearable validity separately in men and women [27,28], and we maintained this division when including the data in the meta-analysis. There were five studies [29-31] that did not report the bias to test the validity or reported it in plots. Therefore, validity was estimated from correlation coefficients between the wearable and reference VO2max, as suggested elsewhere [32], or extracted from plots through the WebPlotDigitizer software (Ankit Rohatgi, website: https://automeris.io/WebPlotDigitizer/), which has demonstrated an excellent validity and reliability in extracting graphed data [33].
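To make the pooling step concrete, the following is a minimal sketch of an inverse-variance random-effects model with the DerSimonian and Laird estimate of between-study variance and the Higgins I² statistic. It is not the Review Manager implementation used in this article; the function name is hypothetical and the input values are invented purely for illustration.

```python
import math


def random_effects_pool(biases, std_errors):
    """Pool study-level biases with a DerSimonian-Laird random-effects model."""
    if len(biases) != len(std_errors) or len(biases) < 2:
        raise ValueError("need matching lists with at least two studies")

    # Fixed-effect (inverse-variance) weights and estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    fixed = sum(wi * b for wi, b in zip(w, biases)) / sum(w)

    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, biases))
    df = len(biases) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Higgins I^2, expressed as a percentage.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights, pooled bias, and 95% confidence interval.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * b for wi, b in zip(w_star, biases)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return {"pooled_bias": pooled, "ci95": ci, "tau2": tau2, "i2": i2}


# Hypothetical biases (ml/kg/min) and standard errors, not taken from the included studies.
print(random_effects_pool([0.0, 4.0], [1.0, 1.0]))
```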
The framework for the meta-analysis of Bland-Altman studies proposed by Tipton and Shuster [34] was used to obtain a pooled limit of agreement in both the resting and exercise test, which was calculated with the following formula: $\delta \pm 2\sqrt{\sigma^2 + \tau^2}$, where $\delta$ is the average bias across studies, $\sigma^2$ is the average within-study variation in differences, and $\tau^2$ is the variation in bias across studies [34]. The weighted least-squares models from the abovementioned random-effect meta-analysis were used to estimate $\delta$ and $\sigma^2$, while the DerSimonian and Laird procedure was used to estimate $\tau^2$ [35]. The R code provided in the study of Tipton and Shuster [34] was used to conduct all these analyses with the RStudio statistical program.
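As a worked illustration of the pooled limits-of-agreement formula, the sketch below evaluates the average bias plus or minus two times the square root of the summed within-study and between-study variances. The function name is hypothetical and the input values are invented for the example; the analyses in this article were run with the R code of Tipton and Shuster [34], not with this snippet.

```python
import math


def pooled_limits_of_agreement(delta, sigma2, tau2):
    """Return (lower, upper) pooled limits: delta +/- 2 * sqrt(sigma2 + tau2)."""
    if sigma2 < 0 or tau2 < 0:
        raise ValueError("variances must be non-negative")
    half_width = 2.0 * math.sqrt(sigma2 + tau2)
    return delta - half_width, delta + half_width


# Hypothetical inputs: average bias 1.0, within-study variance 9.0, between-study variance 7.0.
lower, upper = pooled_limits_of_agreement(1.0, 9.0, 7.0)
print(lower, upper)  # -7.0 9.0
```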
Three sensitivity analyses were performed: (1) to test the robustness of the results, (2) to evaluate the presence of publication bias, and (3) to divide the meta-analyses results into those studies using photoplethysmography (PPG) technology to assess HR versus those using chest straps. For the robustness analysis, studies were removed one at a time and we tested whether the overall effect size (i.e., z score and P value) was significantly modified in magnitude or direction. The publication bias was assessed by a funnel plot and the Egger regression asymmetry test, with a significance level of P < 0.100 [36]. The meta-analysis was repeated in the two following conditions: (1) splitting the results into studies using PPG and chest straps to measure HR and (2) including only studies from the last 3 years. Thus, we tested the impact of the different types of HR recordings (PPG vs. chest straps) and of old articles testing obsolete devices on the error estimates.
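The leave-one-out robustness check can be sketched as follows. This is a simplified illustration, not the procedure run in Review Manager: the default pooling function is a plain inverse-variance (fixed-effect) mean chosen for brevity, whereas the article used a random-effects model, and all names and values are hypothetical.

```python
def inverse_variance_pool(biases, std_errors):
    """Fixed-effect inverse-variance pooled bias (simplifying assumption)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * b for w, b in zip(weights, biases)) / sum(weights)


def leave_one_out(biases, std_errors, pool=inverse_variance_pool):
    """Re-pool the studies once per study, each time omitting that study."""
    biases, std_errors = list(biases), list(std_errors)
    if len(biases) != len(std_errors) or len(biases) < 2:
        raise ValueError("need matching lists with at least two studies")
    results = []
    for i in range(len(biases)):
        kept_biases = biases[:i] + biases[i + 1:]
        kept_errors = std_errors[:i] + std_errors[i + 1:]
        results.append(pool(kept_biases, kept_errors))
    return results


# Hypothetical study-level biases and standard errors.
print(leave_one_out([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [2.5, 2.0, 1.5]
```

A large shift in the pooled value when a single study is omitted would flag that study as influential.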
The meta-analysis was performed using the Review Manager Version 5.3 (The Nordic Cochrane Center, The Cochrane Collaboration, 2014, Copenhagen, Denmark), and the limit of agreement meta-analyses were performed using the RStudio statistical program (version 1.4.1106, R Core Team 2020; R Foundation for Statistical Computing, Vienna, Austria; https://www.R-project.org/).

Summary of the Included Studies in the Systematic Review
The flow chart (Fig. 1) shows that among the 1224 nonduplicated studies initially included, 1189 were excluded after the first screening of title and abstract and another 27 were further excluded after the full-text screening. Consequently, 14 articles meeting the inclusion criteria were included in the systematic review and the meta-analysis; eight and eight studies reporting on the validity of an exercise-based and resting state-based methodology, respectively, were included. Table 1 summarizes the main information extracted from the 14 included studies, including a total of 403 participants. The risk of bias assessment of included studies is reported in Fig. 2 and Supplementary Material 3 (see the electronic supplementary material). The overall risk of bias assessed across all domains was deemed to be "some concerns" for three (21%) and "high" for 11 (79%) of the 14 studies included.

Validity of the VO2max Estimated by Wearables: Meta-Analysis
The forest plots with the pooled bias between the reference VO2max and the wearable estimation are presented in Fig. 3 for both the wearables using the resting methodology and the exercise test. 16.61). Therefore, the difference in limits of agreement was smaller by 5.4 ml·kg−1·min−1 in exercise tests compared to the resting conditions. The limits of agreement in the different studies using the resting conditions ranged from ± 17.75 [40] to ± 38.97 ml·kg−1·min−1 [41], while it spanned from ± 11.18 [42] to ± 23.53 ml·kg−1·min−1 [25] in the exercise tests. Lastly, studies using PPG technology in the HR recording had a greater span of the limits of agreement in comparison with those using chest strap in the exercise tests.

The Current State of Knowledge in Validation Protocols Relevant to Inform Best-Practice Recommendations
Similar to the previous statements of the INTERLIVE consortium [19,20], we present and discuss the information found in these studies divided into the six key domains to take into consideration when designing validation protocols of consumer wearables estimating VO2max (Fig. 5).

Target Population
The total sample size studied was 403 participants (218 men and 185 women), with a mean sample per article of 29 participants. For future validation studies, we recommend performing a priori sample size calculation following the approach by Lu et al. [43], which uses the Bland-Altman limit of agreement analysis. The required sample size to obtain a power of 80-90% is calculated considering the expected mean absolute difference between the index measure and the reference standard, the expected SD of this difference, and the maximum allowed difference predefined by the researchers. It is advised to conduct a pilot study to obtain this information directly from the devices to be validated. If this is not feasible, our meta-analysis reveals that the expected mean absolute difference in the resting conditions is 2.30 ml·kg−1·min−1 and the expected SD is 7.20 ml·kg−1·min−1, whereas the expected mean absolute difference in the exercise test is 1.32 ml·kg−1·min−1 and the expected SD is 4.03 ml·kg−1·min−1. Regarding the maximum allowed difference, there is no agreement on this size with respect to relevance for performance, health promotion, or clinical practice. In the second paragraph of the "Discussion" section, we discuss the potential meaningfulness of the estimation errors by wearables considering previous meta-analyses on VO2max changes and mortality risk. However, it is important to know that this maximum allowed difference must be greater than the expected mean difference ± 1.96 × the expected SD. Thus, considering our meta-analysis results, these values should be at least 16.41 and 9.22 ml·kg−1·min−1 in the resting conditions and exercise test, respectively. Raising the sample size will not affect the estimated size of the limit of agreement but will provide greater precision (i.e., tighter confidence bands around the limit of agreement).
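The two lower bounds quoted above follow directly from the expected mean absolute difference plus 1.96 times the expected SD. The short sketch below reproduces that arithmetic with the values reported in this section; the function name is hypothetical and the snippet does not implement the full sample size procedure of Lu et al. [43].

```python
def min_max_allowed_difference(mean_abs_diff, sd, z=1.96):
    """Smallest value the predefined maximum allowed difference must exceed."""
    return mean_abs_diff + z * sd


resting = min_max_allowed_difference(2.30, 7.20)   # resting conditions
exercise = min_max_allowed_difference(1.32, 4.03)  # exercise test
print(round(resting, 2), round(exercise, 2))  # 16.41 9.22
```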
Participants from the included studies were adults with a pooled age of 24.6 ± 5.7 years. However, children, adolescents and older adults also use these wearable devices in real life, and, therefore, we recommend that future validation  [28] showed opposite results, with a greater error in men compared to women. We suggest that future studies test whether the validity of existing methods/algorithms systematically differs according to sex.
In the risk of bias assessment, we identified that the majority of articles (10 of 14) adequately delimited the target population they wanted to study and nearly all participants contributed with data to be included in the validity analysis.
Participants from the included studies were all physically active people categorized as "healthy" or "active," recreational runners [29,44] or soccer players [40]. In order to have a wider representation of the general population, VO2max estimations from consumer wearables should be tested in further populations such as older adults, individuals with more sedentary behaviors, individuals with overweight/obesity, or highly trained athletes. We, therefore, recommend expanding the population included beyond healthy young people (e.g., from very untrained sedentary people to highly trained athletes), as well as clearly defining and reporting the inclusion/exclusion criteria used to define these target populations.
Fig. 3 Pooled bias and SE for wearables VO2max using resting conditions (A) and exercise tests (B) relative to the reference standard. A negative bias represents an underestimation and a positive bias an overestimation of the VO2max estimated from wearables in comparison to the reference standard. CI confidence interval, SE standard error, VO2max maximal oxygen consumption. *Heart rate was measured with chest strap. In the remaining articles not flagged with an asterisk, heart rate was measured using photoplethysmography technology on the wrist

Reference Standard
All studies included indirect calorimetry through gas analysis as a reference standard of VO2max, as was previously recommended [45]. In brief, indirect calorimetry measures VO2 and VCO2 concentrations and calculates the respiratory exchange ratio (RER), allowing for the obtainment of VO2max while exercising [45]. The gas analysis systems used were reported in all studies, where Parvo Medics was the most popular brand, used in ten studies [27-31, 37, 38, 40, 44, 46], followed by Cosmed [25,47] and Metalyzer [39,42], with two studies each. Although the validity and reliability of indirect calorimetry systems may seem obvious, available devices are not always reliable [48,49] and only one of the included studies provided a reference with regards to the validity within the study [29]. Similarly, only two studies included in this review specified whether the gas exchange was recorded breath by breath [39,42]. Furthermore, none of the included articles reported whether the gas analyzer used both VO2 and VCO2 for VO2max assessment, even though it is known that systems without CO2 sensors decrease the precision and should be treated with caution [50]. Lastly, four studies [39,42,44,47] did not clarify whether the device was calibrated [45], and we recommend that a proper calibration process according to the manufacturer's instructions be performed before the VO2max assessment. We urge authors and developers to improve transparent reporting by including at a minimum the brand used, the type of recording technology (e.g., breath by breath or mixing chamber), and previous validity/reliability of the instruments.
Fig. 4 Bland-Altman meta-analysis for the comparison of wearable-derived VO2max using resting conditions and exercise tests with the reference VO2max. The y-axis is the bias between the wearable and reference VO2max (wearable − reference), with positive values indicating an overestimation and negative values an underestimation by the wearable. The x-axis is the mean VO2max between the wearable and reference. CI confidence interval, VO2max maximal oxygen consumption. *Heart rate was measured with chest strap. In the remaining articles not flagged with an asterisk, heart rate was measured using photoplethysmography technology on the wrist
Three out of the 14 included studies did not follow an ecological validity procedure [28,29,44], defined as a validation process that resembles the use of the device in the consumer's real life. Two of the studies introduced bias when including the setup information, an aspect that will be discussed in the "Testing Protocols and Conditions" section [28,44], while one study did not place the device in an ecological manner according to the manufacturer's instructions [29]. Regarding the ecological placement, Anderson et al. [29] fixed the device to the wrist with additional tape, and this is not recommended since it may artificially improve the precision of the HR readings through PPG, biasing the validity of the device in ecological settings. Overall, we recommend that wearable devices be worn on ecological body locations in accordance with the manufacturer's instructions, and this location should be adequately described within the methods. If multiple wrist-worn devices are being tested, a maximum of two devices per wrist should be used at the same time, with placement being randomly counterbalanced between participants.
Apart from the wrist-worn wearables, nine devices incorporated a chest strap to record HR during the VO2max estimation [28,30,37,38,40,44,47]. Chest-strap technology has been the most used method for HR monitoring in the past. Moreover, it is widely accepted as a valid and reliable method to measure HR in free-living conditions, but it presents limitations in 24 h recording over multiple days. Many recent wearables can measure HR at the wrist using PPG technology, which allows longer recording time and a more comfortable measurement because no additional device (e.g., chest strap) is needed alongside the wrist bracelet. A recent meta-analysis has also revealed an acceptable validity of the PPG technology during treadmill running and walking (mean difference − 0.51 bpm; 95% CI − 1.60 to 0.58 bpm), yet an underestimation when performing endurance sports (mean difference − 7.26 bpm; 95% CI − 10.46 to − 4.07 bpm) [52]. Therefore, the type of HR measurement is relevant and should be reported in the validation protocols. Future research is necessary to determine whether the VO2max estimation is more accurate using the HR obtained by PPG or chest strap. Furthermore, the validity of HR measures from wearables should be tested before being used in the VO2max estimation following the recently published recommendations by the INTERLIVE consortium [20].

Reference Standard
All of the included studies tested VO2max in laboratory conditions. The two previous expert statements of the INTERLIVE consortium on step count and HR provided recommendations for semi-free-living and free-living conditions besides the laboratory setting to test the ecological validity [19,20]. However, reference VO2max is still recommended to be performed in laboratory conditions, and, therefore, the free-living and semi-free-living conditions do not apply in this context. Regarding the type of activity, all included studies applied treadmill running protocols. It is known that running protocols may provide small differences in VO2max in comparison to cycle protocols [53], and, therefore, our recommendation is to incorporate protocols that are as close as possible to the type of activity for which the consumer wearable has been designed.
With regard to the work rate progression, some protocols gradually increased the speed [25,39], the treadmill inclination [27,42,46], or both intensity conditions within the protocol [28-31, 40, 41, 44, 47, 51]. Five studies used ramp protocols [25,27,39,42,46] in which work rate increases more gradually (e.g., each 30-60 s), while the remaining studies included blocks of 2 [44] or 3 min [28-31, 37, 40, 47, 51]. It seems that VO2max does not vary whether treadmill inclination or speed increase is used [53]. Likewise, the use of a ramp versus a more accentuated increase in the work rate does not affect the VO2max measure, although each progression has pros and cons depending on the target population and whether treadmill or cycle ergometer is used [54]. We recommend selecting an appropriate work rate progression according to the type of population in which the consumer wearable is intended to be validated and the selected physical activity (e.g., running or cycling).
Maximal graded exercise testing requires participants to terminate the test at volitional fatigue, and accepted criteria exist to ensure that maximal VO2 during the test was reached. For more information, we refer readers to chapter 4 of the American College of Sports Medicine's (ACSM's) Guidelines for Exercise Testing and Prescription, in which a detailed description of test termination criteria can be found [7]. Among the included studies, five did not consider at least two maximum-effort criteria apart from voluntary exhaustion and are likely to have measured VO2peak instead of VO2max [25,30,31,39,44]. In recent years, an alternative/complementary solution named "verification phase" has been proposed, which includes an extra effort lasting between 2 and 3 min at a supramaximal work rate (i.e., 110% of maximum power) after the test termination to corroborate the results [55]. This approach was only followed by Freeberg et al. [46] and may be an interesting method to use in future validation protocols.
A maximal graded exercise test normally requires several standardized conditions to ensure that the participants reach their true VO2max. Five out of the 14 included articles considered at least some of these standardized conditions before the exercise testing [27, 29, 38-40], whereas the remainder did not report this information. The INTERLIVE consortium recommends taking into account the following standardized conditions when measuring the VO2max reference standard: caloric intake, caffeine or alcohol consumption, intensive sports activities, medications, and an appropriate warm-up (e.g., 5-10 min of light-intensity aerobic exercise and dynamic stretching) before commencing the exercise test [7,53].

Wearable Device
Included studies that estimated VO 2max from a resting test were Polar devices and the test used was the patented "Polar fitness test" [56]. Polar devices record the resting HR and heart rate variability (HRV) via Polar chest strap or the PPG technology incorporated into the device and use these data to estimate VO 2max [57]. This protocol slightly differed based on the wearable model, but always ranged from 5 to 10 min in a supine position (e.g., Polar A300, FT40, and F6), while only one of the included models additionally added a few minutes in a standing position (e.g., Polar V800). On the other hand, only Garmin and Fitbit were the brands that used exercise testing. The Fitbit exercise test consists of a run at a comfortable pace for at least 10 min while the GPS is being recorded [58]. Garmin devices offer different methods to estimate VO 2max depending on three types of activity: running, cycling, or walking [59]. However, only the running protocol was used in all studies included in this review [28-30, 42, 44], requiring a run of at least 10 min, while recording the GPS signal and HR data (through PPG technology or chest strap). Garmin's instructions recommend an intensity of at least 70% of the user's maximal HR for the entire exercise, which can be either estimated or manually input by the user [59]. Overall, we recommend researchers systematically follow the manufacturer's recommendations when estimating VO 2max from the wearable device among study participants.
Some of the included wearable devices require a previous setup in which personal data such as age, sex, height, weight, or physical activity level are recorded to improve the accuracy of the VO2max estimation. Only two of the included studies did not specify whether setup information was input prior to commencing the validation protocol [39,46], while the remainder of the studies recorded some basic information. As a general recommendation, all the setup information required by the device should be entered and reported, and this should be similar to the information customers would have available outside of a research context. For instance, both Snyder et al. [28] and Carrier et al. [44] entered the maximum heart rate (HRmax) obtained from the reference standard test into the consumer wearables, which is not ecologically valid since few users have HRmax data from a maximal graded exercise test in laboratory conditions.

Reference Standard
Indirect calorimetry with either mixing-chamber or breath-by-breath technology requires several decisions on data processing while conducting VO2max tests. A major factor for removing variability in indirect calorimetry is the time and breath averages used to estimate VO2max. Only three [25,27,46] of the studies included in this review reported this relevant information. Following the recommendations of Robergs et al. [26], time averages of between 15 and 30 s and 15-breath running averages should be used to obtain a reasonable reduction in data variability without losing relevant physiological information. For researchers implementing digital filters, a low cut-off frequency of 0.04 Hz is recommended [26].
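As an illustration of the two averaging strategies described above, the following minimal Python sketch applies a fixed time-window average and a 15-breath running average to breath-by-breath data. The function names, the input layout (one timestamp in seconds and one VO2 value per breath), and the convention of taking VO2max as the highest averaged value are assumptions made for this example; they are not taken from Robergs et al. [26] or from any metabolic cart software.

```python
from statistics import mean


def time_average(times_s, vo2, window_s=30.0):
    """Average VO2 within consecutive, non-overlapping time windows."""
    bins = {}
    for t, v in zip(times_s, vo2):
        # Breaths are grouped by the index of the window they fall into.
        bins.setdefault(int(t // window_s), []).append(v)
    return [mean(bins[k]) for k in sorted(bins)]


def breath_running_average(vo2, n_breaths=15):
    """Running average over the most recent n_breaths breaths."""
    return [
        mean(vo2[i - n_breaths + 1 : i + 1])
        for i in range(n_breaths - 1, len(vo2))
    ]


def highest_time_average(times_s, vo2, window_s=30.0):
    """Highest window average, used here as the VO2max value (assumed convention)."""
    return max(time_average(times_s, vo2, window_s))
```

Whichever strategy is chosen, the window length (15 to 30 s) or the number of breaths should be reported alongside the resulting VO2max.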

The Time Interval Between Evaluations
With regard to wearable devices, modifying data processing is not possible, since the wearables directly compute the VO2max using algorithms that are usually proprietary and whose exact equations are not disclosed. An important consideration, however, is the time interval between both assessments, since fatigue after the maximal exercise test may affect the wearable VO2max estimation. Since the resting methodology is conducted in resting conditions, these wearable protocols can be performed before the reference standard protocol without influencing either test. The tests should not be performed in the opposite order, since the maximal test required for the reference standard could affect the resting HR or HRV. Concerning the wearable estimations based on the exercise test, 24-48 h between tests is recommended to ensure optimal recovery from high-intensity exercise and to avoid associated muscle fatigue hampering performance [60]. Furthermore, randomizing or counterbalancing the order of the wearable and laboratory tests is important to control for potential carryover effects. Five of the included studies in this review either did not meet this time-interval criterion or did not report any information [25,28,29,39,42], and none mentioned any randomization or counterbalancing strategy, which is an aspect to consider in future validation studies.
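For exercise-based wearable protocols, the counterbalancing recommended above can be implemented with a simple randomized allocation of test order. The sketch below is one possible approach and not a procedure used by any of the included studies; the function name, the order labels, and the fixed seed are assumptions made for illustration.

```python
import random


def counterbalanced_orders(participant_ids, seed=0):
    """Randomly allocate participants so that the two test orders are balanced."""
    orders = ("wearable_first", "laboratory_first")
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    rng.shuffle(ids)
    # Alternating over the shuffled list gives group sizes that differ by at most one.
    return {pid: orders[i % 2] for i, pid in enumerate(ids)}
```

The 24-48 h interval between the two tests still applies regardless of the order assigned to each participant.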

Statistical Analysis
The Bland-Altman limits of agreement analysis is the most popular method used in validation studies and has been widely accepted as the most appropriate type of statistical analysis in these types of studies [61,62]. In brief, Bland-Altman analysis provides both the systematic error (i.e., bias or average difference between methods) and the random error or precision (i.e., 95% limits of agreement around the systematic error), thus providing valuable information for the comparison of the wearable devices to the reference standard. The lower and upper bounds of the limits of agreement provide an interval within which 95% of future observations of the differences in VO2max between the wearable device and a criterion reference assessment are expected to fall. In addition, the Bland-Altman plots represent the individual differences between methods against the mean of the methods, providing visual information on other relevant dimensions of agreement, such as heteroscedasticity (a trend for the error between methods to increase or decrease as the magnitude of the measurement increases). Additionally, percentage error measures, such as the mean absolute percentage error (MAPE), represent a helpful option to report the error of the device in an easy-to-understand manner [63]. Therefore, we recommend reporting percentage error measures as a complement to the limits of agreement analysis. In the risk of bias assessment, we detected that five studies did not apply an appropriate analysis of agreement between the wearable devices and the reference standard, since they only performed mean difference analyses (t test or analysis of variance [ANOVA], without reporting the limits of agreement or the Bland-Altman plots) or Pearson correlation analyses [27, 29-31, 47, 51].
Among the statistical tests used, Bland-Altman [25,28,37,39,40,42,44,46], t test [27, 29-31, 37-39, 44], and Pearson's r [27-29, 31, 37, 44, 46, 47] were the most popular tests, with eight studies using each of these analyses, followed by MAPE in five studies [25,39,40,44,46] and intraclass correlation coefficient [39,42,46] or ANOVA [28,46,47] in three studies each.
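To make the two recommended measures concrete, the following minimal Python sketch computes the Bland-Altman bias with 95% limits of agreement and the MAPE for paired wearable and reference VO2max values from a single study. It assumes approximately normally distributed differences (hence the 1.96 multiplier) and one pair of measurements per participant; the function names are illustrative, and the sketch does not reproduce the pooling method used in the meta-analysis.

```python
from statistics import mean, stdev


def bland_altman(wearable, reference):
    """Return (bias, lower limit, upper limit) of the 95% limits of agreement."""
    diffs = [w - r for w, r in zip(wearable, reference)]
    bias = mean(diffs)   # systematic error
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def mape(wearable, reference):
    """Mean absolute percentage error relative to the reference standard."""
    errors = [abs(w - r) / r for w, r in zip(wearable, reference)]
    return 100.0 * mean(errors)
```

A positive bias here means that the wearable overestimates VO2max relative to the reference standard, which matches the sign convention used for the pooled results in this article.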
The last point to consider is the contextual validity of wearable devices in estimating VO2max, which should be considered within the statistical analysis. For instance, if a wearable device is designed to monitor VO2max changes that improve users' health, the systematic and random errors should be critically analyzed to ensure that the device is capable of detecting individual changes, which are considered clinically significant in the scientific literature. We have already proposed in the "Methods" section that 3.5 and 1.75 ml·kg−1·min−1 might be potential thresholds since both are normal VO2max changes in the general population and have been associated with health improvements. Therefore, companies should report the level of error in a transparent manner according to the purpose of the device and the target population. This would guide researchers in the statistical analysis and the interpretation of the results.

Recommended Validation Protocol
Based on the abovementioned state of knowledge and the critical discussion between the members of the INTERLIVE consortium, we present best-practice recommendations for validation protocols of VO2max derived from consumer wearable devices in Table 2. Furthermore, a checklist is provided in Table 3, including the items to be considered when planning validation protocols of VO2max consumer wearables. A graphical overview of the six domains to consider in these validation protocols is presented in Fig. 5.

Discussions, Future Directions, and Statement
In the present article, we combined a systematic review and meta-analysis with an expert statement aiming (1) to provide a summary of the validity of VO2max estimations by consumer wearables that use different methods/algorithms and (2) to provide recommendations for future validation studies. Our meta-analysis suggests that consumer wearables using exercise tests provided a more accurate estimation of VO2max than consumer wearables using resting tests. Overall, the wearables using exercise tests to estimate VO2max had a systematic error close to zero (− 0.09 ml·kg−1·min−1) in comparison to maximal graded exercise tests using indirect calorimetry in laboratory conditions. However, the random error observed in both types of methods was still large, i.e., limits of agreement spans of ± 15.24 (95% CI − 22.18 to 26.53) and ± 9.83 (95% CI − 16.79 to 16.61) ml·kg−1·min−1 for the resting and exercise tests, respectively. Consequently, even though this random error was markedly smaller in the exercise-based estimations, it is still a large error when estimating VO2max at an individual level.
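The ± spans quoted above correspond to half the width of the pooled limits of agreement. As a quick arithmetic check, the short Python sketch below recomputes them from the pooled limits reported in the meta-analysis (− 13.07 to 17.41 ml·kg−1·min−1 for resting tests and − 9.92 to 9.74 ml·kg−1·min−1 for exercise tests); the function name is an illustrative choice.

```python
def loa_half_width(lower, upper):
    """Half the distance between the lower and upper limits of agreement."""
    return (upper - lower) / 2


# Pooled limits of agreement reported in the meta-analysis (ml/kg/min).
resting_span = loa_half_width(-13.07, 17.41)
exercise_span = loa_half_width(-9.92, 9.74)

print(f"resting:  +/- {resting_span:.2f}")   # resting:  +/- 15.24
print(f"exercise: +/- {exercise_span:.2f}")  # exercise: +/- 9.83
```

Both half-widths are several times larger than the 1.75 and 3.5 ml·kg−1·min−1 changes discussed in this article, which is why the random error matters at the individual level.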
We are unaware of any well-established and accepted estimation error that clearly indicates whether the validity of a wearable is acceptable or not. Our aim here was to inform the public about the observed estimation errors based on existing literature. It is ultimately up to the users to consider whether the error is good enough for their specific purposes. To put into context the potential meaningfulness of the estimation errors observed in VO2max, we need to consider that previous meta-analyses have reported that increases in VO2max of 1.75-3.5 ml·kg−1·min−1 are associated with a lower risk of all-cause mortality and incidence of coronary heart disease or cardiovascular disease [5,64]. Therefore, wearables with systematic and random estimation errors beyond 3.5 ml·kg−1·min−1 may miss clinically relevant changes. Reliability is also an important concept for understanding the quality of the wearable estimates; however, only three of the included studies evaluated it [40,41,47]. Overall, good test-retest reliability of wearable VO2max has been reported, with r and intraclass correlation coefficient (ICC) values above 0.90, but further studies using a more appropriate approach (i.e., Bland-Altman limits of agreement) are needed to confirm that wearable VO2max is reliable. Given the lack of evidence regarding reliability, caution should be exercised when wearables are used for testing individual changes for either research, clinical, or sports purposes. On the other hand, the estimation errors of the exercise-based algorithms at the group level show a high level of accuracy. This fact allows digital phenotyping of cardiorespiratory fitness using wearables at a population level, which opens new opportunities for fitness monitoring at regional, national, or global levels.
We cannot determine the number of people for which the exercise-based algorithms are accurate, but considering that our results come from 244 participants, this sample size can serve as a provisional population cutoff point.
In order to better understand the different errors observed in the two types of estimation methods, it is important to discuss how the different brands estimate VO2max through different methodologies. Polar devices use resting HR, HRV, gender, age, height, body weight, and self-reported physical activity to estimate VO2max. The company explains in a white paper that they used data from several validation studies to develop an artificial neural network that calculates VO2max through the fitness test [65]. They claim that the mean error of the prediction varies between 8% (3.7 ml·kg−1·min−1 approximately) and 15% compared with a laboratory test. Our results reveal an acceptable systematic error of 2.17 ml·kg−1·min−1, but an overly wide random error span of ± 15.24 ml·kg−1·min−1 (a total limits of agreement width of 30.48 ml·kg−1·min−1). Polar claims the main benefit of the Polar fitness test is that it is "easy, safe and convenient for setting a baseline and tracking relative progress" [57]. We agree that a test in resting conditions is very convenient, feasible, and safe and, therefore, a good solution when more valid methods are not feasible. However, based on the wide random error observed in the meta-analysis, we would not [...]

A verification phase after the maximal test is recommended to compare both VO2max results. Schaun [55] provides an update of the literature on how to perform this verification phase.
Any type of exercise testing is accepted (e.g., walking, running, or biking) as long as it adapts to the type of activity in which the consumer wearable is intended to be validated.
In populations unable to perform a maximal test, submaximal exercise-based equations might be an alternative to predict VO2max, since overall these have demonstrated a moderate to strong relationship with maximal tests. However, authors should select the most appropriate equation for their target population [9,70].
Report whether a maximal or submaximal exercise test is being used.
In the case of a submaximal test, provide a rationale for its implementation and specify the exercise-based equations used.
In a maximal exercise test, report the need for reaching volitional fatigue and indicate the maximum-effort criteria included (at least two criteria).
Report the type of exercise testing used as well as its characteristics (e.g., increase in the ramp inclination in treadmill tests or power increase in cycle-ergometer tests).

[...] [29,44,46]. This method uses the following calculation steps [66]: (1) logging of personal information (at least age), (2) an exercise test with the wearable measuring HR and speed, (3) HR data are segmented into different zones and the reliability of these segments is calculated, and (4) the most reliable data segments are used to estimate VO2max by using the linear or nonlinear dependency between HR and speed data. The white paper published by Firstbeat stated that this estimation had 5% MAPE for running, 8% for cycling, and 6% for walking against indirect calorimetry VO2max in laboratory settings [66]. Four studies in this systematic review reported MAPE analyses of Fitbit and Garmin devices in running tests [25,39,44,46], and results were always greater than the 5% reported by Firstbeat, with values ranging from 8 to 10.2%. There are no standard thresholds to determine an optimal MAPE, but previous validity studies of consumer-based wearables considered ≥ 10% as an indicator of inaccuracy, which is close to the values found in the exercise protocols [67]. Although the systematic error we found in the meta-analysis for these wearables using exercise tests is negligible (i.e., − 0.09 ml·kg−1·min−1), the random error span of ± 9.83 ml·kg−1·min−1 represents a considerable range that may make these devices inappropriate for adequately assessing and monitoring VO2max changes.
Nevertheless, this estimation methodology is clearly superior to the resting approach, with 2.08 and 10.82 ml·kg−1·min−1 less systematic error and random error (total limits of agreement width), respectively. After removing articles published before 2017, the resting condition demonstrated an improvement in accuracy of 0.51 ml·kg−1·min−1. This analysis supports the notion that new devices and/or algorithms are providing more accurate estimates. Even so, the results of this article should encourage developers to opt for exercise methodologies for a more accurate VO2max estimation.
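The exact exercise-based algorithms are proprietary, so the following Python sketch is only a simplified illustration of step (4) of the calculation steps described above: a linear dependency between HR and running speed is fitted and extrapolated to the maximal HR. It is not the Firstbeat implementation. The conversion from speed to oxygen cost uses the ACSM metabolic equation for level running (3.5 + 0.2 × speed in m/min), and the maximal HR is supplied by the caller; both choices, like the function names, are assumptions made for this example.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_vo2max(hr_bpm, speed_m_per_min, hr_max):
    """Extrapolate the HR-speed line to hr_max and convert speed to VO2."""
    slope, intercept = fit_line(hr_bpm, speed_m_per_min)
    speed_at_hr_max = intercept + slope * hr_max
    # ACSM level-running equation (assumed here): VO2 = 3.5 + 0.2 * speed (m/min).
    return 3.5 + 0.2 * speed_at_hr_max
```

Because the result depends directly on the maximal HR value, this sketch also illustrates why entering a laboratory-measured HRmax into a device, rather than an estimated one, can change the VO2max estimate.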
This article has detected several weaknesses in the validation process, which highlights the need for further and more rigorous studies. Future validation studies should consider the best-practice recommendations provided in this article by the INTERLIVE consortium in the six main domains. Our review has detected that the validity of wearables has been tested only in healthy and physically active people with a narrow age range (i.e., 25 ± 6 years). A recent systematic review identified several determinants of cardiorespiratory fitness such as sex, age, education, socioeconomic status, ethnicity, body mass index (BMI), body weight, waist circumference, body fat, resting HR, C-reactive protein, smoking, alcohol consumption, and physical activity level [68]. Future validity studies should include participants across the spectrum of some of these influencing factors to determine how the wearable VO2max performs in different populations. Moreover, the reference standard and its associated protocol and data processing were, without a doubt, the most critical point in terms of risk of bias in the included studies. Therefore, future studies should improve the indirect calorimetry protocols used according to the current exercise testing guidelines.
Regarding the wearable devices, greater transparency from companies regarding not only the algorithms but also the data used to estimate VO2max would be desirable (yet limited by proprietary issues). This would help researchers to better control variables during validation protocols. For instance, if running speed and inclination are used in the estimation, then the quality of the GPS signal, track maps, and altimeter sensors should be key components to consider in validation studies. HR seems to provide key data in the VO2max estimation, and a large proportion of the consumer wearables in this review used a chest strap for the HR measurement instead of PPG. Overall, our results in the meta-analyses demonstrated a greater bias and wider limits of agreement in those devices using PPG compared to a chest strap. This is a somewhat expected finding since the measurement error of the chest strap seems minimal compared to electrocardiogram monitoring [69]. However, given that wearing chest straps is uncomfortable for many people and that HR monitoring via PPG (usually placed on the wrist, i.e., smartwatches and bracelets) has greater acceptability in the general population, it is important that future validity studies use PPG technology and aim to obtain accurate VO2max estimations with it. In a previous INTERLIVE article, we discussed several factors affecting the accuracy of PPG technology, such as skin tone, motion artifacts, contact pressure, and ambient temperature [19]. Recommendations from that article should be considered to ensure best practice in the validity, testing, and reporting of PPG-based HR wearables estimating VO2max. Lastly, all available literature estimated VO2max while running. Thus, future validity studies are needed in other activities, such as cycling or walking, to cover a broader range of activities.
The statistical analysis used in the available validity studies was often inappropriate, and consequently, future protocols should use the statistical approaches considered appropriate in validation studies. We recommend using the Bland-Altman limits of agreement as the main analysis and a percentage error measure (e.g., MAPE) as complementary information. Overall, the application of the best-practice recommendations from the INTERLIVE consortium would be beneficial for stakeholders by ensuring a more valid and transparent metric derived from their devices, as well as for users, who would receive more accurate and reliable information about their VO2max level and, therefore, their health status.

Conclusion
This systematic review and meta-analysis from the INTERLIVE consortium summarizes the validity of VO2max estimated from consumer wearables and provides best-practice recommendations for future validation protocols. The meta-analysis suggests that the estimation of VO2max by wearables that use exercise-based algorithms provides higher accuracy than those based on resting methods. The exercise-based estimation seems to be optimal for application at the population level, yet the estimation error at the individual level and, therefore, use for sport/clinical purposes still needs further improvement. The INTERLIVE network hereby provides best-practice recommendations to be used in future protocols.

To consider at least 2 maximal-effort criteria during the incremental test
A verification phase after the maximal test is recommended to corroborate the VO2max
Any type of exercise testing is accepted (e.g., walking, running, or biking) as long as it adapts to the type of activity in which the consumer wearable is intended to be validated
Control the standardized conditions before the maximal exercise test

Consumer wearable
Follow the manufacturer's instructions for the VO2max estimation protocol
Provide all the setup information required by the devices
If exercise mode is available, choose the one that best reflects the activity to be performed
Ensure an optimal GPS connection when this data is used

Processing

Reference standard
If VO2max is averaged within a time window, it is recommended to use a 15- to 30-s window
If a breath-by-breath average is used, a 15-breath running average is recommended
Confirm that the maximum-effort criteria were met when interpreting the VO2max values

Time interval between evaluations
In those wearables using resting conditions, no time interval is needed
In exercise conditions, an interval between 24 and 48 h is recommended

Statistical analysis
Bland-Altman with limits of agreement
Least products regression of the differences against the means
MAPE