1 Introduction

Increased participation in physical activity (PA) and a more active lifestyle are associated with a reduced risk of obesity and with the prevention of weight regain following weight loss [1,2,3,4,5,6]. Increases in PA not only elevate energy expenditure (EE), but can also influence the control of appetite and energy intake [7]. Thus, the quantification of PA and EE represents a primary area of interest in the study of appetite and energy balance. Wearable devices, relying primarily on accelerometry, have been available for the assessment of PA and EE in research environments for some time [8,9,10]. Commercial-grade wearable devices are increasingly used in large-scale PA and dietary research, but their use in such environments is dependent on their ability to accurately and precisely track and estimate the energy cost of a wide range of activities.

The ability to estimate EE using cost-effective and practical wearable devices has long been of scientific interest [11,12,13], as such devices would help overcome limitations associated with currently available techniques. For example, indirect calorimetry methods are generally limited to laboratory environments, and expensive stable isotopic criterion techniques provide mean estimates of daily EE over 10–14 days and do not capture daily variation in EE [14]. These issues constrain their use in large-scale research and limit their utility for the collection of continuous EE data over long periods in free-living individuals. Accurate estimates of EE from discreet wearable devices would add a new dimension to the assessment of free-living EE across a range of activities and population groups in health and disease. Recent developments in wearable technology and cloud storage capacity mean it is now theoretically possible and practical to continuously monitor EE patterns in the free-living individual [15]. However, inaccurate instruments are undesirable as they may bias the interpretation of data outcomes [16].

A body of literature validating wearable devices exists [17, 18], but product release is often faster than validation studies [19] and thus the accuracy of newer devices remains uncertain. Physiological sensors, including heart rate (HR) sensors [20], are commonplace in newer activity monitors [21] and such innovation may be bringing the accuracy of commercial devices in line with more established research-grade devices [22]. A linear relationship exists between oxygen consumption (VO2) and HR during moderate to high intensity activities [23, 24] and therefore monitoring HR at the minute level enables relative PA intensity [25, 26] or EE [27] to be estimated. It seems that combination approaches, in which physiological and movement variables are incorporated into predictive algorithms, improve the estimation of PA or EE relative to accelerometry alone [21, 28]. For HR to be used to monitor PA or EE in wearable activity monitors, it is imperative that HR estimates are valid in the populations and activities of interest.

There is considerable interest in measuring HR and EE with accuracy and precision in research, clinical and consumer environments. The purpose of the present study is to evaluate the validity of the HR and EE estimates of the Fitbit Charge 2 (FC2), a modern commercial-grade wearable device, and the EE estimates of the research-grade SenseWear Armband Mini (SWA) during sedentary, household, ambulatory and cycling tasks in a heterogeneous population.

2 Methods

2.1 Participants

A diverse sample (n = 59) was enrolled in the study (age range: 22–73 years, weight range: 49.2–105.99 kg) and participant characteristics are presented in Table 1. Participants were primarily recruited from the Leeds centre of the NoHoW trial (n = 44), a randomized controlled trial testing the efficacy of an ICT-based toolkit for weight loss maintenance across three European centres: United Kingdom (Leeds), Denmark (Copenhagen) and Portugal (Lisbon). The main trial is registered with the ISRCTN registry (ISRCTN88405328). Participants recruited from the NoHoW trial were provided with their own FC2. In addition, 15 participants were recruited from the local area. Exclusion criteria for the present study included: pregnancy; use of medications associated with alterations in metabolic rate; the inability to ambulate without assistance; and the presence or signs of cardiovascular, metabolic or renal disorders, illness or injury that would increase the risk of medical events during PA [29]. This study was conducted at the Appetite Control and Energy Balance research laboratory at The University of Leeds, and participants provided written informed consent for this specific study prior to participation. The experimental protocol was approved by The University of Leeds, School of Psychology ethics committee (PSC-407, 18/08/2018).

Table 1 Characteristics of the participants

2.2 Study protocol

Following body composition and RMR measurements (described below), participants transitioned to the exercise laboratory where the PA protocol was performed. Participants were initially seated for 5 min, followed by 5 min standing. Next, participants performed 5 min each of treadmill walking (4 km/h), incline walking (4 km/h, 5% incline), running (6–8 km/h) and incline running (6–8 km/h, 5% incline). Participants were then given a 3-min resting period before transitioning to a cycle ergometer, on which they performed 5 min of low-intensity cycling (30 watts) and 5 min of moderate-intensity cycling (60 watts). Lastly, after another resting period, participants performed a 5-min folding task and a 5-min sweeping task. Throughout this protocol, participants wore a Polar HR monitor, an FC2 and an SWA at all times, whilst breath-by-breath respiratory data were collected using a stationary metabolic cart.

2.3 Physical measurements

Participants arrived at the laboratory in a fasted state, having refrained from food, caffeine and exercise in the 12 h prior to testing. After completing a medical screening questionnaire and providing informed consent, height was measured without shoes using a stadiometer (Leicester height measure, SECA; UK). Blood pressure and resting HR were measured using an automatic sphygmomanometer (Microlife BP A2 Basic, Gentle Technology, Microlife, Clearwater, FL, USA, Inc.). Next, body composition was estimated using a 2-compartment model via air displacement plethysmography (BodPod, Life Measurement, Inc.; USA). The Siri equation [30] was used to derive absolute and percentage fat mass (FM) and fat-free mass (FFM), while body weight was obtained from the BodPod scales. The BodPod has been demonstrated to show excellent accuracy for the estimation of body composition [31].
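As an illustration of the 2-compartment calculation, the sketch below applies the Siri equation in its commonly cited form, percentage fat = (4.95 / body density − 4.50) × 100, to a whole-body density value. It is written in Python for illustration only; the function name, the input values and the assumption that body density (g/mL) is already available from the plethysmography output are ours and are not taken from the study.

```python
def siri_body_composition(body_density_g_ml: float, body_mass_kg: float) -> dict:
    """Two-compartment body composition from whole-body density.

    Uses the commonly cited form of the Siri equation:
        fat fraction = 4.95 / density - 4.50
    Inputs and names are illustrative assumptions.
    """
    if body_density_g_ml <= 0 or body_mass_kg <= 0:
        raise ValueError("density and mass must be positive")
    fat_fraction = 4.95 / body_density_g_ml - 4.50
    fat_mass_kg = fat_fraction * body_mass_kg
    return {
        "fat_percent": fat_fraction * 100.0,
        "fat_mass_kg": fat_mass_kg,
        "fat_free_mass_kg": body_mass_kg - fat_mass_kg,
    }


if __name__ == "__main__":
    # Hypothetical participant: density 1.05 g/mL, body mass 70 kg.
    result = siri_body_composition(1.05, 70.0)
    for key, value in result.items():
        print(f"{key}: {value:.2f}")
```

Fat-free mass is obtained here simply as body mass minus fat mass, which is what the 2-compartment model implies.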

2.4 Resting metabolic rate

Resting metabolic rate (RMR) was measured in a dimly lit room, in the supine position, for 30 min by an indirect calorimeter fitted with a ventilated hood (GEM, Nutren Technology Ltd.; UK). The GEM was calibrated in accordance with the manufacturer’s instructions prior to each measurement. RMR was calculated from VO2 and VCO2 in the steady state, defined as the 5-min block with the lowest coefficient of variation after the removal of the first 5 min of data [32]. If RMR data were unavailable (n = 2), RMR was estimated using the body mass index-specific RMR equation of Müller [33].
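The steady-state selection described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the text does not say which variable the coefficient of variation was computed on, nor whether the 5-min blocks overlap, so the sketch assumes a single minute-level series (for example VO2 or EE) and sliding, overlapping windows. The function name and example values are hypothetical.

```python
from statistics import mean, stdev


def lowest_cv_block(values, discard=5, block=5):
    """Return the `block`-minute window with the lowest coefficient of
    variation (CV), after discarding the first `discard` minutes.

    Assumptions (not from the study): one minute-level series and
    sliding windows; ties are resolved in favour of the earliest window.
    """
    usable = values[discard:]
    if len(usable) < block:
        raise ValueError("not enough data after discarding the lead-in")
    best = None  # (cv_percent, zero_based_start_index, window_mean)
    for start in range(len(usable) - block + 1):
        window = usable[start:start + block]
        window_mean = mean(window)
        cv_percent = stdev(window) / window_mean * 100.0
        if best is None or cv_percent < best[0]:
            best = (cv_percent, discard + start, window_mean)
    cv_percent, index, window_mean = best
    return {
        "start_minute": index + 1,   # 1-based minute of the recording
        "end_minute": index + block,
        "mean": window_mean,
        "cv_percent": cv_percent,
    }


if __name__ == "__main__":
    # Hypothetical minute-level EE values (kcal/min) from a resting test.
    series = [2.0, 1.5, 1.8, 1.2, 1.6, 1.0, 1.1, 0.9, 1.05, 0.95,
              1.0, 1.0, 1.0, 1.0, 1.0, 1.1, 0.9]
    print(lowest_cv_block(series))
```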

2.5 Instruments

2.5.1 Polar HR monitor

HR was assessed during the PA protocol using a Polar M400 HR Monitor Watch (Polar Electro, Kempele, Finland) and a Polar H7 chest strap (Polar Electro, Kempele, Finland), which transmitted second-level data via a Bluetooth connection. Data were uploaded to the Polar Flow online application, then downloaded and aggregated to the minute level for analysis. The Polar H7 served as the criterion measure of HR in the present study and has been shown to have a near-perfect correlation with electrocardiogram during many exercise modalities [34].

2.5.2 Fitbit Charge 2

The FC2 (Fitbit Inc., San Francisco, CA, USA) is a wrist-worn activity monitor which estimates HR, steps, EE and PA, based on data obtained from incorporated sensors via proprietary algorithms. HR estimates are obtained through a patented technology called ‘PurePulse’, which uses light-emitting diodes on the surface of the skin to monitor blood volume continuously [35]. Data are aggregated to the minute level and synced via the Fitbit mobile application to Fitbit servers through an application programming interface. Participants used the devices provided to them as part of the NoHoW trial; participants who were not part of this trial were provided with an FC2 for the duration of this study. The device was fitted a finger’s width above the non-dominant wrist and was configured with participant weight, height, sex and date of birth.

2.5.3 SenseWear Armband Mini

The SWA (BodyMedia Inc., Pittsburgh, PA) is a research-grade device which utilises a tri-axial accelerometer, heat-related sensors (heat flux, skin temperature, near body ambient temperature) and galvanic skin response to estimate EE. Data were downloaded and processed using the SenseWear® Pro 8.0 software, algorithm v5.2. The SWA was fitted with an elastic strap around the non-dominant arm and initialised using participant weight, height, sex, date of birth and smoking status.

2.5.4 Vyntus CPX

A stationary metabolic cart fitted with a respiratory facemask (Vyntus CPX, Jaeger-CareFusion, UK) was used as the criterion measure of EE in the present study. The Vyntus CPX has been demonstrated to be valid and to have excellent reliability (coefficient of variation <0.5%) [36] and is therefore used as a reference for the validation of portable systems [37]. The unit was calibrated prior to each lab visit in accordance with the manufacturer’s instructions. Breath-by-breath data from the device were aggregated to the minute level and EE (kcal/min) values were calculated from VO2 and VCO2 data assuming a minimal contribution of protein oxidation [38].
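To make the conversion from gas exchange to EE concrete, the sketch below uses the abbreviated Weir equation, which omits the urinary nitrogen (protein) term: EE (kcal/min) = 3.941 × VO2 + 1.106 × VCO2, with both gases in L/min. This is one widely used equation of this type and is offered as an assumption; the equation in reference [38] may use different coefficients. Function and variable names are illustrative.

```python
def ee_kcal_per_min(vo2_l_min: float, vco2_l_min: float) -> float:
    """Energy expenditure from gas exchange, abbreviated Weir equation.

    Assumes negligible protein oxidation (no urinary nitrogen term).
    The coefficients are those of the abbreviated Weir equation and are
    an assumption here, not necessarily the equation used in the study.
    """
    if vo2_l_min < 0 or vco2_l_min < 0:
        raise ValueError("gas exchange values must be non-negative")
    return 3.941 * vo2_l_min + 1.106 * vco2_l_min


if __name__ == "__main__":
    # Hypothetical minute of walking: VO2 = 1.00 L/min, VCO2 = 0.85 L/min.
    print(f"{ee_kcal_per_min(1.00, 0.85):.2f} kcal/min")
```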

2.6 Statistical analysis

All analyses were conducted in R version 3.5.1 and RStudio version 1.1.447. Statistical significance was accepted at p < 0.05 for all analyses. Descriptive statistics (mean ± SD) were calculated for age, weight, height, FM, FFM and RMR. Data for each of the outputs were matched by time for each participant. Next, the first minute of data from each activity performed in the activity protocol was removed, leaving minutes 2–5, which we considered as steady state. These data were then averaged to provide a mean HR in beats per minute (BPM) or EE (kcal/min) for each participant’s activity bout, and this figure was used in analyses.
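The bout-level averaging described above (drop minute 1, average minutes 2–5) can be sketched as below. The sketch is in Python rather than R and is illustrative only; the record layout (activity label, minute within the bout, value) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def steady_state_means(records, first_kept_minute=2):
    """Average minute-level values per activity bout, excluding the
    lead-in minute(s) before `first_kept_minute`.

    `records` is an iterable of (activity, minute_in_bout, value) tuples
    for ONE participant and ONE output (e.g. FC2 heart rate).
    """
    by_activity = defaultdict(list)
    for activity, minute_in_bout, value in records:
        if minute_in_bout >= first_kept_minute:
            by_activity[activity].append(value)
    return {activity: mean(vals) for activity, vals in by_activity.items()}


if __name__ == "__main__":
    # Hypothetical HR (BPM) for two 5-min bouts.
    hr = [("walk", m, v) for m, v in zip(range(1, 6), [70, 90, 92, 94, 96])]
    hr += [("cycle", m, v) for m, v in zip(range(1, 6), [85, 100, 102, 101, 101])]
    print(steady_state_means(hr))
```

Applying the same function to the criterion and device series yields one paired value per participant and activity, which is the unit of analysis in the comparisons that follow.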

Analyses for each of the devices, and for HR and EE, were conducted separately. In line with previous research [39], we employed a range of statistical tests. Firstly, agreement between criterion measure and devices was assessed with Pearson’s correlation coefficient. The method of Bland-Altman [40] was used to investigate the mean difference between criterion and device estimates, with limits of agreement set to the mean difference ± 1.96 × the standard deviation of the differences, using the ‘BlandAltmanLeh’ package in R. Root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) were calculated with the R package ‘Metrics’. Next, equivalence tests were conducted to compare device and criterion estimates using the ‘TOSTpaired.raw’ function within the ‘TOSTER’ package in R. For estimates to be considered equivalent, the 90% confidence interval needed to fall within the equivalence zone, which was considered to be ±10% of the criterion mean [41]. Lastly, the absolute percentage error, defined as the absolute value of the percentage error relative to the criterion, was explored. Differences in absolute percentage error by sex were investigated with a one-way analysis of variance (ANOVA) and a post-hoc Tukey honest significant difference test, conducted using ‘aov’ from the ‘stats’ package in R. We investigated the relationship between continuous variables (age, RMR, height, weight, FM, FFM, resting HR, systolic and diastolic blood pressure) and the absolute percentage error in EE and HR estimates with Pearson’s correlations, using the ‘cor’ function from the ‘stats’ package in R.
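For readers who want to see the agreement statistics spelled out, the following Python sketch computes the Bland-Altman bias and limits of agreement, RMSE, MAE and MAPE for paired criterion and device values, and then checks whether a 90% confidence interval for the mean difference lies within ±10% of the criterion mean. It does not reproduce the R packages named above. In particular, the equivalence check uses a normal quantile in place of the t quantile of a paired TOST, so it is a large-sample approximation and an assumption of this sketch. All function names and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def agreement_stats(criterion, device):
    """Bland-Altman bias and limits plus RMSE, MAE and MAPE.
    Differences are device minus criterion."""
    diffs = [d - c for c, d in zip(criterion, device)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
        "rmse": math.sqrt(mean([x * x for x in diffs])),
        "mae": mean([abs(x) for x in diffs]),
        "mape": 100.0 * mean([abs(x) / abs(c) for x, c in zip(diffs, criterion)]),
    }


def equivalence_check(criterion, device, margin=0.10, alpha=0.05):
    """Approximate paired equivalence check (z-based, NOT an exact TOST).

    Equivalent if the (1 - 2*alpha) confidence interval of the mean
    difference lies inside +/- margin * criterion mean.
    """
    diffs = [d - c for c, d in zip(criterion, device)]
    bias = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    z = NormalDist().inv_cdf(1.0 - alpha)  # about 1.645 for a 90% CI
    lower, upper = bias - z * se, bias + z * se
    bound = margin * mean(criterion)
    return {
        "ci": (lower, upper),
        "zone": (-bound, bound),
        "equivalent": (-bound < lower) and (upper < bound),
    }


if __name__ == "__main__":
    # Hypothetical steady-state EE values (kcal/min) for four bouts.
    criterion = [4.0, 5.0, 6.0, 5.0]
    device = [4.1, 4.9, 6.2, 5.0]
    print(agreement_stats(criterion, device))
    print(equivalence_check(criterion, device))
```

With samples as small as the four bouts in this toy example the normal approximation is too liberal; a t-based interval, as used by TOST procedures, would be wider.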

3 Results

The PA protocol was performed by all participants (n = 59); however, the running task (n = 49), the 5% incline run (n = 30) and the moderate cycling task (n = 58) were not performed by all participants owing to differences in physical fitness within the sample.

3.1 Energy expenditure

3.1.1 Fitbit Charge 2

Synchronisation errors occurred for two participants’ FC2 data and therefore data from 57 participants were included in FC2 analyses. The pooled result of all available bouts was a mean overestimation by the FC2 of 0.8 kcal/min, RMSE = 2.3 kcal/min, a correlation coefficient of r = 0.77, MAPE = 44% and a non-significant equivalence test (p > 0.05), indicating that the FC2 was not equivalent to the criterion measure overall. The activity-specific statistics and the number of bouts included in the analyses are presented in Table 2. The poorest accuracy was observed in the folding and sweeping tasks, in which the FC2 overestimated EE with MAPE values of 93% and 81%, respectively (Fig. 1). The best accuracy, and statistical equivalence, was observed in the incline running task (MAPE = 12%). A Bland-Altman plot of the overall error is shown in Fig. 2, for which the 95% limits of agreement were −3.52 and 5.14 kcal/min.

Table 2 Statistics detailing the validity of EE estimates obtained from the FC2 (above) and SWA (below)

Fig. 1 A bar plot detailing the mean absolute percentage error (MAPE) of EE estimates from the SWA (yellow) and the FC2 (grey) for each of the activities performed in this study

Fig. 2 Overall Bland-Altman plots of EE estimates from the SWA and FC2 relative to the criterion indirect calorimetry measure (Vyntus CPX). Data are displayed as kcal/min. ‘Differences’ represents device estimates minus criterion estimates, and the mean difference is shown by the middle dashed line. The upper and lower dashed lines represent the upper and lower 95% limits of agreement. ‘Mean of measures’ represents the average of the criterion and device estimates. The density plots visualise the distribution of data points over the differences between the measures and the means of the measures

3.1.2 SenseWear Armband

EE data were available for all participants from the SWA and thus data from 59 participants were included in the SWA analyses. The pooled result of all available bouts was a mean overestimation of 0.03 kcal/min, RMSE = 1.7 kcal/min, a correlation coefficient of r = 0.82, MAPE = 29% and a significant equivalence test (p < 0.001), indicating that the SWA was equivalent to the criterion measure overall. The activity-specific statistics and the number of bouts included in the analyses are presented in Table 2. The SWA demonstrated the poorest accuracy in the folding task, in which it overestimated EE (MAPE = 83%). The lowest MAPE values were observed in the walking (MAPE = 14%) and 5% incline walking tasks (MAPE = 13%), which were an overestimation and an underestimation relative to the criterion measure, respectively (Fig. 1). Equivalence testing showed statistical equivalence between the SWA and the criterion measure during walking only. A Bland-Altman plot of the overall error is shown in Fig. 2, for which the 95% limits of agreement were −3.33 and 3.38 kcal/min.

3.1.3 Heart rate (HR)

A Polar HR connectivity error occurred for one participant and thus HR analyses were conducted with 56 of the 57 participants with FC2 data. The pooled result of all available bouts was 98 ± 27 BPM (Polar) vs 99 ± 29 BPM (FC2), RMSE = 20 BPM, a correlation coefficient of r = 0.75, MAPE = 13% and a significant equivalence test (p < 0.001), indicating statistical equivalence. A Bland-Altman plot for errors in HR illustrates the agreement between criterion HR and FC2 HR by displaying the mean difference and 95% limits of agreement (Fig. 3); the 95% limits of agreement were −37.94 and 39.73 BPM. Activity-specific Bland-Altman plots are presented for all tasks in Fig. 4 and accuracy statistics are presented in Table 3.

Fig. 3 Overall Bland-Altman plots of HR estimates from the FC2 relative to the criterion measure (Polar chest strap). Data are displayed as beats per minute. ‘Differences’ represents device estimates minus criterion estimates, and the mean difference is shown by the middle dashed line. The upper and lower dashed lines represent the upper and lower 95% limits of agreement. ‘Mean of measures’ represents the average of the criterion and device estimates. The density plots visualise the distribution of data points over the differences between the measures and the means of the measures

Fig. 4 Activity-specific Bland-Altman plots for HR estimates from the FC2 relative to the criterion measure (Polar chest strap). Data are displayed as beats per minute. ‘Differences’ represents device estimates minus criterion estimates, and the mean difference is shown by the middle dashed line. The upper and lower dashed lines represent the upper and lower 95% limits of agreement. ‘Mean of measures’ represents the average of the criterion and device estimates. The density plots visualise the distribution of data points over the differences between the measures and the means of the measures

Table 3 Statistics detailing the validity of HR estimates obtained from the FC2, measured in beats per minute

3.2 Predictors of absolute percentage error

Using the available data, no significant correlations were observed between any continuous variable and the absolute percentage error for HR or EE. ANOVA tests for sex differences in EE absolute percentage error were not significant for either the SWA or the FC2. In the HR comparison, a significant difference was observed between male bouts (n = 184) and female bouts (n = 348), with the absolute percentage error for males being significantly higher (F = 4.158, p = 0.042).

4 Discussion

This study investigated the validity of EE and HR estimates from the FC2 and EE estimates from the SWA in a heterogeneous population performing a variety of tasks, by comparing HR estimates to a HR chest strap (Polar) and EE estimates to a stationary metabolic cart (Vyntus CPX). The principal findings are that (i) the research-grade SWA was more accurate than the commercial-grade FC2 overall and (ii) the HR estimates of the FC2 were generally in closer agreement with the criterion measure than its EE estimates.

The FC2, one of the newest Fitbit activity monitors, has been investigated previously for its validity in estimating EE relative to indirect calorimetry [19, 42], but this study provides a direct comparison with the SWA, a more established and commonly used research-grade device, using a range of activities. Our results substantiate previous research concluding that the SWA is more valid for the estimation of EE than commercial activity monitors [21, 22]. That said, neither the FC2 nor the SWA was consistently equivalent to the criterion across the range of activities performed, with MAPE values >25% in some activities.

Large overestimations were observed for the FC2 during the household tasks. This most likely originates from the reliance on wrist accelerometry, a recognised limitation of devices located at this wear site [43]. Movements such as folding and sweeping, which involve rapid movements of the hand but are not particularly energetically demanding (typically ~4 metabolic equivalents) [44], were overestimated. This is the opposite of the issue faced by more traditional devices, which were worn on the hip and underestimate the energy cost of tasks with limited ambulation (e.g. household tasks) [45, 46]. Notably, the MAPE values for the FC2 were lowest in running activities (indicating a high degree of accuracy) and higher during walking activities. This finding reflects the results of a recent meta-analysis published by our group, in which the pooled results from five comparisons for the Fitbit Charge HR (the model prior to the FC2) showed a significant, moderate to large overestimation relative to criterion measures of EE during ambulation and a non-significant overestimation during running [21]. Whilst we are limited in our ability to comment on the underlying cause of this error owing to the proprietary nature of the algorithms, it is interesting to note that the greatest overestimation of HR was observed in the walking tasks. If HR is incorporated in the FC2 EE prediction algorithm, this could partially explain this result.

The performance of the SWA for the estimation of total daily EE is well recognised [47,48,49]. However, its accuracy in specific activity types is less established [50]. Indeed, significant underestimations relative to indirect calorimetry in running at higher speeds (> 9.9 km/h) have been reported [51], and in a validation study involving cycling the SWA again significantly underestimated EE [52]. Data from the CALERIE study showed a mean bias in total daily EE estimates of −1.6 ± 261 kcal/d when compared with doubly labelled water, yet when the data were tertiled by total daily EE, an underestimation of 162 kcal/d in the highest total daily EE group was observed [53]. The favourable results overall and in comparisons to doubly labelled water may be largely influenced by the accuracy of the resting EE equations selected by the manufacturers, which are derived from participant characteristics [46]. The present results offer some support for this supposition and indicate that the accuracy of the SWA is dependent on the PA level of the individual.

The conclusion that the estimates of HR from the FC2 are typically more accurate than EE estimates is reflective of previous research [54, 55]. When HR estimates were aggregated across all available bouts, the HR estimates of the FC2 were statistically equivalent to the criterion measure. Error in specific activity types was greater but the FC2 was statistically equivalent in most activity types. A recent study reported that erratic movements and a greater HR were associated with an increased error in HR [56] and another concluded that the error was exacerbated with increasing exercise intensity [57]. In contrast, our results showed the highest error in the walking task, yet the greatest accuracy in the running and sedentary tasks. The observation of the greatest error in walking is similar to that reported in a previous study investigating the Fitbit Surge device which showed a greater error in HR during ambulatory tasks [55]. In contrast, two other studies investigating the FC2 report small underestimations in HR during walking [42, 56].

We identified no significant continuous correlates of the error for each device and this includes body composition, which we believe to be a novel investigation within this field. However, the percentage error in HR was significantly greater in males, when compared to females. Whilst the proprietary nature of the smoothing algorithms makes understanding the observed error challenging, photoplethysmography technology is likely to be influenced by device position and skin conditions which may differ between males and females [58]. Prior to the exercise condition the position and tightness of the FC2 were standardised for all participants and it therefore seems unlikely that this played a role in the observed error. It remains to be seen whether the free-living performance of the FC2 will differ between participants in less controlled environments and this should be addressed in future research.

4.1 Implications

The apparent inability of the FC2 to estimate EE accurately ‘out of the box’ is a primary limitation for energy balance research, particularly when the numerous benefits of cost, cloud storage and acceptance from participants are considered [59, 60]. Our data indicate that it may be more appropriate to use commercial activity trackers, in their current format, to infer PA from step counts or to estimate HR, which are generally observed to be more valid than EE estimates [17]. Alternatively, the application of metrics such as the heart rate reserve [26], which can be used to define minute-level relative intensity from HR data, may be preferred. These findings are important for studies utilising the FC2 for longitudinal data collection.
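As an example of how minute-level HR could be converted to relative intensity, the sketch below computes the percentage of heart rate reserve in its usual form, %HRR = 100 × (HR − resting HR) / (maximal HR − resting HR). The study does not specify how maximal HR would be obtained, so it is passed in as a parameter here; the function name and the example values are hypothetical.

```python
def percent_hr_reserve(hr_bpm: float, resting_hr_bpm: float, max_hr_bpm: float) -> float:
    """Relative intensity as a percentage of heart rate reserve (HRR).

    max_hr_bpm must be supplied by the caller (measured or estimated);
    how it is obtained is outside the scope of this sketch.
    """
    if max_hr_bpm <= resting_hr_bpm:
        raise ValueError("maximal HR must exceed resting HR")
    return 100.0 * (hr_bpm - resting_hr_bpm) / (max_hr_bpm - resting_hr_bpm)


if __name__ == "__main__":
    # Hypothetical participant: resting HR 60 BPM, maximal HR 180 BPM.
    minute_level_hr = [75, 96, 120, 150]
    for hr in minute_level_hr:
        print(hr, round(percent_hr_reserve(hr, 60, 180), 1))
```

Because the metric is normalised to each participant’s own resting and maximal HR, it depends only on the HR estimate, which in the present data was more accurate than the EE estimate.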

An accurate and objective estimate of EE, in combination with an estimate of change in energy storage, can be used to estimate energy intake [61] and therefore determine misreporting through the ‘solving’ of the energy balance equation [53]. Given the centrality of energy intake and EE to the development of obesity, it is vital to be able to estimate energy intake and EE with precision and accuracy in free-living individuals. Self-reported energy intake is still widely used in research, yet it is well established that this approach is limited by issues of misreporting [16]. Mathematical models to estimate energy intake from body weight have been developed and validated [62]. However, these models make assumptions about EE levels, which are unlikely to be constant between and within individuals during weight loss and maintenance interventions [1]. An inexpensive, objective estimate of EE will therefore improve energy intake estimates from mathematical models, and whilst devices such as the FC2 show large inaccuracies, it is likely that, in their current form, they would be superior to an assumption of constant PA EE.
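The ‘solving’ of the energy balance equation mentioned above amounts to energy intake = EE + change in energy stores. A minimal Python sketch follows, in which the change in stores is derived from changes in FM and FFM multiplied by energy densities. The energy density values in the example are illustrative placeholders chosen by us, not values from this study, and the function name is hypothetical.

```python
def energy_intake_kcal_per_day(
    ee_kcal_per_day: float,
    delta_fm_kg: float,
    delta_ffm_kg: float,
    days: float,
    fm_kcal_per_kg: float,
    ffm_kcal_per_kg: float,
) -> float:
    """Mean daily energy intake from the energy balance equation:
        intake = expenditure + change in energy stores per day.

    The energy densities of fat mass and fat-free mass are modelling
    choices and must be supplied by the caller.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    delta_stores_kcal = delta_fm_kg * fm_kcal_per_kg + delta_ffm_kg * ffm_kcal_per_kg
    return ee_kcal_per_day + delta_stores_kcal / days


if __name__ == "__main__":
    # Hypothetical case: mean EE 2500 kcal/d, 1.0 kg FM and 0.2 kg FFM lost
    # over 30 days, with placeholder energy densities (assumptions).
    intake = energy_intake_kcal_per_day(
        2500.0, -1.0, -0.2, 30.0, fm_kcal_per_kg=9500.0, ffm_kcal_per_kg=1000.0
    )
    print(f"{intake:.0f} kcal/d")
```

Any bias in the EE estimate passes directly into the intake estimate, which is why the accuracy of the wearable matters for this application.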

Given that minute-level data can be accessed from commercial wearables in many instances, non-linear modelling could be applied to improve estimates of EE from these devices. Advanced statistical learning techniques are being used to estimate the EE and PA of tasks with better accuracy than linear regression approaches [63,64,65], and future research should investigate whether data from commercial activity monitors can be used to more accurately predict EE from sensor outputs. The incorporation of body composition and participant characteristics into non-linear models could improve estimates of EE beyond the estimates of current activity monitors [66].

4.2 Limitations

In this study, a number of different FC2 devices were used and data were synced with each participant’s mobile phone application. The lack of standardisation of devices may be considered a limitation, as different firmware could have been employed for different participants. However, this reflects the use of wearable devices in research environments, in which a study population are each provided with their own activity tracker and data are collected via an application programming interface.

Secondly, whilst this study provides an analysis of the accuracy of two activity monitors for a relatively limited series of prescribed activities, it provides little insight into the ecological validity of these devices. Substantial over- and underestimations from the FC2, depending on the specific activity in question, were observed and therefore the error observed in free-living individuals is likely to vary depending on the activities performed. Given that wearable devices will be used in free-living research, validation studies in free-living conditions are urgently required. Thirdly, this study was conducted in healthy, ambulatory individuals who were not pregnant, were not using medications associated with alterations in metabolic rate, and did not have cardiovascular, metabolic or renal disorders, illness or injury. It is possible that results would vary as the characteristics of study populations differ; however, with the exception of the sex difference in HR error, we found no evidence that this is the case.

5 Conclusion

The SWA is more valid for the estimation of EE when compared to the commercial grade FC2, yet neither activity monitor can consistently estimate EE with equivalence to a criterion measure. The FC2 provides better estimates of HR than it does EE, which are broadly, but not always, equivalent to criterion estimates across a broad range of activity types. It may therefore be more appropriate to focus on HR metrics for the assessment of PA, rather than EE in the FC2.