Background

There are many unresolved issues regarding data reduction and quality assessment of accelerometry data to determine physical activity (PA) and sedentary behavior (SED). These challenges lead to great variation in procedures used and criteria applied to define a valid measurement [1]. As behavior varies greatly over time, an important aspect of accelerometer measurements is how many days or periods of measurement need to be included to obtain reliable estimates of habitual activity level. This is particularly true when children live in an area with a significant change in weather during different seasons [2,3,4]. Measurement error caused by poor reproducibility (amongst other sources of error) may preclude researchers from arriving at valid conclusions and possibly misinform society regarding PA as a target for public health initiatives [5].

Most studies in children apply a criterion of a minimum of 3 or 4 wear days to constitute a reproducible accelerometer measurement [1]. Although findings vary between studies in both adults [6,7,8,9,10] and children [11,12,13,14,15,16,17,18,19,20,21,22], most evidence suggests that a reasonable reliability (i.e., intra-class correlation (ICC)) of ~ 0.70–0.80 is achieved with 3–7 days of monitoring. However, the reproducibility might vary across PA domains, as suggested by a study in adults [7]. Given the importance of evaluating effects of interventions in preschool [23, 24] and school settings [25, 26], especially considering that there might be a reactivity effect for PA [27], it is critical to determine activity level during school hours and leisure time separately. Moreover, many studies target in-school versus out-of-school, or weekday versus weekend activity patterns [28,29,30,31]. The validity of such studies depends on whether the intended associations, patterns or effects can be reliably captured during the applied monitoring period.

Most previous studies have estimated the reliability and the number of days needed based on the Spearman Brown prophecy formula for measurements conducted over a single 7-day period [8, 9, 11, 13,14,15,16,17,18,19,20,21,22]. Such study designs have been criticized for being likely to underestimate the number of days needed, and they should therefore be interpreted with caution [32,33,34]. Importantly, such results are in principle only generalizable to the included days, as inclusion of additional days, weeks or seasons will add variability. Few studies have determined the reliability for longer periods, all of which have shown considerable intra-individual variation [34,35,36,37,38]. Reliability has been shown to be ~ 0.70–0.80 with limits of agreement (LoA) ~ 1 standard deviation (SD) for one out of two and three consecutive weeks of measurement in preschool children and adults, respectively [37, 38]. However, poorer estimates are found in studies considering seasonality [34,35,36], leaving reliability estimates of ~ 0.50 for one week of monitoring in children [34, 36]. These findings agree with studies showing substantial seasonal variation in activity level in children and adolescents [2,3,4], which is obviously not captured when relying on a single measurement period. Finally, agreement (i.e., LoA and/or standard error of the measurement), which provides researchers with a direct quantification of how much outcomes should be expected to vary over time [39,40,41], has not been reported for domain-specific PA or SED in children.

The aim of the present study was to determine the domain-specific reproducibility of accelerometer-determined PA and SED for one week of measurement obtained during two different seasons separated by 3–4 months in a large sample of children. We hypothesized great variability across weeks and reliability estimates lower than ICC = 0.80 for all accelerometer outcomes, but somewhat better reliability for school hours compared to leisure time.

Methods

Participants

The present analyses are based on data obtained in fifth grade children from the Active Smarter Kids (ASK) cluster-randomized trial, conducted in Norway during 2014–2015 [26, 42]. The main aim of the ASK study was to investigate the effect of school-based PA on academic performance and various health outcomes. Physical activity was measured with accelerometry at baseline (mainly May to June 2014) and follow-up (April to May 2015) in all children, as well as in approximately two-thirds of the children that we invited to complete a mid-term (January to February 2015) measurement. In the present study, we include the mid-term and the follow-up measurement, to allow for comparison of PA over two different seasons separated by 3–4 months. Additionally, as the intervention was ongoing at both these time-points, and we found no effect of the intervention on PA [26], we included both the intervention and the control groups. We have previously published a detailed description of the study [42], and provide only a brief overview of the accelerometer handling herein.

Our procedures and methods conform to ethical guidelines defined by the World Medical Association’s Declaration of Helsinki and its subsequent revisions. The Regional Committee for Medical Research Ethics approved the study protocol. We obtained written informed consent from each child’s parents or legal guardian and from the responsible school authorities prior to all testing. The study is registered in Clinicaltrials.gov with identification number: NCT02132494.

Procedures

Physical activity was measured using the ActiGraph GT3X+ accelerometer (Pensacola, FL, USA) [43]. During both measurements, participants were instructed to wear the accelerometer at all times over 7 consecutive days, except during water activities (swimming, showering) or while sleeping. Units were initialized at a sampling rate of 30 Hz. Files were analyzed at 10 s epochs using the KineSoft analytical software version 3.3.80 (KineSoft, Loughborough, UK). In all analyses, consecutive periods of ≥20 min of zero counts were defined as non-wear time [1, 44]. Results are reported for overall PA level (cpm), as well as minutes per day spent SED (< 100 cpm), in light PA (LPA) (100–2295 cpm), in moderate PA (MPA) (2296–4011 cpm), in vigorous PA (VPA) (≥ 4012 cpm), and in moderate-to-vigorous PA (MVPA) (≥ 2296 cpm), determined using previously established and validated cut points [45, 46].
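The data reduction rules above can be illustrated with a minimal sketch. It is not the KineSoft implementation. It assumes that counts from each 10 s epoch are scaled to counts per minute by multiplying by 6 before the cut points are applied, and that non-wear time is any run of at least 20 min of consecutive zero counts. The function names classify_epoch and wear_mask are hypothetical.

```python
from typing import List

# Cut points in counts per minute (cpm), as listed in the text.
SED_MAX_CPM = 100    # SED: < 100 cpm
MPA_MIN_CPM = 2296   # MPA: 2296-4011 cpm
VPA_MIN_CPM = 4012   # VPA: >= 4012 cpm

EPOCH_SECONDS = 10
EPOCHS_PER_MINUTE = 60 // EPOCH_SECONDS
NON_WEAR_EPOCHS = 20 * EPOCHS_PER_MINUTE  # >= 20 min of consecutive zeros


def classify_epoch(counts: int) -> str:
    """Classify one 10 s epoch after scaling its counts to cpm (assumption)."""
    cpm = counts * EPOCHS_PER_MINUTE
    if cpm < SED_MAX_CPM:
        return "SED"
    if cpm < MPA_MIN_CPM:
        return "LPA"
    if cpm < VPA_MIN_CPM:
        return "MPA"
    return "VPA"


def wear_mask(counts: List[int]) -> List[bool]:
    """Return True for epochs treated as wear time, False for non-wear."""
    mask = [True] * len(counts)
    i = 0
    while i < len(counts):
        if counts[i] != 0:
            i += 1
            continue
        # Find the end of the current run of zero counts.
        j = i
        while j < len(counts) and counts[j] == 0:
            j += 1
        if j - i >= NON_WEAR_EPOCHS:
            for k in range(i, j):
                mask[k] = False
        i = j
    return mask
```

In this sketch, MVPA is simply the sum of epochs labeled MPA and VPA, and minutes per day follow from dividing the epoch count in each category by 6.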

Results were reported for a 7-day week (mean of up to 7 days providing ≥8 h of wear time), weekdays (mean of up to 5 days providing ≥8 h of wear time), weekend days (mean of up to 2 days providing ≥8 h of wear time), school hours (mean of up to 5 days providing ≥4 h of wear time), afternoon (mean of up to 5 days providing ≥4 h of wear time), and total leisure time (mean of afternoon and weekend days). Data for a full day were restricted to hours 06:00 to 23:59, whereas we defined school hours from 09:00 to 14:00, and afternoon from 15:00 to 23:59. We only included children who provided data for all domains for analysis, that is, ≥ 3 weekdays, school days and afternoons, as well as ≥1 weekend day. We also conducted sensitivity analyses restricted to children providing ≥4 weekdays, school days and afternoons, and 2 weekend days.
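The inclusion rule can be written as a small predicate. The sketch below assumes that the number of valid days per domain has already been counted for each child using the wear time thresholds stated above. The ValidDays container and the is_included function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ValidDays:
    weekdays: int      # weekdays with >= 8 h of wear time
    school_days: int   # school-hour periods with >= 4 h of wear time
    afternoons: int    # afternoons with >= 4 h of wear time
    weekend_days: int  # weekend days with >= 8 h of wear time


def is_included(v: ValidDays, min_week: int = 3, min_weekend: int = 1) -> bool:
    """Main criterion: >= 3 weekdays, school days and afternoons, >= 1 weekend day."""
    enough_week = min(v.weekdays, v.school_days, v.afternoons) >= min_week
    return enough_week and v.weekend_days >= min_weekend


# The sensitivity analysis corresponds to min_week=4 and min_weekend=2.
child = ValidDays(weekdays=4, school_days=5, afternoons=4, weekend_days=1)
main_analysis = is_included(child)
sensitivity_analysis = is_included(child, min_week=4, min_weekend=2)
```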

Statistical analyses

Children’s characteristics were reported as frequencies, means and SDs. Differences between included and excluded children and differences in PA and SED between measurements were tested using a mixed effect model including random intercepts for children. Wear time was included as a covariate for analyses of PA and SED. Differences in PA and SED between measurements were reported as standardized effect sizes (ES = mean difference between seasons / √[(SDwinter² + SDspring²)/2]).
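As a worked illustration of the standardized effect size, the sketch below divides the seasonal mean difference by the square root of the average of the two seasonal variances. The sign convention (spring minus winter) and the numbers in the example are assumptions for illustration and are not taken from the study data.

```python
import math


def standardized_effect_size(mean_winter: float, mean_spring: float,
                             sd_winter: float, sd_spring: float) -> float:
    """ES = mean difference / sqrt((SD_winter^2 + SD_spring^2) / 2)."""
    pooled_sd = math.sqrt((sd_winter ** 2 + sd_spring ** 2) / 2)
    return (mean_spring - mean_winter) / pooled_sd


# Hypothetical values: MVPA of 50 vs 60 min/day with an SD of 20 in both seasons.
es = standardized_effect_size(50.0, 60.0, 20.0, 20.0)
```

With these hypothetical inputs the pooled SD is 20 and the effect size is 0.5.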

We estimated reproducibility using a week-by-week approach [37, 38]. Reliability for a single week of measurement (ICCs) was assessed using variance partitioning applying a one-way random effect model not controlling for season (i.e., determining reliability based on an absolute agreement definition) and a two-way mixed effect model controlling for season (i.e., determining reliability based on a consistency definition) [47]. In the latter model, the variance attributed to season is removed; thus, the model determines to what degree children retain their rank between the two time points, without respect to the mean difference between the two measurements. All models were adjusted for wear time. Number of weeks needed to obtain a reliability of 0.80 (N) was estimated using the Spearman Brown prophecy formula (ICC for average measurements [ICCk]) [6, 47]: N = ICCt/(1-ICCt)*[(1-ICCs)/ICCs], where N = the number of weeks needed, ICCt = the desired level of reliability, and ICCs = the reliability for a single week.
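The two reliability definitions and the Spearman Brown calculation can be sketched as follows. This is a simplified illustration based on classical ANOVA mean squares for a complete children-by-weeks table. It does not adjust for wear time and is not the SPSS mixed model procedure used in the study. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def single_week_iccs(data: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (one-way ICC, two-way consistency ICC) for a single week.

    data holds one row per child and one column per measurement week.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between children
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between seasons

    ms_rows = ss_rows / (n - 1)
    # One-way model: season differences remain in the within-child error.
    ms_within = (ss_total - ss_rows) / (n * (k - 1))
    # Two-way model: variance attributed to season is removed from the error.
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    icc_one_way = (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within)
    icc_consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc_one_way, icc_consistency


def weeks_needed(icc_single: float, icc_target: float = 0.80) -> float:
    """Spearman Brown: N = ICCt/(1-ICCt) * [(1-ICCs)/ICCs]."""
    return icc_target / (1 - icc_target) * ((1 - icc_single) / icc_single)


# Hypothetical data: every child is exactly 2 units higher in the second week.
toy = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]
icc_abs, icc_cons = single_week_iccs(toy)
```

In the toy data the children keep their rank perfectly, so the consistency ICC is 1, whereas the one-way ICC is pulled down to 0.25 by the seasonal shift. For the prophecy formula, a single-week ICC of 0.50 gives N = 4 weeks for a target reliability of 0.80.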

We additionally applied Bland Altman plots to assess agreement, showing the difference between two subsequent weeks as a function of the mean of the two weeks [39]. We calculated 95% LoA from the residual variance (i.e., within-subjects) error term based on the variance partitioning models (LoA = √residual variance *√2*1.96) [41].
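The agreement calculation can be sketched in a few lines. The code assumes that the residual (within-subject) variance has already been estimated from a variance partitioning model, and the numeric value in the example is hypothetical. The helper names are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def limits_of_agreement(residual_variance: float) -> float:
    """95% LoA half-width: sqrt(residual variance) * sqrt(2) * 1.96."""
    return math.sqrt(residual_variance) * math.sqrt(2) * 1.96


def bland_altman_points(week1: Sequence[float],
                        week2: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (mean of the two weeks, difference between weeks) per child."""
    return [((a + b) / 2, b - a) for a, b in zip(week1, week2)]


# Hypothetical residual variance of 100 (e.g., in (min/day)^2).
loa = limits_of_agreement(100.0)
points = bland_altman_points([40.0, 55.0], [50.0, 45.0])
```

Dividing the LoA by the sample SD of the outcome expresses agreement in SD units, which is how the LoA are summarized in the Results.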

All analyses were performed using IBM SPSS v. 23 (IBM SPSS Statistics for Windows, Armonk, NY: IBM Corp., USA). A p-value < .05 indicated statistically significant findings.

Results

Participants’ characteristics

Of the 1129 included children in the ASK-study, 676 provided accelerometer data at the mid-term and post measurement, of whom 465 children aged 10.4 to 11.6 years (69% of the total sample; 48% boys) provided valid data for analyses of PA (Table 1). There were no differences between the included (n = 465) and excluded (n = 664) children in anthropometry (p ≥ .064). During school hours, the children included in the present analyses were more SED (mean (95% CI) 3.7 (1.5–5.9) min/day, p = .001), and demonstrated lower overall (−41 (−64 to −18) cpm, p = .001) and intensity-specific PA levels (MPA: −1.7 (−2.4 to −0.9) min/day; VPA: −1.3 (−2.1 to −0.6) min/day; MVPA: −3.0 (−4.3 to −1.7) min/day; all p ≤ .001), compared to the excluded children. Differences were smaller in other domains.

Table 1 The children’s characteristics (n = 447–465). Values are mean (SD) if not otherwise stated

In general, overall PA level (cpm) and intensity-specific PA were significantly higher, and SED was significantly lower, in the spring than in the winter (ES = 0.08–0.75 for the total 7-d week) (Table 2; Additional file 4: Table S1). Differences were modest during school hours (ES = 0.01–0.21), but much greater during the afternoon (ES = 0.27–0.87) and during total leisure time (ES = 0.23–0.91). Differences for weekdays (ES = 0.06–0.67) and weekend days (ES = 0.10–0.60) were similar.

Table 2 Physical activity level during the winter and spring measurements. Values are mean (SD)

Reproducibility

As PA and SED differed substantially between the two measurements for most outcomes, reliability estimates improved when controlling for season (i.e., applying a consistency definition of reliability) versus when not controlling for season (i.e., applying an absolute agreement definition of reliability). The difference between estimates was greatest for variables having the greatest bias. A 7-d week provided ICC values of 0.31–0.66 when not controlling for season, and 0.51–0.66 when controlling for season (Table 3). Reliability estimates for weekdays were very similar to those for an overall 7-d week, whereas weekend days clearly provided poor reliability estimates whether controlling for season or not (ICC = 0.01–0.43). In contrast, reliability estimates for school hours were modest, being more or less identical whether controlling for season or not (ICC = 0.56–0.63) (Table 4), due to the small bias. Estimates for afternoon were lower than for school hours, with clear improvements when controlling for season, due to the much larger bias between seasons than for school hours (ICC = 0.15–0.50 when not controlling for season vs. 0.42–0.53 when controlling for season). The total leisure time estimates were similar to the estimates for the afternoon (ICC = 0.14–0.59 and 0.42–0.61 when not controlling and controlling for season, respectively).

Table 3 The reproducibility of different outcome variables for one out of two weeks of measurement for a 7-d week, weekdays and weekend days, applying a ≥ 3 weekdays and ≥ 1 weekend day wear time criterion (n = 465 (69%) children)
Table 4 The reproducibility of different outcome variables for one out of two weeks of measurement for school hours, afternoon and total leisure time, applying a ≥ 3 weekdays and ≥ 1 weekend day wear time criterion (n = 465 (69%) children)

When controlling for season, the number of weeks needed to obtain a reliability of 0.80 as estimated by the Spearman Brown prophecy formula was 2.2–3.9 for an overall 7-d week, 2.2–3.2 for weekdays, 5.3–31.4 for weekend days, 2.4–3.0 for school hours, 3.5–5.6 for afternoon, and 2.5–5.5 for total leisure time. A substantial intra-individual variation over time for all outcomes is also indicated by the LoA (Table 3; Table 4; Additional file 1: Figure S1, Additional file 2: Figure S2 and Additional file 3: Figure S3), which approximated factors of 1.3–1.8, 1.3–1.8, 1.6–2.5, 1.7–1.8, 1.4–2.0, 1.3–1.9 SDs of the sample PA levels, respectively. Overall PA level (cpm) was the least reproducible outcome across models. In general, we found minor improvements in reproducibility when the analyses were restricted to those children (n = 257, 38%) providing data for ≥4 weekdays and 2 weekend days (Additional file 5: Table S2).

Discussion

The aim of the present study was to determine domain-specific reproducibility of accelerometer-determined PA and SED in children over two separate weeks of monitoring undertaken 3–4 months apart. Our results suggest that one week of accelerometer monitoring has poorer reliability than suggested by most previous studies that have relied on a single monitoring period. Further, our findings revealed that reliability was superior during school hours and weekdays as compared to leisure time.

Previous studies conclude that 3–7 monitoring days provide reliable estimates of PA and SED in children [11,12,13,14,15,16,17,18,19,20,21,22]. These studies have estimated reliability based on day-by-day analyses, typically using a single 7-day monitoring period. In contrast, but consistent with studies including measurements over several seasons [34,35,36], our findings show that longer and/or a higher number of monitoring periods are required to estimate PA and SED with acceptable confidence. Mattocks [36] determined overall PA, MVPA and SED over four 7-day periods over approximately one year using the Actigraph 7164 accelerometer in 11–12-year-old children. The ICC for one single period of measurement varied from 0.45 to 0.59 across outcome variables. Wickel & Welk [34] found an ICC of 0.46 for one out of three 7-day periods to assess steps for the Digiwalker pedometer in 80 children aged ~ 10 years. The present findings, along with previous findings, call into question the validity of one week of measurement to determine children’s habitual activity level.

While levels of PA and SED during weekdays and school hours can be estimated with a reliability of 0.80 across 2–3 week-long monitoring periods, 3–6 weeks of measurement are required to achieve this level of precision for afternoon hours and for total leisure time. This finding is consistent with our hypothesis and a previous study in adults [7]. As variation in behavior is restricted by the school curriculum during school hours, the higher reliability compared to leisure time is expected. During school hours, students follow the same schedule, including a roughly similar time spent in, for example, physical education and recess, during both monitoring periods. Thus, although some variation in levels and types of activities could be expected also during the school day, this variation would be restricted by the applied teaching methods and established curriculum. In contrast, our findings indicate that children vary their activity levels greatly across seasons during leisure time, likely according to climate, weather, and daylight. The data for the present study were collected in Sogn og Fjordane County in western Norway, and we believe several factors might have caused the great variation in leisure time activity level. In the winter (January–February), daylight fades after 15:00–17:00, whereas the spring measurement (April–May) was conducted when there is daylight until bedtime, giving the children a greatly different opportunity to be active outside. Moreover, the weather can be very different, inviting children to play outside in the spring, whereas they are prone to spend their time inside in the winter. There are also large geographic differences between areas. The coast has a mild and wet climate throughout the year, whereas the inland and mountain areas can have a cold winter and generally less precipitation. Thus, children are likely to spend their time in different modes of activity in different locations across the monitoring periods. Because many movement patterns (e.g., swimming, cycling, and skiing) might be poorly captured by accelerometers [48], such variation in preferred activities over time has the potential to greatly influence the stability of accelerometer-derived PA and SED levels during leisure time. According to our findings, these (amongst other) sources of variability influence the weekend days the most, for which we found that 5–32 weeks of measurement were required to achieve an ICC = 0.80. Thus, our results suggest that PA and SED on weekend days, whether controlled for season or not, cannot be measured reliably using a feasible protocol. Yet, it should be kept in mind that the comparison of weekend days only includes 1 or 2 days, compared to afternoons including 3, 4, or 5 days; thus, a different precision would be expected, despite both domains being leisure time. Finally, the difference in variance between the two monitoring periods (Table 2) could be an explanation for poor to modest reliability estimates, as the statistical model assumes compound symmetry and the ICC is sensitive to asymmetry [47].

As noise in exposure (x) variables will lead to attenuation of regression coefficients (regression dilution bias), and noise in outcome (y) variables will increase standard errors [5], unreliable measures weaken researchers’ ability to draw valid conclusions. In epidemiology, researchers are in general interested in the long-term habitual PA level, not the very recent days. For example, when evaluating school interventions, we assume a five-day monitoring period provides a reasonable estimate of true PA or SED. In the school context, our results provide support for modest errors, despite estimates being poorer than suggested in many previous studies. In contrast, studies evaluating home-based interventions or studies investigating patterns of weekday versus weekend, or in-school and out-of-school activity levels, are prone to type 2 errors if relying on a 7-day monitoring period that provides an insufficient snapshot of children’s habitual activity level. Consequently, such studies must be sufficiently powered. Sample size calculations for designs with repeated measurements normally correct expected SDs for the expected correlations between the baseline and follow-up measurements (i.e., studies use SD for change rather than the cross-sectional SD); thus, the sample size needed is less than for performing cross-sectional analyses if the pre-post correlation is > 0.50. Yet, this benefit is not achievable for PA outcomes during leisure time given the present results. These findings are therefore important for informing study designs and sample size calculations, as well as for interpretation of study findings.
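The 0.50 threshold for the pre-post correlation follows from the variance of a difference score. A minimal sketch, assuming equal SDs at baseline and follow-up and a required sample size proportional to the squared SD of the analyzed outcome, is given below. The function names are hypothetical and the correlations in the example are illustrative values, not study results.

```python
import math


def sd_of_change(sd: float, r: float) -> float:
    """SD of the change score when both time points share the same SD.

    Var(follow-up - baseline) = 2 * SD^2 * (1 - r).
    """
    return sd * math.sqrt(2 * (1 - r))


def relative_sample_size(r: float) -> float:
    """Sample size for a change-score analysis relative to a cross-sectional one."""
    return 2 * (1 - r)


# Illustrative pre-post correlations.
for r in (0.30, 0.50, 0.80):
    print(f"r = {r:.2f}: relative sample size = {relative_sample_size(r):.2f}")
```

Under these assumptions, the ratio equals 1 at r = 0.50, falls below 1 for higher correlations, and exceeds 1 for lower correlations, which is why low reproducibility removes the usual advantage of a repeated-measures design.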

Although an increased accelerometer monitoring length improves reproducibility and thus improves the validity of study conclusions, given that accelerometry is a valid measure of the outcome, the burden for study participants should be kept minimal to maximize response rate and compliance. Yet, we have previously performed 2- and 3-week monitoring protocols in preschool children and adults, respectively, without any problems regarding compliance [37, 38]. More recently, we have also successfully performed a 2-week monitoring protocol in larger samples of children, adults and older people, demonstrating this protocol’s acceptance in various contexts. Yet, performing measurements over separate as opposed to consecutive periods might pose an increased burden for participants, as well as for researchers. Importantly, the required monitoring volume is a matter of the research question posed, as population estimates on a group level require less precision than individual-level estimates used for correlational analyses [33].

Strengths and limitations

The main strength of the present study is the inclusion of a large and representative sample of children. As reliability estimates depend on the sample variation [39,40,41], the reliability estimates presented herein should be generalizable to other large-scale population-based studies. Another strength is inclusion of measurements conducted 3–4 months apart, during two different seasons, as has only been analyzed in a few previous studies in children [34, 36]. This approach extends findings from previous studies in children that have mainly estimated reproducibility over a single short monitoring period [11, 13,14,15,16,17,18,19,20,21,22]. A limitation, though, is the inclusion of only two weeks and two seasons, as inclusion of more observations probably would introduce more variability and lead to more conservative reproducibility estimates [34, 36]. Moreover, Norway has profound seasonal differences in weather conditions. This characteristic might limit generalizability to areas with less pronounced seasonality. Finally, the inclusion of the intervention group in the current analyses might have added variation to the data, as the intervention group participants could be expected to change their PA level over time. Yet, the intervention was ongoing during both measurements, there was no effect of the intervention on PA levels [26], and reproducibility estimates differed marginally and non-systematically between the intervention and control groups. The maximum ICC difference between groups across the variables analyzed was 0.09 for a 7-day week, 0.06 for weekdays, 0.14 for weekend days, 0.06 for school hours, 0.07 for afternoon hours, and 0.09 for total leisure time (results not shown).

Conclusion

We conclude that the reproducibility for one week of accelerometer monitoring is poor to modest across different domains of PA when seasonal differences are considered. Reliability for a 7-day period was lower than in most previous studies relying on a single monitoring period, and reliability for leisure time and weekend days was lower than for school hours and weekdays. Longer or repeated measurement periods are favourable compared with one single 7-day period when assessing PA and SED by accelerometry, as this will reduce the possibility of type 2 errors in future studies.