Background

Physical activity (PA), defined as “any bodily movement produced by skeletal muscles that results in energy expenditure” [1], is a multidimensional construct with dimensions such as setting (e.g. PA during leisure time, work), mode (e.g. walking, bicycling), frequency (e.g. times per week), duration (e.g. in hours) and intensity (e.g. light, moderate or vigorous) [2, 3]. PA has many health benefits across the lifespan, especially for people with physical disabilities and/or chronic diseases [4, 5]. Still, people with physical disabilities and/or chronic diseases tend to have an inactive lifestyle [6, 7]. Monitoring PA in this population is important, as it will provide insight into how much and what types of PA they engage in. Information on the amount and types of PA can help tailor PA promotion activities to individuals and uncover opportunities for improving PA for people with physical disabilities and/or chronic diseases. Furthermore, self-monitoring is one of the most effective behavior change techniques for improving PA, further stressing the importance of accurately measuring PA [8]. The need to measure and quantify PA in this varied population has also been emphasized by various research groups [9, 10], including the developers of the new World Health Organization’s PA guidelines [11].

A variety of instruments exist to measure PA in people with physical disabilities and/or chronic diseases. Instruments for PA measurement can be classified into two main categories: device-based instruments (e.g. accelerometers and pedometers; hereafter also referred to as devices) and self-report instruments (e.g. questionnaires and diaries). Both types of instruments have advantages and disadvantages [12] and are believed to measure different aspects of the PA construct [13]. Self-report instruments are assumed to capture the perceived PA behavior, whereas device-based instruments aim to capture the continuous acceleration of the body above a certain threshold [13]. The current consensus is that both types of instruments have their own value and should be used as complements to one another, depending on the research questions or clinical and/or practical goals [14].

Device-based instruments collect raw movement data (e.g. acceleration) from a variety of locations on the human body. These data are converted into different PA outcomes (e.g. energy expenditure, steps), often using dedicated algorithms [15]. These algorithms are commonly developed for a general (non-disabled) population [9]. People with physical disabilities and/or chronic diseases, such as those with stroke, Parkinson’s disease, and chronic obstructive pulmonary disease, might have a different pattern of locomotion (e.g. slower and/or asymmetrical) [16,17,18]. Also, people with physical disabilities and/or chronic diseases could have a different energy expenditure during PA compared to people without physical disabilities and/or chronic diseases, due to a lower efficiency of walking or other motor actions in general [19,20,21] or due to an increased energy cost of daily activities [22]. This could affect the validity of the algorithms used in device-based PA instruments when applied to people with physical disabilities and/or chronic diseases. Research has already shown that slower walking speeds limit the validity of measuring steps using certain devices [23, 24]. Furthermore, energy expenditure estimations of devices had poor correlations with estimations of indirect calorimetry in people with stroke [25]. These findings warrant a critical mapping of the measurement properties of device-based instruments used to assess PA in people with physical disabilities and/or chronic diseases.

Previous reviews have addressed the measurement properties of device-based instruments in people with physical disabilities and/or chronic diseases. However, these are mostly either diagnosis- or PA-outcome specific [25,26,27,28,29]. Also, manual wheeled mobility involves a completely different class of bodily activities, with different energetic consequences, than walking. A recent systematic review gave an extensive overview of the measurement properties of device-based and self-reported instruments assessing PA in people using a wheelchair [30]. Therefore, the current review focused on the ambulatory population of adults with physical disabilities and/or chronic diseases.

This scoping review aims to provide a critical mapping of the existing literature on the measurement properties of device-based instruments assessing physical activity behavior in ambulant adults with various physical disabilities and/or chronic diseases. Using this critical mapping, we provide future directions to study the measurement properties of device-based instruments assessing PA in ambulatory adults with physical disabilities and/or chronic diseases.

Methods

Study design

This scoping review was guided by the methodological framework for scoping reviews [31, 32] and the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guideline [33]. A scoping review was chosen as it can be used to summarize research findings and potentially identify research gaps in the literature, which matches our aim. The study protocol is available at https://osf.io/c27xv/. During the review process, we deviated from the published protocol. In Supplementary file 1 we report the reason and the nature of these deviations. In short, we deviated from the protocol in three main ways: 1) because of the large amount of research, we changed the scope of the review from all literature on both device-based and self-reported instruments into only device-based instruments in a set time period; 2) we therefore changed the review question accordingly; and 3) we changed the method from a systematic into a scoping review.

Following the aim and scope of the original protocol, we defined the following PICO criteria: (P) Adults (≥ 18 years old) with physical disabilities and/or chronic diseases. Physical disability was defined as a congenital disease, acquired illness, or trauma that causes an impairment, activity limitation and participation restriction that lasts at least 1 year [34, 35]. Chronic disease was defined broadly as conditions that last 1 year or more and require ongoing medical attention or limit activities of daily living or both [36]. (I) Physical activity measurement instrument. Physical activity measurement instrument was defined as a device-based or self-report instrument that assesses any bodily movement produced by the muscles that results in increased energy expenditure [1] in the activity domain of the International Classification of Functioning, Disability and Health (ICF) model [35]. (C) We did not use a comparison group, since this is not relevant for studies on measurement properties. (O) Measurement properties (e.g. reliability, validity, responsiveness). Operationalization of measurement properties followed the definitions of COSMIN [37].

Search strategy and information sources

Together with an information specialist (KS), we combined the three different concepts of our PICO to create our search terms: physical activity measurement instrument, physical disability and/or chronic disease and measurement properties. We used a combination of both MeSH-terms and free text words for each concept, linked with Boolean operators. Literature was initially searched up to June 26th 2019, with a first update of the search up to November 20th 2020, and a second update of the search up to April 16th 2023 in four databases: Medline, Cinahl, Web of Science and Embase. We adapted the search strategy for each database using the same keywords and, where possible, MeSH-terms. The full search strategies for each of the four databases can be found in Supplementary file 2.

Eligibility criteria

Articles were eligible for inclusion in the scoping review when 1) included participants were 18 years or older and had a physical disability or chronic disease, with the physical disability or chronic disease being a primary reason for rehabilitation treatment; 2) PA was measured as an amount or energy cost using a self-reported or device-based instrument; 3) measurement properties were a (primary or secondary) outcome measure of the studies; 4) articles were published in peer-reviewed journals and involved primary research. Articles were excluded when 1) studies were not in humans; 2) participants had an intellectual, sensory, cognitive or mental disability; 3) all included participants were wheelchair users; 4) PA was measured as a functional or a performance outcome; 5) articles were not in English or Dutch. We excluded literature studying participants with intellectual, sensory, cognitive or mental disabilities, as these studies may require different approaches and interpretations compared to studies involving people with physical disabilities and/or chronic diseases. As the authors are proficient in Dutch and English only, we excluded all non-English/Dutch articles.

Selection of sources of evidence

Before screening, duplicates were removed using Bramer et al.’s method [38] in EndNote. Two researchers independently screened titles (PB & LAK) and subsequently abstracts (PB & IB) for eligibility using custom Excel spreadsheets. Disagreement was resolved by including those articles in the next phase. For the title and abstract phase, pilot-tested checklists with specific instructions for in- and exclusion were used. During the abstract screening phase, regular meetings were held to ensure equal interpretation of the abstracts between both researchers and to discuss uncertainties. Before full text screening, articles were removed that used self-reported PA instruments or were published before 2015. We did this due to the change of focus (on device-based instruments only) of the review after the abstract phase (see Supplementary file 1).

Eligibility of full texts was screened by two researchers independently (PB & IB), using a checklist for full text eligibility and a custom Excel spreadsheet. Disagreements were discussed, and if necessary, a third assessor (LAK) was consulted. Cohen’s Kappa statistics were calculated to assess the agreement between the two screeners for the title, abstract and full text phase [39]. For feasibility reasons, the second update was performed by one researcher (PB) only. A second researcher (LAK) was consulted in case of questions and doubt with respect to the interpretation of the study. The PICO, in- and exclusion criteria and complete checklists per phase can be found in Supplementary file 3. The used custom Excel spreadsheets can be found on Open Science Framework (https://osf.io/c27xv/).
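As an illustration of the agreement statistic mentioned above, the following minimal sketch computes Cohen's Kappa for two screeners who each label the same records as include or exclude. The function name and the ten screening decisions are hypothetical and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same, non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten records (illustration only).
screener_1 = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "exclude"]
screener_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "include", "exclude", "include", "include", "exclude"]
kappa = cohens_kappa(screener_1, screener_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this made-up example the screeners agree on 8 of 10 records, yet Kappa is clearly lower than the raw agreement because part of that agreement is expected by chance.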

Data charting process

The first author (PB) extracted data using an extraction form in Excel (available at Open Science Framework: https://osf.io/c27xv/). The data extraction form included the following information: 1) publication data (author, year of publication, country of origin); 2) study data (design, setting, sample size, and protocol tasks); 3) study population (diagnosis group(s), age, gender, and walking speed); 4) device (name, type, placement, unit of measurement, epoch length, sampling rate, and algorithm used); 5) studied measurement properties (validity, reliability, or responsiveness) and criterion measure (name, type, unit of measurement, algorithm used); and 6) study outcomes.

Synthesis of results

We synthesized the data based on device. For each device, the available measurement properties were presented using the following ordering: 1) PA outcome; 2) diagnosis group; 3) study; 4) device placement; and 5) algorithm. We separated research-grade devices from consumer-grade devices.
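The ordering described above can be sketched as a nested sort key. This is only an illustration under assumed names: the `Finding` record, its fields and the three example rows are hypothetical and do not reproduce the extraction form or any result of this review.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """One extracted result row (hypothetical structure)."""
    device: str
    grade: str  # "research" or "consumer"
    pa_outcome: str
    diagnosis_group: str
    study: str
    placement: str
    algorithm: str


def synthesis_order(findings):
    """Sort research-grade before consumer-grade, then by device,
    PA outcome, diagnosis group, study, placement and algorithm."""
    return sorted(
        findings,
        key=lambda f: (
            f.grade != "research",  # False sorts first, so research-grade leads
            f.device,
            f.pa_outcome,
            f.diagnosis_group,
            f.study,
            f.placement,
            f.algorithm,
        ),
    )


# Made-up rows for illustration only.
rows = [
    Finding("Device B", "consumer", "steps", "stroke", "study B", "waist", "proprietary"),
    Finding("Device A", "research", "steps", "stroke", "study A", "waist", "proprietary"),
    Finding("Device A", "research", "energy expenditure", "stroke", "study C", "wrist", "custom"),
]
ordered = synthesis_order(rows)
for row in ordered:
    print(row.grade, row.device, row.pa_outcome, row.study)
```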

Results

Figure 1 shows a flowchart of the screening and review process. A total of 21566 records were identified through the search. After removing duplicates and publications categorized as non-primary research, 13219 records were screened on title. Based on title, we excluded 10752 records. We screened the remaining records on abstract, and excluded 1725 records. A further 403 records were excluded, as they were published before 2015 or used self-report measurement instruments for physical activity. The remaining 287 records were read in full. Of these, we excluded 184 records that did not meet the eligibility criteria, which resulted in a total of 103 studies included in this review. Agreement of the initial search and first update for title, abstract and full text screening was moderate (title phase: Cohen’s Kappa = 0.68, agreement = 78%; abstract phase: Cohen’s Kappa = 0.55, agreement = 82%; full text phase: Cohen’s Kappa = 0.57, agreement = 78%).

Fig. 1 Flowchart of screening and review process of included studies on device-based instruments assessing physical activity. n = number of studies

Characteristics of the included studies are shown in Table 1. In total, 23 different physical disabilities and/or chronic diseases were included in the studies. Most studies included people with stroke (n = 27) [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66], chronic obstructive pulmonary disease (n = 11) [67,68,69,70,71,72,73,74,75,76,77] and multiple sclerosis (n = 10) [78,79,80,81,82,83,84,85,86,87]. Six studies included a mixed population of people with different physical disabilities and/or chronic diseases [23, 75, 77, 88,89,90]. Sample sizes ranged from 4 to 176, with a median of 28. The majority of studies were performed in Northern America (USA, n = 28 [51, 64, 69, 70, 72, 74, 76, 83,84,85, 91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107]; Canada, n = 10 [40, 47, 50, 52, 53, 89, 108,109,110,111]) and Western Europe (UK, n = 11 [78, 80, 82, 86, 112,113,114,115,116,117,118]; France, n = 8 [42,43,44,45, 55, 119]; the Netherlands, n = 6 [48, 75, 77, 120,121,122]; Germany, n = 4 [68, 87, 123, 124]; Switzerland, n = 4 [66, 81, 125, 126]; Denmark, n = 3 [127,128,129]; Belgium, n = 2 [67, 88]; Italy, n = 2 [56, 130]; Sweden, n = 2 [71, 79]; Ireland, n = 1 [131]; Portugal, n = 1 [132]). Only 14 studies were performed in other countries (Brazil, n = 6 [46, 49, 57, 62, 63, 133]; Japan, n = 4 [59, 73, 134, 135]; Australia, n = 3 [60, 90, 136]; Czech Republic, n = 1 [65]). 
Of the 103 included studies, 65 were performed in a laboratory setting with protocolled activities [23, 40,41,42,43,44,45,46, 49, 51,52,53,54,55,56,57,58,59, 61,62,63,64,65,66, 70, 72, 75, 78,79,80, 83, 86, 88,89,90, 92, 93, 95,96,97, 101, 103, 104, 107, 109, 111,112,113,114,115, 119, 120, 122, 123, 125, 126, 128,129,130,131,132,133, 137,138,139], 28 during free-living (activities of own choice) [50, 60, 67, 68, 71, 73, 76, 82, 87, 91, 94, 98,99,100, 102, 105, 106, 108, 110, 117, 121, 124, 127, 134,135,136, 140, 141], nine in a combined laboratory and free-living setting [47, 48, 69, 77, 81, 84, 85, 116, 118], and one in the home setting in which participants had to perform a set of protocolled activities [74]. Walking speed of the participants was on average slow, with speeds predominantly below 1.0 m/s. Supplementary file 4 provides an extended version of Table 1. This table provides extra information on important in- and exclusion criteria, the tasks performed, and criterion for valid measurement days and cases (for studies performed in a free-living setting).

Table 1 Descriptives of the 103 included studies

In total, 78 different PA devices from 43 different companies were studied on their measurement properties. In 39 studies multiple devices were used and compared [23, 43, 44, 46, 49, 51, 54, 55, 57, 58, 63, 64, 67, 70, 75, 79,80,81, 83, 84, 89, 92,93,94,95,96,97, 101, 103, 107, 112, 115, 116, 118, 122, 132, 133, 137, 141]. Twenty-three devices were research-grade and 55 were consumer-grade. The most frequently studied research-grade devices were from the companies ActiGraph (n = 28 studies) [23, 40, 43,44,45, 49, 51, 55, 61, 64, 76, 79, 81, 84, 89, 93,94,95,96, 104, 105, 107, 108, 112, 114,115,116] and PAL technology (n = 8 studies) [23, 54, 86, 91, 95, 116, 131, 138]. The most frequently studied consumer-grade devices were from the companies Fitbit (n = 39 studies) [23, 41, 46, 47, 50, 52, 53, 58, 60, 64, 65, 67, 74, 75, 80, 81, 83,84,85, 90, 92, 94, 97,98,99, 101,102,103, 106, 109, 112, 118, 122, 127, 133, 136, 137, 140, 141] and Garmin (n = 10 studies) [23, 58, 66, 80, 97, 101, 107, 130, 137, 141].

With respect to measurement properties, 97 studies determined validity [23, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90, 92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110, 112, 114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129, 131,132,133,134, 136,137,138, 140, 141], 11 studies determined reliability [46, 54, 58, 66, 91, 105, 106, 111, 113, 118, 135] and six studies determined responsiveness [82, 100, 105, 106, 118, 136]. The measurement properties of 14 different PA outcomes were studied. Step count was the most frequently studied PA outcome (n = 68) [23, 40, 41, 46, 47, 50, 52,53,54, 56,57,58, 63,64,65,66,67,68,69, 74, 75, 79,80,81,82,83,84,85,86, 89,90,91,92,93,94,95,96,97,98, 101,102,103,104,105,106,107,108,109, 111, 112, 116,117,118, 121, 123, 124, 126,127,128,129,130,131,132,133, 136, 137, 140, 141], followed by energy expenditure (n = 19) [42, 43, 45, 49, 51, 55, 61, 62, 70, 71, 82, 88, 96, 114, 115, 119, 122, 125, 134] and activity time (n = 15) [48, 54, 68, 80,81,82, 86, 91, 95, 100, 116, 117, 120, 131, 138]. In the majority of studies (n = 60), PA was measured by means of only walking tasks or by using walking-related PA outcomes (e.g. steps, walked distance) [23, 40, 41, 43, 44, 46, 47, 49, 52,53,54, 56,57,58, 61, 62, 64,65,66,67, 69, 72, 74, 75, 77, 78, 83,84,85, 89, 90, 92, 93, 97, 98, 101, 103, 104, 107,108,109, 112, 113, 115, 119, 121, 123, 126,127,128,129,130, 132, 133, 136,137,138,139, 141].

The proprietary algorithm of the instrument was most frequently used, or the algorithm used was not reported at all. A population-specific custom algorithm was used in three research-grade and three consumer-grade devices. Devices were positioned at 15 different body positions, with the positions at the ankle, thigh, waist and wrist as most common. One device (Medtronic ICD/CRT device) was a type of pacemaker, and was surgically implanted in patients with heart failure. Validity was measured using 21 different statistical methods, reliability with three different methods, and responsiveness with five methods.

Table 2 provides an overview of the measurement properties of the research-grade devices, per PA outcome, study population, device properties (placement of the device, used algorithms) and outcome (used statistical test, result). Table 3 provides the same overview for the consumer-grade devices. Supplementary files 5 and 6 contain a more in-depth version of both tables, with extra information such as epoch length, sampling rate and results per condition.

Table 2 Overview of research grade devices evaluated on their measurement properties in the 52 studies
Table 3 Overview of consumer grade devices evaluated on their measurement properties in the 67 studies

Research-grade devices

ActiGraph

Measurement properties of a type of ActiGraph were determined in 28 studies, with 24 studies evaluating type GT3 [23, 40, 43,44,45, 49, 51, 55, 61, 64, 81, 84, 89, 94,95,96, 104, 105, 108, 112, 114,115,116, 139] and four studies evaluating type GT9 [76, 79, 93, 107] (Table 2). Only validity was measured in these 28 studies, with 27 determining criterion validity and one determining construct validity [105]. For the GT3, the criterion validity of energy expenditure, steps, time spent in intensity zones, time in activities, distance walked, metabolic equivalent (MET) and activity counts, and the construct validity for steps and vector magnitude, were measured in 12 unique diagnosis groups and one mixed group with variable diagnoses. Four studies applied custom-created algorithms [61, 114, 115, 139], two studies applied both a custom and a proprietary algorithm [43, 61], two studies did not report on used algorithms [45, 55] and the other studies used proprietary algorithms (n = 21), with Freedson [142] the most commonly reported. The GT3 was placed at five different body regions (ankle, upper arm, thigh, waist and wrist), at both the affected and unaffected side (for diagnosis groups that may suffer from unilateral impairment, e.g. stroke, unilateral amputation). The GT9 was studied on criterion validity of steps and sedentary time in 5 different diagnosis groups, placed on the ankle, waist or wrist. Three studies used one or more proprietary algorithms [76, 79, 93], and one study did not report on the used algorithm [107]. The used epoch length of the instruments ranged from 0.033 s to 60 s, or it was not reported. Sampling rate was set at 10 Hz (1 study [45]), 30 Hz (14 studies [40, 44, 49, 51, 64, 76, 81, 84, 113, 115, 116, 117, 140] and Compagnat, 2022), 50 Hz (1 study [107]), 90 Hz (1 study [79]), 100 Hz (2 studies [93, 104]), or it was not reported (9 studies [23, 43, 55, 89, 94,95,96, 105, 108]).
The criterion validity was measured with 13 different statistical tests (among others: Pearson’s r, Spearman’s rho, intraclass correlation coefficient (ICC), Bland–Altman level of agreement, % accuracy). The results varied widely, with correlations between 0.004 and 0.97 and accuracy between 43.0% and 81.4%. This large variability was found among different PA outcomes, but also within PA outcomes.
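To make one of the listed agreement statistics concrete, the sketch below computes the Bland–Altman bias and 95% limits of agreement between a device and a criterion measure. It is a minimal illustration assuming approximately normally distributed differences; the function name and the step counts are hypothetical and are not taken from any included study.

```python
from statistics import mean, stdev


def bland_altman(device, criterion):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    if len(device) != len(criterion) or len(device) < 2:
        raise ValueError("need at least two paired measurements")
    # Per-participant difference: device minus criterion.
    diffs = [d - c for d, c in zip(device, criterion)]
    bias = mean(diffs)   # systematic over- or underestimation by the device
    sd = stdev(diffs)    # sample standard deviation of the differences
    # 95% limits of agreement under the normality assumption.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical step counts for five participants (illustration only).
criterion_steps = [100, 120, 90, 150, 110]   # e.g. manually counted steps
device_steps = [95, 122, 78, 160, 107]       # e.g. device-reported steps
bias, lower, upper = bland_altman(device_steps, criterion_steps)
print(f"bias = {bias:.1f} steps, limits of agreement = [{lower:.1f}, {upper:.1f}]")
```

A bias close to zero combined with wide limits of agreement would indicate that the device is accurate on average but imprecise for individual participants.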

PAL technologies

The devices of PAL technologies were evaluated in eight studies, with six studies evaluating the ActivPAL [23, 54, 91, 95, 131, 138] and two studies evaluating the ActivPAL3 [86, 116] (Table 2). Criterion validity for steps, time spent in different activities or MET was measured in seven studies [23, 54, 86, 95, 116, 131, 138] in five unique diagnosis groups and one mixed group with variable diagnoses. Test–retest reliability was measured for steps, time spent in different activities and MET in two studies [54, 91] in two unique diagnosis groups. One study did not report the used algorithm [138]; the other seven used proprietary algorithms. All studies placed the device on the thigh. The used epoch lengths were 0.1 s [91], 1 s [95] and 15 s [54, 131, 138]. Three studies did not report the epoch length [23, 86, 116]. Sampling rate was set at 10 Hz [54, 91] or 20 Hz [86], or was not reported [23, 95, 116, 131, 138]. Test–retest reliability was measured as ICC, ranging from 0.654 to 0.997, and as absolute percentage error, ranging from 3.3% to 6.5%, depending on the PA outcome, diagnosis group and task. Criterion validity was measured as Pearson’s r, ICC, Bland–Altman level of agreement, percentage accuracy and percentage error, with correlations between 0.65 and 0.99, accuracy between 90.7% and 100% and error between 0.3% and 3.1%, all depending on the PA outcome, diagnosis group and task.

Consumer-grade devices

Fitbit

Eleven different types of Fitbits were evaluated: Alta (n = 4 studies) [65, 67, 80, 98], Charge (n = 3 studies) [23, 101, 137], Charge 2 (n = 5 studies) [94, 97, 118, 122, 141], Flex (n = 9 studies) [60, 75, 83,84,85, 99, 101, 109, 140], Flex 2 (n = 2 studies) [84, 136], Inc (n = 1 study) [133], One (n = 12 studies) [23, 47, 50, 52, 53, 64, 75, 83, 92, 97, 102, 106], Surge (n = 1 study) [103], Ultra (n = 1 study) [46] and Zip (n = 9 studies) [41, 58, 67, 74, 80, 90, 103, 112, 127] (Table 3). Criterion validity was measured for steps, energy expenditure, MET, time spent in different intensity zones, time spent in different activities and distance walked by 38 studies in 15 unique diagnosis groups, and three mixed groups with variable diagnoses. Convergence validity of the Alta was measured in one study for steps in cancer patients [98]. Test–retest reliability of the Inspire (n = 1 study) [118], One (n = 1 study) [106] and the Zip (n = 1 study) [58] was measured for steps, MET and time spent in different intensity zones in patients with stroke, myositis or progressive muscle diseases. Responsiveness was measured for the Flex 2 (n = 1 study), Inspire (n = 1 study) and One (n = 1 study) for steps, MET and time spent in different intensity zones in patients with osteoarthritis, myositis or progressive muscle diseases. The Charge, Charge 2, Flex, Flex 2, Surge and Ultra were positioned at the wrist or it was not reported, the Alta at the lower limb, waist or wrist, the One at the ankle or waist, and the Zip at the foot, the waist or the midline of a shirt. Devices were placed at both the affected and unaffected side (for diagnosis groups that may suffer from unilateral impairment). One study used a custom algorithm [94]; the other studies either used proprietary algorithms or did not report the used algorithm.
Criterion validity of the Fitbits was measured with 13 different statistical tests, with correlations ranging from -0.236 to 0.99 and mean percentage errors ranging from 1.9% to 84.9%. Convergence validity, measured with concordance correlation coefficient, was smaller than 0.001 compared with a questionnaire. Test–retest reliability, measured with ICC, ranged from 0.78 to 0.97. Responsiveness was measured with area under the curve (0.72 to 0.90) or correlation (-0.28 to 0.63).

Garmin

Six different types of Garmin devices were evaluated: Forerunner 35 (n = 1 study) [66], Vivofit (n = 5 studies) [23, 58, 101, 137, 141], Vivofit 3 (n = 2 studies) [107, 141], Vivofit 4 (n = 1 study) [80], Vivosmart 3 (n = 1 study) [97] and Vivosmart 4 (n = 1 study) [130] (Table 3). Studies measured criterion validity for steps and time spent in different activities in five unique diagnosis groups and one mixed group with variable diagnoses. Test–retest reliability of the Forerunner 35 and Vivofit was measured for steps in a stroke population. All devices were worn on the wrist, with the Vivofit 3 also worn on the ankle in one study [107]. One study used the proprietary algorithm [130]; the other studies did not report on the used algorithm. Sampling rate and epoch length were not reported for the devices. Criterion validity was measured using 5 different statistical tests (ICC, concordance correlation coefficient, Bland–Altman level of agreement, percentage error and mean absolute percentage error). Correlations ranged from 0.12 to 0.97, depending on the device, PA outcome and task. Test–retest reliability was measured using ICC, ranging from 0.86 to 0.99.
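The difference between percentage error and mean absolute percentage error, both listed above, can be shown with a short sketch. The function names and the three paired step counts are hypothetical; the sketch assumes the criterion values are non-zero.

```python
def percentage_errors(device, criterion):
    """Signed percentage error of the device relative to the criterion, per pair."""
    if len(device) != len(criterion) or not device:
        raise ValueError("need equally long, non-empty lists")
    if any(c == 0 for c in criterion):
        raise ValueError("criterion values must be non-zero")
    return [(d - c) / c * 100 for d, c in zip(device, criterion)]


def mean_percentage_error(device, criterion):
    """Average signed error: over- and underestimates can cancel out."""
    errors = percentage_errors(device, criterion)
    return sum(errors) / len(errors)


def mean_absolute_percentage_error(device, criterion):
    """Average error size regardless of direction."""
    errors = percentage_errors(device, criterion)
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical step counts (illustration only).
criterion_steps = [200, 100, 50]
device_steps = [190, 110, 40]
mpe = mean_percentage_error(device_steps, criterion_steps)
mape = mean_absolute_percentage_error(device_steps, criterion_steps)
print(f"mean percentage error = {mpe:.1f}%, MAPE = {mape:.1f}%")
```

Because the signed errors partly cancel, the mean percentage error in this made-up example is smaller in magnitude than the mean absolute percentage error, which is one reason the two statistics are not directly comparable across studies.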

Discussion

This scoping review provides a critical mapping of the research on measurement properties (validity, reliability and responsiveness) of device-based instruments assessing PA in ambulatory adults with disabilities and/or chronic diseases. The results show a large variability in research on the measurement properties of device-based instruments assessing PA in adults with physical disabilities and/or chronic diseases. Predominantly, different forms of validity are assessed in a total of 78 different research- and consumer-grade devices using 14 different PA outcomes in 23 different diagnosis groups. There is large variability in measurement properties within and between instruments and studies. The ActiGraph devices are the most frequently studied research-grade devices, and the Fitbit devices are the most frequently studied consumer-grade devices.

PA outcomes

PA behavior is assessed with a variety of different PA outcomes. The most commonly used PA outcome is step count, comparable to previous reviews on the use of device-based PA instruments [143,144,145]. However, step count informs only about walking and walking-related tasks and does not give information on the intensity and duration of PA behavior from a broader perspective. Even when step count is not used as the PA outcome, we have found that studies mostly use walking-related tasks to study the measurement properties. As a result, the available evidence supports valid and reliable measurement only for walking, and not for other modes of PA behavior such as cycling and swimming.

The importance of frequency, intensity and duration of PA is stressed by the guidelines for PA, which typically include statements on the frequency and duration in certain intensities needed for achieving optimal health benefits [146, 147]. Energy expenditure and intensity time are PA outcomes that take two of these dimensions into account (i.e. intensity and duration). However, the trend visible in this scoping review is that incorporating intensity in the PA outcome results in lower validity outcomes. As intensity depends on the cut-off points and algorithms used [148], and these are mostly developed for a general population [9], this finding is not surprising. Custom-made disease-specific algorithms could be a solution to increase validity outcomes. In the eight studies using custom algorithms in five different instruments, generally moderate to good values of validity are found [43, 61, 73, 94, 114, 115, 125, 134]. However, just two of these studies compare a custom disease-specific algorithm with a proprietary algorithm, reporting increased validity for the custom algorithm [43, 61]. More research is needed that compares custom disease-specific algorithms with proprietary algorithms.
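The dependence of intensity time on the chosen cut-off points can be illustrated with a small sketch. The thresholds below are arbitrary placeholders and are not the values of any published or proprietary algorithm; the function name and the activity counts are likewise hypothetical.

```python
def minutes_per_zone(counts_per_minute, cut_points):
    """Count one-minute epochs per intensity zone.

    cut_points is a list of (lower_bound, zone_name) pairs in ascending
    order, starting at a lower bound of 0.
    """
    if not cut_points or cut_points[0][0] != 0:
        raise ValueError("cut_points must start at a lower bound of 0")
    minutes = {name: 0 for _, name in cut_points}
    for cpm in counts_per_minute:
        if cpm < 0:
            raise ValueError("activity counts cannot be negative")
        # Keep the highest zone whose lower bound the epoch reaches.
        zone = cut_points[0][1]
        for lower, name in cut_points:
            if cpm >= lower:
                zone = name
        minutes[zone] += 1
    return minutes


# Placeholder thresholds (assumptions for illustration only).
general_cut_points = [(0, "sedentary"), (100, "light"), (2000, "moderate"), (6000, "vigorous")]
adapted_cut_points = [(0, "sedentary"), (100, "light"), (1200, "moderate"), (4000, "vigorous")]

# Hypothetical activity counts for six one-minute epochs.
epochs = [50, 300, 1500, 1800, 2500, 900]
general = minutes_per_zone(epochs, general_cut_points)
adapted = minutes_per_zone(epochs, adapted_cut_points)
print("general:", general)
print("adapted:", adapted)
```

With identical raw data, the lower placeholder thresholds classify more minutes as moderate intensity, which shows how the choice of cut-off points alone can shift an intensity-based PA outcome.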

When using intensity time and energy expenditure as PA outcomes only, information on how and where PA is being performed is not acquired. This information can be of importance for rehabilitation specialists and policymakers to identify possibilities to improve PA behavior in people with physical disabilities and/or chronic diseases. The how (or mode) of PA can be measured using activity time. This outcome is used by 15 studies, with a variety of outcomes on measurement properties [48, 54, 68, 80,81,82, 86, 91, 95, 100, 116, 117, 120, 131, 138]. As device-based PA instruments only capture movement or acceleration of the body, the where (or context) of PA cannot be measured with these instruments [15]. Self-report instruments can fill this gap, hence the consensus that both self-report and device-based PA instruments should be used in complement to each other [12, 14]. In conclusion, we can say that different PA outcomes have different advantages and disadvantages, but none of the device-based PA outcomes is able to capture the complete construct of PA (i.e. setting, mode, intensity, duration, frequency). This requires future research consideration.

Population

Most of the studies on measurement properties of device-based PA instruments are conducted in diagnosis-specific populations, and only six studies concerned a mixed population including people with different physical disabilities and/or chronic diseases [23, 75, 77, 88,89,90]. People with different diagnoses may suffer from different walking-related complications [19,20,21,22], which could have an effect on measurement properties of device-based PA instruments (e.g. frequency spectrum of accelerations, energetic cost and efficiency of movement/activities). Thus, a diagnosis-specific approach in these studies seems logical. However, this diagnosis-specific focus does have the drawback that it lacks generalizability to other types of physical disabilities and/or chronic diseases. It might be of interest to conduct studies using a functioning-specific focus, in line with the ICF model [35]. Functional limitations may differ between individuals within diagnosis groups, and different diagnoses might share problems with functioning, such as slower and asymmetrical gait [16,17,18], which can influence the measurement properties of PA devices [24]. Studies using this functioning-specific approach can give insight into PA devices with good measurement properties for multiple physical disabilities and/or chronic diseases. This is of relevance as monitoring and measuring PA is important for all physical disabilities and/or chronic diseases. As self-monitoring is an important behavior change technique [8], a PA device that is valid and reliable for a variety of people with physical disabilities and/or chronic diseases might increase feasibility of PA promoting interventions for people with physical disabilities and/or chronic diseases. The same can be suggested for the rehabilitation setting, in which a variety of patient groups are treated.
Correct measurement and monitoring of PA in the rehabilitation setting can lead to a more tailored approach to improve PA behavior, which ultimately may improve health and functioning [149].

Measurement properties and statistics

The criterion validity of the device-based PA instruments is the most commonly studied measurement property. Besides criterion validity, only 11 studies on (test–retest) reliability [46, 54, 58, 66, 91, 105, 106, 111, 113, 118, 135] and six studies on responsiveness are included [82, 100, 105, 106, 118, 136]. Good reliability of a device-based PA instrument is needed for suitable clinical application, to ensure that a change in PA behavior over time reflects an actual change instead of measurement error. Good responsiveness is a prerequisite for measuring the effectiveness of PA promotion in clinical care. During our search, we found studies that investigated the number of days needed for reliable measurement of PA using devices in free-living settings [150,151,152,153]. Although this is important information, it is not considered a measurement property since it does not provide information on the measurement error and the extent to which repeated measurement outcomes are the same for people who have not changed [37].
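
To illustrate why measurement error matters when interpreting change over time, the following minimal sketch derives a standard error of measurement (SEM) and a smallest detectable change (SDC) from test–retest data. The step counts are invented, the function name is hypothetical, and the formulas (SEM = SD of the paired differences / √2; SDC = 1.96 × √2 × SEM) are one commonly used formulation, not a method taken from the included studies.

```python
import math
import statistics


def sem_and_sdc(test, retest):
    """Return (SEM, SDC) for paired test-retest scores of stable participants."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    # Difference between the second and the first measurement per participant.
    differences = [second - first for first, second in zip(test, retest)]
    sd_diff = statistics.stdev(differences)
    # Assumed formulation: SEM from the SD of the paired differences.
    sem = sd_diff / math.sqrt(2)
    # Smallest change in one individual that exceeds measurement error (95%).
    sdc = 1.96 * math.sqrt(2) * sem
    return sem, sdc


# Invented daily step counts for six participants whose PA did not change.
test_day = [4200, 6100, 3500, 8000, 5200, 4700]
retest_day = [4500, 5800, 3900, 7600, 5400, 4400]

sem, sdc = sem_and_sdc(test_day, retest_day)
print(f"SEM: {sem:.0f} steps/day, SDC: {sdc:.0f} steps/day")
```

Under these assumptions, an observed change in an individual that is smaller than the SDC cannot be distinguished from measurement error.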

There is a large variety of statistical methods used to study the measurement properties of the different devices, which makes it difficult to compare the different studies. Most studies included in this review assessed criterion validity and test–retest reliability, for which methods of correlational nature are recommended [154]. Techniques that compare means (e.g. t-test and analysis of variance) are inappropriate in studies on measurement properties, since they test for a difference (from a criterion measure or between two measurements) rather than quantify agreement [37]. Still, a number of included studies did not use the appropriate statistical methods according to the international standards of the COSMIN group.
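
The distinction between testing a difference and quantifying agreement can be made concrete with a small sketch. The data below are invented, the function names are hypothetical, and Bland–Altman-style limits of agreement are used purely as an illustrative agreement statistic; they are not presented here as the method recommended by the COSMIN group.

```python
import math
import statistics


def paired_differences(criterion, device):
    """Device minus criterion for each participant."""
    return [d - c for c, d in zip(criterion, device)]


def paired_t_statistic(criterion, device):
    """t statistic of a paired t-test (mean difference / its standard error)."""
    diffs = paired_differences(criterion, device)
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def limits_of_agreement(criterion, device):
    """95% limits of agreement: mean difference +/- 1.96 * SD of differences."""
    diffs = paired_differences(criterion, device)
    half_width = 1.96 * statistics.stdev(diffs)
    mean_diff = statistics.mean(diffs)
    return mean_diff - half_width, mean_diff + half_width


# Invented activity minutes: the device is off by 50 minutes for every
# participant, but over- and underestimation cancel out on average.
criterion_minutes = [100, 200, 300, 400]
device_minutes = [150, 150, 350, 350]

t_stat = paired_t_statistic(criterion_minutes, device_minutes)
lower, upper = limits_of_agreement(criterion_minutes, device_minutes)
print(f"paired t statistic: {t_stat:.2f}")
print(f"limits of agreement: {lower:.0f} to {upper:.0f} minutes")
```

In this constructed case the comparison of means finds no difference at all, while the individual errors are large. A non-significant t-test therefore says little about whether a device agrees with its criterion.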

Technical decisions

When using device-based PA instruments in research or clinical practice, numerous choices about data collection and data processing need to be made. All these choices could influence the measurement properties. First, one needs to think about the placement of the device on the body. Multiple studies showed the influence of placement of the device on measurement properties [23, 40, 44, 45, 51, 53, 55, 56, 58, 65, 89, 96, 107, 112, 114, 120, 128], with no clear advantage to a single location. Algorithms and cut-off points are developed with a certain placement in mind, and are not interchangeable between placements [149, 155], explaining at least part of the influence of placement on measurement properties. Secondly, epoch length and sampling rate should be considered when using PA measurement devices. Previous studies have shown that different epoch lengths result in differences in PA outcomes [15, 156]. However, none of the reviewed studies have looked at the influence of epoch length on measurement properties. Furthermore, in a large number of studies (n = 25 in research-grade devices, n = 59 in consumer-grade devices) the epoch length used is not reported. The same is found for sampling rate, which is also not always reported. Therefore, we cannot make recommendations on the optimal epoch length and sampling rate. However, for the use of device-based instruments in practice, one needs to critically weigh considerations such as accuracy versus storage capacity. Thirdly, another important choice is the algorithm used to convert the measured accelerations of movement into interpretable PA outcomes. Applying different general algorithms could lead to differences in measurement properties, which is shown by the three studies that compared multiple algorithms [49, 71, 76]. As mentioned previously, custom-made disease-specific algorithms could also influence the measurement properties when using intensity-based PA outcomes [43]. For research and clinical use, we suggest applying an algorithm that is evaluated for the specific population and activity level. However, based on our findings we cannot recommend certain algorithms, as this is beyond the scope of this review. Considering the effect of these technical choices on PA outcomes and the measurement properties of the device-based instruments, Burchartz et al. already stated in their state of science paper on device-based PA instruments that all important technical decisions (such as placement on the body, the used epoch length, sampling rate and algorithm) should be reported in studies on measurement properties [15]. As it is apparent from this review that reporting the technical decisions is not common practice in studies on measurement properties, we wholeheartedly support this recommendation.
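
How epoch length alone can change a PA outcome is sketched below. Everything in this example is an assumption made for illustration: the activity counts are invented, the function name is hypothetical, the cut-point of 1200 counts per minute does not come from any published algorithm, and scaling the cut-point linearly with epoch length is a simplification.

```python
def seconds_above_cutpoint(counts, base_epoch_s, epoch_s, cutpoint_per_min):
    """Time (s) in epochs whose summed counts reach the scaled cut-point.

    counts           -- activity counts per base epoch (invented data)
    base_epoch_s     -- length of the stored base epoch in seconds
    epoch_s          -- analysis epoch length, a multiple of base_epoch_s
    cutpoint_per_min -- hypothetical intensity cut-point in counts per minute
    """
    if epoch_s % base_epoch_s != 0:
        raise ValueError("epoch_s must be a multiple of base_epoch_s")
    samples_per_epoch = epoch_s // base_epoch_s
    # Simplifying assumption: the cut-point scales linearly with epoch length.
    threshold = cutpoint_per_min * epoch_s / 60
    total_seconds = 0
    # Only complete epochs are evaluated; a trailing partial epoch is ignored.
    for start in range(0, len(counts) - samples_per_epoch + 1, samples_per_epoch):
        if sum(counts[start:start + samples_per_epoch]) >= threshold:
            total_seconds += epoch_s
    return total_seconds


# Two minutes of invented counts per 10 s: one short burst, then steady activity.
counts_per_10s = [300, 0, 0, 0, 0, 0, 300, 300, 300, 300, 300, 300]

short_epochs = seconds_above_cutpoint(counts_per_10s, 10, 10, 1200)
long_epochs = seconds_above_cutpoint(counts_per_10s, 10, 60, 1200)
print(f"10-s epochs: {short_epochs} s above cut-point")
print(f"60-s epochs: {long_epochs} s above cut-point")
```

With these invented data, the short burst in the first minute is counted when 10-second epochs are used but is averaged away in a 60-second epoch, so the same recording yields different intensity time. This is one reason why unreported epoch lengths hamper comparison between studies.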

Strengths and limitations

The main strength of this scoping review is the detailed and extensive mapping of studies using a broad range of methodological approaches and in a diverse group of ambulatory people with physical disabilities and/or chronic diseases. Furthermore, we used a systematic process in this scoping review, with the screening and selection process for the majority done in duplicate using information from four major databases. Another strength is the transparency and openness of the current scoping review. We provided additional information on the screening and analysis processes in the supplements and on Open Science Framework, which greatly improves the reproducibility of our scoping review. Lastly, we provided detailed information on decisions made in the included studies, which has not been reported in such detail in previous reviews on this topic. The Supplementary files add an extra layer of information for the interested reader, and provide extra emphasis on the large variability of the studies (e.g. the variety in what is considered a valid day/case among the studies).

However, some limitations of this scoping review should be acknowledged. One of the limitations is related to the search strategy. Although we carefully developed our search strategy together with an information specialist, it is possible that we missed important search terms (e.g. specific wearables, specific disease groups), which could have resulted in missed relevant studies. Also, the inclusion of some search terms could have led to a relative overrepresentation of certain studies or devices used in the studies. For example, ‘ActiGraph’ was included as a search term in our search strategy, and we found it to be the most used research-grade device in the literature. However, a previous review of device-based PA instruments in cardiovascular patients also found the ActiGraph to be the most frequently used instrument [145]. We did not apply the search filter for measurement properties developed by the COSMIN group [157], as this increased our search results exponentially.

Another limitation is our Dutch view on the rehabilitation setting. One of our inclusion criteria was that the physical disability or chronic disease of the participants must be a primary reason for rehabilitation. However, rehabilitation might not be organized the same across countries. This may have resulted in us excluding certain diagnosis groups that would be included by researchers of other countries, and vice versa, using the same in- and exclusion criteria.

In the current scoping review, we did not stratify the overview of the measurement properties by the setting of the studies (i.e. laboratory setting vs free-living setting), which can be considered a limitation. The difference in setting might influence the measurement properties, and thus entail different concepts. We reported the setting of each study in the description table (Table 1) so that readers who are interested in these concepts can find this information in the current scoping review. However, future reviews could focus in more depth on the differences in setting and their effect on measurement properties.

A limitation inherent to research on device-based PA instruments is how rapidly the technology in this field changes. The technology of these devices develops quickly, so that newer models reach the market before previous models have been properly studied. This is especially true for the consumer-grade instruments, which illustrates a commercially-driven approach to the development of new technology that does not necessarily lead to a quality-driven market. For research purposes, there is more need for valid and reliable instruments.

Future directions

Considering the importance of PA in people with physical disabilities and/or chronic diseases, and the need to measure and quantify PA in this population as stated by different research agendas [9,10,11], instruments with good measurement properties are vital. Due to the large variability in measurement devices and the methods used to evaluate these, we were unfortunately unable to make concrete recommendations for specific devices and settings based on this review. However, this review provides an overview of detailed information per measurement device, which we use to provide directions for research on measurement properties of device-based instruments assessing PA in people with physical disabilities and/or chronic diseases.

  • The focus of research on measurement properties of device-based PA instruments in people with physical disabilities and/or chronic diseases needs to be less on step count as a PA outcome, as it provides a very narrow view of PA behavior. Energy expenditure and intensity time seem important, but the validity of these outcomes needs to be improved. More research is needed on the measurement properties when using activity time since this can be important information for rehabilitation purposes. To better measure the multidimensionality of PA, the use of device-based PA instruments can be supplemented by the simultaneous application of self-report instruments.

  • Studies on measurement properties of device-based instruments should inform readers of important technical decisions made for data collection and data processing. Especially the placement of the device on the body, the epoch length, sampling rate, and the used algorithm in full detail should be reported, as these are known to influence PA measurement. This information will help with data comparison between studies, but will also inform in detail in which situation a device-based instrument should or could be used.

  • Future research should investigate the influence of disease-specific versus general algorithms on the measurement properties (in this case mainly validity) of device-based PA instruments. Intensity is an important aspect of PA, as evidenced by the focus of PA guidelines on moderate to vigorous PA [146, 147]. The use of custom disease-specific algorithms could improve the ability of device-based instruments to capture intensity.

  • More research on the measurement properties of device-based PA instruments should be conducted in populations consisting of people with different physical disabilities and/or chronic diseases, for example by using a functioning-specific approach. It would be beneficial to have a single device-based PA instrument with good measurement properties available for different diagnosis groups. This will improve the ease of use in a rehabilitation setting where different diagnosis groups are treated.

  • Raw data from device-based instruments should be used, instead of using PA outcomes processed by proprietary algorithms. In this way, the measurement properties of the device-based instruments when using raw data can be studied in a diverse population, and these raw data can subsequently be processed into PA outcomes using disease-specific or even individualized algorithms. It is important to note that these algorithms should also be validated. The use of raw data has also been recommended by previous studies [15, 149].

  • Reliability and responsiveness of device-based instruments should be studied more often. These measurement properties are especially important when device-based PA instruments are used to study changes in PA behavior over time. And although there has been an increase in studies on these measurement properties (especially responsiveness) in the last two to three years, they are still underrepresented in the literature of this scoping review.

  • The methodologically correct statistical methods should be used while studying measurement properties of device-based instruments. This will help with comparing different studies and will result in better informed researchers and health professionals when selecting device-based instruments.

Conclusion

There is a large variability in research on the measurement properties of device-based instruments assessing PA in ambulatory adults with physical disabilities and/or chronic diseases. This variability shows a need for more standardization of and consensus on research in this field. Based on this scoping review, the results could provide researchers and health professionals with some directions for selecting a device-based PA instrument that suits their need. Finally, to improve research and bridge knowledge gaps, we provide future directions for researchers interested in studying the measurement properties of device-based instruments assessing PA in ambulatory adults with physical disabilities and/or chronic diseases.