Key Points

  • Both commercially available and custom wearable sensors have some evidence of validity in the literature. Although commercial wearable sensors were validated against pseudo gold standards, each study customised the commercial software to do so.

  • Wearable sensors demonstrated errors < 5° for all degrees of freedom at the wrist and elbow joints when compared to a robotic device. The range in error is greater when measured in vivo and compared to a pseudo gold standard.

  • The measured errors are within margins that warrant future use of wearable sensors to measure joint angle in the upper limb.

Background

Clinicians and researchers seek information about the quality and quantity of patients’ movement as it provides useful information to guide and evaluate intervention. Range of motion (ROM), defined as rotation about a joint, is measured in a variety of clinical populations including those with orthopaedic, musculoskeletal, and neurological disorders. Measurement of ROM forms a valuable part of clinical assessment; therefore, it is essential that it is completed in a way that provides accurate and reliable results [1, 2].

In clinical practice, the goniometer is a widely used instrument to measure ROM [2, 3, 4]. Despite being considered a simple, versatile, and easy-to-use instrument, reports of its reliability and accuracy are varied. Intra-class correlation coefficients (ICCs) range from 0.76 to 0.94 (intra-rater) [3, 4] and 0.36 to 0.91 (inter-rater) [4] for shoulder and elbow ROM. Low inter-rater reliability is thought to result from the complexity and characteristics of the movement, the anatomical joint being measured, and the level of assessor experience [5, 6]. The goniometer is also limited to measuring joint angles in single planes and static positions; thus, critical information regarding joint angles during dynamic movement cannot be measured.

In research settings, three-dimensional motion analysis (3DMA) systems, such as Vicon (Vicon Motion Systems Ltd., Oxford, UK) and Optitrack (NaturalPoint, Inc., Corvallis, OR, USA), are used to measure joint angles during dynamic movement in multiple degrees of freedom (DOF). Such systems are considered the ‘gold standard’ for evaluating lower limb kinematics, with a systematic review reporting errors < 4.0° for movement in the sagittal plane and < 2.0° in the coronal plane; higher values have been reported for hip rotation in the transverse plane (range 16 to 34°) [7]. Measurement in the upper limb is considered more technically challenging due to the complexity of shoulder, elbow, and wrist movements [8]. Nevertheless, given the demonstrated accuracy in the lower limb, 3DMA systems are used as the ‘ground truth’ when validating new upper limb measurement tools [9]. However, 3DMA does have limitations. Most notably, these systems are typically immobile and expensive, require considerable expertise to operate, and are therefore rarely viable for use with clinical populations [10, 11].

Wearable sensors, or inertial measurement units, are becoming increasingly popular for the measurement of joint angle in the upper limb [12]. In this review, we were interested in wearable sensors that contained accelerometers and gyroscopes, with or without a magnetometer, to indirectly derive orientation. The software typically utilised three main steps: (i) calibration, using two approaches: (1) system, also referred to as ‘factory calibration’ (offset of the hardware on a flat surface), and (2) anatomical calibration including both static (pre-determined pose) and dynamic (pre-determined movement) [10, 13]; (ii) filtering, using fusion algorithms including variations of the Kalman filter (KF) [14, 15]; and (iii) segment and angle definition, using Euler angle decompositions and/or Denavit-Hartenberg Cartesian coordinates.
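Step (iii) can be illustrated with a minimal sketch, assuming that the fusion step has already produced an orientation (as a rotation matrix) for a proximal and a distal segment sensor. The joint rotation is taken as the orientation of the distal segment relative to the proximal one and is then decomposed into Euler angles. The Z-X-Y sequence, the function names, and the use of plain nested lists are illustrative assumptions and are not drawn from any of the reviewed studies; which axis corresponds to which anatomical movement depends on how the segments are defined.

```python
import math

def rot(axis, deg):
    """Rotation matrix (3x3 nested lists) about one axis, angle in degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError("axis must be 'x', 'y' or 'z'")

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose, which is also the inverse for a rotation matrix."""
    return [list(row) for row in zip(*a)]

def joint_angles_zxy(r_proximal, r_distal):
    """Euler angles (degrees) of the distal segment relative to the proximal one.

    Assumes the relative rotation factorises as Rz(angle_z) Rx(angle_x) Ry(angle_y).
    The decomposition is ill-conditioned when angle_x approaches +/-90 degrees.
    """
    r = matmul(transpose(r_proximal), r_distal)
    sin_x = max(-1.0, min(1.0, r[2][1]))  # clamp against rounding error
    angle_x = math.degrees(math.asin(sin_x))
    angle_z = math.degrees(math.atan2(-r[0][1], r[1][1]))
    angle_y = math.degrees(math.atan2(-r[2][0], r[2][2]))
    return angle_z, angle_x, angle_y

# Hypothetical example: the distal segment is rotated 30, 5 and -20 degrees
# (Z, X, Y) relative to a proximal segment that is itself rotated in space.
proximal = rot("z", 10)
distal = matmul(proximal, matmul(rot("z", 30), matmul(rot("x", 5), rot("y", -20))))
print(joint_angles_zxy(proximal, distal))
```

Because the proximal orientation is removed before decomposition, the recovered angles do not depend on how the whole limb is oriented in space, only on the relative pose of the two segments.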

Wearable sensors are an increasingly popular surrogate for laboratory-based 3DMA due to their usability, portability, size, and cost. Systematic reviews have detailed their use during swimming [16] and whole body analysis [17] and in the detection of gait parameters and lower limb biomechanics [18]. However, their validity and reliability must be established and acceptable prior to their application [19]. Accuracy of the wearable sensors is dependent on the joint and movement being measured; therefore, a systematic review specific to the upper limb is required. This study aimed to establish the evidence for the use of wearable sensors to calculate joint angle in the upper limb, specifically:

  i. What are the characteristics of commercially available and custom designed wearable sensors?

  ii. What populations are researchers applying wearable sensors for and how have they been used?

  iii. What are the established psychometric properties for the wearable sensors?

Methods

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines [20] and registered with the International Prospective Register of Systematic Reviews on 23 March 2017 (CRD42017059935).

Search Terms and Data Bases

Studies and conference proceedings were identified through searches in scientific data bases relevant to the fields of biomechanics, medicine, and engineering, from their earliest records to November 1, 2016: MEDLINE via PROQUEST, EMBASE via OVID, CINAHL via EBSCO, Web of Science, SPORTDiscus, IEEE, and Scopus. Reference lists were searched to ensure additional relevant studies were identified. The search was updated on 9 October 2017 to identify new studies that met the inclusion criteria.

The following search term combinations were used: (“wearable sens*” OR “inertial motion unit*” OR “inertial movement unit*” OR “inertial sens*” OR sensor) AND (“movement* analysis” OR “motion analysis*” OR “motion track*” OR “track* motion*” OR “measurement system*” OR movement) AND (“joint angle*” OR angle* OR kinematic* OR “range of motion*”) AND (“upper limb*” OR “upper extremit*” OR arm* OR elbow* OR wrist* OR shoulder* OR humerus*). Relevant MeSH terms were included where appropriate, and searches were limited to title, abstract, and key words. All references were imported into Endnote X6 (Thomson Reuters, Carlsbad, CA, USA), and duplicates were removed.

Study Selection Criteria and Data Extraction

Titles and abstracts were screened independently by two reviewers (CW and AC). Full texts were retrieved if they met the inclusion criteria: (i) included human participants and/or robotic devices, (ii) applied/simulated use of wearable sensors on the upper limb, and (iii) calculated an upper limb joint angle. The manuals of commercial wearable sensors were located, with information extracted when characteristics were not reported by study authors. Studies were excluded based on the following criteria: (i) used a single wearable sensor, (ii) included different motion analysis systems (i.e. WiiMove, Kinetic, and smart phones), (iii) used only an accelerometer, (iv) calculated segment angle or position, (v) studied the scapula, or (vi) were not published in English.

Two reviewers (CW and AC) extracted data independently to a customised extraction form. Discrepancies were discussed, and a third reviewer (TG) was involved when consensus was not reached. Extracted parameters of the wearable sensor characteristics included custom and commercial brands, the dimensions (i.e. height and weight), components used (i.e. accelerometer, gyroscope, and magnetometer), and the sampling rate (measured in hertz (Hz)). Sample characteristics included the number of participants, their age, and any known clinical pathology. To determine if authors of the included studies customised aspects of the wearable sensors system, the following parameters were extracted: the type of calibration (i.e. system and anatomical), the fusion algorithms utilised, how anatomical segments were defined, and how joint angle was calculated.

To understand the validity and reliability of the wearable sensors, information about the comparison system, marker placement, and psychometric properties was extracted. The mean error, standard deviation (SD), and root mean square error (RMSE) reported in degrees were extracted where possible from the validation studies. The RMSE represents the error or difference between the wearable sensor and the comparison system (e.g. 3DMA system). The larger the RMSE, the greater the difference (in degrees) between the two systems. Further, studies that did not use the same segment tracking, and therefore could not delineate wearable sensor error from soft tissue artefact (movement of the markers with the skin), were not further analysed. Reliability was assessed using ICCs, with values < 0.60 reflecting poor agreement, 0.60–0.79 reflecting adequate agreement, and 0.80–1.00 reflecting excellent agreement [21].
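The error statistics and ICC bands described above can be sketched as follows. This is a minimal illustration, assuming two time-aligned series of joint angles in degrees, one from the wearable sensor and one from the comparison system; the function names and the example values are hypothetical. The bands leave a gap between 0.79 and 0.80, and the sketch assumes that values in that gap fall in the adequate band.

```python
import math
import statistics

def error_summary(sensor_deg, reference_deg):
    """Return (mean error, SD of the error, RMSE) in degrees.

    The error is taken as sensor minus reference, sample by sample.
    """
    if len(sensor_deg) != len(reference_deg) or not sensor_deg:
        raise ValueError("series must be non-empty and of equal length")
    diffs = [s - r for s, r in zip(sensor_deg, reference_deg)]
    mean_error = statistics.fmean(diffs)
    # Sample SD needs at least two points; report 0.0 for a single sample.
    sd = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    rmse = math.sqrt(statistics.fmean([d * d for d in diffs]))
    return mean_error, sd, rmse

def classify_icc(icc):
    """Map an ICC value to the agreement bands used in this review."""
    if icc < 0.60:
        return "poor"
    if icc < 0.80:  # assumption: 0.79-0.80 is treated as adequate
        return "adequate"
    return "excellent"

# Hypothetical elbow flexion angles (degrees) from the two systems.
sensor = [10.0, 22.0, 29.0]
reference = [12.0, 20.0, 30.0]
mean_error, sd, rmse = error_summary(sensor, reference)
print(f"mean error {mean_error:.2f}, SD {sd:.2f}, RMSE {rmse:.2f}")
print(classify_icc(0.68))
```

Note that a mean error close to zero can coexist with a large RMSE when positive and negative differences cancel, which is one reason both values are worth extracting.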

The following parameters were used to guide the interpretation of measurement error: < 2.0° was considered acceptable, between 2.0 and 5.0° was regarded as reasonable but potentially requiring consideration when interpreting the data, and > 5.0° was interpreted with caution [7].
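These interpretation bands can be written as a small lookup. This is a sketch only; the function name is hypothetical, and it assumes that the boundary values 2.0° and 5.0° both fall in the middle band, following the wording "between 2.0 and 5.0°".

```python
def interpret_error(error_deg):
    """Classify a measurement error (degrees) using the bands stated above."""
    magnitude = abs(error_deg)  # the sign of the error is ignored
    if magnitude < 2.0:
        return "acceptable"
    if magnitude <= 5.0:  # assumption: 2.0 and 5.0 are both "reasonable"
        return "reasonable"
    return "interpret with caution"

# Values taken from ranges reported later in this review.
for value in (1.8, 3.9, 9.7):
    print(value, interpret_error(value))
```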

Assessment of Risk of Bias and Level of Evidence

Due to the variability between research disciplines (i.e. health and engineering) in the way that studies were reported, and the level of detail provided about the research procedures, the available assessments of risk of bias and levels of evidence were not suitable for this review. Therefore, the following criteria were used to evaluate the quality of the reporting in the included studies:

  • The aim of the study was clear and corresponded to the results that were reported.

  • The study design and type of paper (i.e. conference proceeding) were considered.

  • Number of participants included in the study was considered in relation to the COSMIN guidelines which indicate that adequate samples require 50–99 participants [19].

Results

The initial search (2016) identified 1759 studies eligible for inclusion, with an additional 432 studies identified 12 months later (2017). A total of 66 studies met the inclusion criteria (Fig. 1). Eight studies reported on the validation against a robotic device, and 22 reported on validation against a motion analysis system with human participants. One study assessed the reliability of the wearable sensors, with the remaining 35 studies using wearable sensors as an outcome measure in an experimental design.

Fig. 1 A PRISMA diagram of the search strategy

Characteristics and Placement of the Wearable Sensors

The characteristics of the wearable sensors are summarised in Table 1. A total of seven customised wearable sensors and 13 commercial brands were identified. The level of detail provided for the placement of the wearable sensors on the upper limb varied significantly, as did the mode of attachment (Table 1).

Table 1 Summary of the descriptive characteristics of the wearable sensors

Calibration Methods

Forty-seven studies reported on a calibration procedure prior to data acquisition. System calibration, also commonly known as ‘factory calibration’, was reported on 12 occasions, with two procedures described for the wearable sensors: (i) placement on a flat surface and/or (ii) movement in a pre-determined order while attached to a flat surface [56, 62]. The reported aims of system calibration were to align coordinate systems [39, 56] and to account for inaccuracies in the orientation of the wearable sensor chip relative to its case/packaging [62]. Static anatomical calibration was performed often (n = 34), with dynamic anatomical calibration performed sometimes (n = 10) [23, 30, 36, 41, 45, 49, 57]. Only one study used system calibration alongside both static and dynamic anatomical calibrations to compute joint kinematics [47].

Populations Assessed Using Wearable Sensors

Most studies (n = 52) recruited healthy adults; participants with known pathology were reported in nine studies (Table 1). One study recruited children (< 18 years) [49]. Sample sizes ranged from 1 to 54 participants, with a median sample of 7.6 participants per study. Twenty-nine studies recruited fewer than five participants, with 20 studies recruiting a single participant.

Psychometric Properties of Wearable Sensors

Validity

Validation studies were split into two categories: (i) studies that compared the wearable sensor output to simulated upper limb movement on a robotic device (Table 2) and (ii) studies that compared wearable sensors output to a 3DMA system on a human participant (Table 3). The term ‘error’ is used to describe the difference between the capture systems; however, we acknowledge that comparisons between the wearable sensors and a robotic device are the only true measures of error.

Table 2 List of the 8 articles organised by first author and containing information related to the validation of wearable sensors for the measurement of joint angle for simulated movements of the upper limb when compared to a robotic device
Table 3 List of the selected 22 articles organised by first author and containing information related to the validation of wearable sensors for the measurement of joint angle in upper limb when compared to a three-dimensional motion analysis system

Robot Comparisons

Eight studies reported the error of wearable sensors when compared to simulated upper limb movement on a robotic device (Table 2). A mean error between 0.06 and 1.8° for flexion and 1.05 and 1.8° for lateral deviation of the wrist was reported using Xsens [28, 31]. For elbow flexion/extension, the difference between Invensence and the robotic device was between 2.1 and 2.4° [59]. For finger flexion/extension, RMSEs ranged from 5.0 to 7.0° using a customised wearable sensor system [77].

Three studies reported the error associated with the use of different fusion algorithms. Using the unscented Kalman filter (UKF) to fuse data from Opal wearable sensors, the RMSE range was 0.8–8.1° for 2DOF at the shoulder, 0.9–2.8° for 1DOF at the elbow, 1.1–3.9° for 1DOF of the forearm, and 1.1–2.1° for 2DOF at the wrist [46, 48]. The rotation of the shoulder and twist of the wrist resulted in more error compared to single plane movements of flexion/extension and pronation/supination [46, 48]. When the UKF was compared to a modified UKF, lower RMSEs were found across all 6DOF using the modified UKF [46]. One study investigated the effects that speed of movement had on measurement error. Using Opal wearable sensors, the UKF was compared to the extended Kalman filter (EKF) under three speed conditions: slow, medium, and fast. For slow movements, both fusion algorithms were comparable across all 6DOF (RMSE 0.8–7.8° for the UKF and 0.8–8.8° for the EKF). The UKF resulted in less error across 6DOF for the medium (RMSE 1.2–3.0°) and fast (RMSE 1.1–5.9°) speeds compared to the EKF (RMSE 1.4–8.6°; 1.4–9.7°) [48].

3DMA Comparisons

Twenty-two studies compared the joint angles calculated by wearable sensors, both custom and commercial, to a ‘gold standard’ 3DMA system (Table 3). Seven of these studies used same segment tracking (i.e. motion analysis markers placed directly on the wearable sensors). Opal wearable sensors were compared to a 3DMA system during simulated swimming (multiplane movement). The largest difference between the two systems occurred at the elbow (RMSE 6–15°), with the least occurring at the wrist (RMSE 3.0–5.0°) [45]. Xsens was compared to codamotion during single plane movement, with the addition of a dynamic anatomical calibration trial [30]. The largest difference occurred at the elbow (5.16° ± 4.5 to 0.54° ± 2.63), and the least difference at the shoulder (0.65° ± 5.67 to 0.76° ± 4.40) [30]. Xsens was compared to Optotrak with consistent differences between systems across all DOFs of the shoulder (RMSE 2.5–3.0°), elbow (RMSE 2.0–2.9°), and wrist (RMSE 2.8–3.8°) [24].

Three studies investigated the performance of wearable sensors using different fusion methods to amalgamate the data and compared this to a ‘gold standard’ system. Zhang and colleagues [34] compared the accuracy of their own algorithm to two pre-existing algorithms. Comparing Xsens to the BTS Optoelectronic system, their methodology resulted in less error (RMSE = 0.08°, CC = 0.89 to 0.99) across 5DOF compared to the two other methods [34]. The addition of a magnetometer in the analysis of data was also investigated using the EKF- and non-EKF-based fusion algorithm [15]. The latter produced the least difference between the two systems, irrespective of the speed of the movement and whether or not a magnetometer was included. In contrast, the EKF fusion algorithm resulted in the largest difference from the reference system, particularly for fast movements where magnetometer data was included (7.37° ± 4.60 to 11.91° ± 6.27) [15]. The level of customisation to achieve these results is summarised in Table 4.

Table 4 Summary of the software customisation reported by the authors for validation studies that used the same segment tracking

One study compared the difference between YEI Technology (YEI technology, Portsmouth, OH) wearable sensors and Vicon during three customised calibration methods for the elbow, which resulted in RMSEs that ranged from 3.1 to 7.6° [69].

Reliability

Adequate to excellent agreement was reported for 2DOF at the shoulder (ICC 0.68–0.81) and poor to moderate agreement for the 2DOF at the elbow (ICC 0.16–0.83). The wrist demonstrated the highest overall agreement with ICC values ranging from 0.65 to 0.89 for 2DOF [73].

Risk of Bias

The sample sizes of the included studies were mostly inadequate, with 30% including single participants (Table 1). Twenty-eight percent of the included studies were conference papers, providing limited information.

Discussion

This systematic review described the characteristics of wearable sensors that have been applied in research and clinical settings on the upper limb, the populations with whom they have been used, and their established psychometric properties. The inclusion of 66 studies allowed for a comprehensive synthesis of information.

Similar to other systematic reviews on wearable sensors, commercial wearable sensors, as opposed to custom designed, were reported in most studies (83%) [17]. One benefit for users of commercial wearable sensors is the user-friendly nature of the associated manufacturer guidelines and processing software, including in-built fusion algorithms and joint calculation methods. However, the studies that utilised commercial hardware often customised aspects of the software (i.e. fusion algorithm, calibration method, anatomical segment definition, and the kinematic calculation). Therefore, the validity and reliability of an entirely commercial system (hardware and software) for use in the upper limb remains unknown. Customisation impacts the clinical utility of the wearable sensor systems, especially if there are no support personnel with appropriate knowledge and expertise.

Of the studies reviewed, there was no consensus on the procedures to follow for using wearable sensors on the upper limb. The placement of the wearable sensors varied and, in some cases, was poorly described. Manufacturer guidelines for placement of commercial wearable sensors were not referred to, which led to apparent differences in placement for studies that utilised the same commercial brand. Multiple fusion algorithms were reported, with no clear outcome about which was best suited to a specific joint or movement. The level of customisation of fusion algorithms makes it difficult to compare between studies, and often, the specifics of the algorithm were not readily available, limiting replication. Similar inconsistencies and a lack of consensus were reported in other systematic reviews investigating use of wearable sensors [16, 87]. Without clear guidelines, measurement error can be introduced and/or exacerbated depending on the procedures followed.

The methods of calibration also varied between studies, with a static anatomical calibration the most commonly utilised method (typically adopting a neutral pose, standing with arms by the side and palms facing forward, as recommended by most manufacturers). Dynamic anatomical calibration was often customised to suit the needs of the study and the joint being measured. For example, dynamic anatomical calibration of the elbow varied from repetitions of flexion and extension at various speeds [59], to the rapid movement of the arm from 45° to neutral [42]. Details of the dynamic anatomical calibrations were omitted in some studies, limiting replication. Anatomical calibration is more pertinent than system calibration for the calculation of joint kinematics, with the type of calibration (i.e. static or dynamic) and the movements of the dynamic anatomical calibration having a significant impact on the accuracy of wearable sensors [69].

Of the 66 studies included in this review, almost half (45%) were validation studies with the remaining studies using wearable sensors as an outcome measure. Over one quarter (29%) were conference proceedings in the field of engineering, thus limiting the amount of information available. The median sample size was 7.6 participants per study; only one study was considered to have an adequate sample size for the validation of a measurement tool as per the COSMIN guidelines [19]. The majority (78%) of the results were obtained from healthy adults, with clinical populations (12%) and those under the age of 18 (1.5%) not well represented. Research investigating the use of wearable sensors to measure lower limb kinematics has demonstrated a level of accuracy with clinical populations and children. Errors < 4° were reported for elderly individuals with hemiparesis [88] and RMSEs between 4.6 and 8.8° for children with spastic cerebral palsy [10]. There is potential for wearable sensors to be applied to the upper limb of these populations; however, more research is required to determine the optimal procedures prior to implementation in clinical practice.

The validity and reliability of wearable sensors when applied to the upper limb has not been clearly described to date. When compared to a robotic device, the commercial wearable sensors with customised software recorded errors below McGinley’s [7] suggested 5.0° threshold. Less than 3.9° was reported for replica/simulated movements of the wrist in 3DOF [28, 46, 48, 56], < 3.1° for 2DOF at the elbow [46, 48, 56], and < 2.5° for 1DOF (flexion/extension) at the shoulder [48]. Shoulder internal and external rotation resulted in the largest error (3.0–9.7°) [48], and therefore, results for this movement should be interpreted with caution.

For ‘in vivo’ studies, 3DMA served as a pseudo gold standard. Studies that made a direct comparison between the wearable sensors and 3DMA system (i.e. used the same segment tracking) demonstrated differences that exceeded the suggested 5.0° threshold, with up to 15.0° difference reported for the elbow. However, depending on the software specifications and level of customisation, a difference of < 0.11° (3DOF shoulder), < 0.41° (2DOF elbow), and < 2.6° (2DOF wrist) was achievable. The range in difference observed between the two systems indicates that wearable sensors are still largely in a ‘developmental phase’ for the measurement of joint angle in the upper limb.

Consistent with prior findings, error values were unique to the joint and movement tasks being measured. Most of the tasks involved movements in multiple planes (i.e. reaching tasks), which resulted in more error compared to studies that assessed isolated movement in a single plane (i.e. flexion and extension). Measuring multiple planes of movement poses a further challenge to motion analysis and needs careful consideration when interpreting the results [89].

Limitations

Due to the heterogeneity in the reported studies, a meta-analysis was not appropriate given the variance in sample sizes, movement tasks, different procedures, and statistical analyses used. It was also not possible to apply a standard assessment of quality and bias due to the diversity of the studies. The inclusion of small samples (30% single participant) is a potential threat to validity, with single participant analysis insufficient to support robustness and generalisation of the evidence. The inclusion of conference papers (28%) meant that many papers provided limited detail on the proposed system and validation results. Small sample sizes and the inclusion of mostly healthy adults mean the results of this review cannot be generalised to wider clinical populations. In addition, studies that utilised different segment tracking (i.e. 3DMA markers were not mounted on the wearable sensor) were not further analysed as it was not possible to delineate between the sources of error.

Conclusion

Wearable sensors have become smaller, more user-friendly, and increasingly accurate. The evidence presented suggests that wearable sensors have great potential to bridge the gap between laboratory-based systems and the goniometer for the measurement of upper limb joint angle during dynamic movement. A level of acceptable accuracy was demonstrated for the measurement of elbow and wrist flexion/extension when compared to a robotic device. Error was influenced by the fusion algorithm and method of joint calculation, which required customisation to achieve errors < 2.9° from known angles on a robotic device. Higher error margins were observed in vivo when compared to a 3DMA system, but < 5° was achievable with a high level of customisation. The additional level of customisation that was often required to achieve results with minimal error is particularly relevant to clinicians with limited technical support, and critically, when using a system ‘off the shelf’, the expected level of accuracy may not be comparable to the findings reported in this review.

With this technology rapidly evolving, future research should establish standardised protocols and guidelines, and subsequently reliability and validity, for use in the upper limb and in various clinical populations. Direct comparisons with the gold standard (i.e. same segment tracking) are needed to produce results that are most meaningful. We recommend and encourage the use of wearable sensors for the measurement of flexion/extension in the wrist and elbow; however, this should be combined with outcome measures that have demonstrated reliability and validity in the intended population.