Introduction

Bicycles are widely used as a form of active transportation because of their known health benefits and lower environmental impact compared with motor vehicles [1]. In addition, cycling is a very popular sport with a strong history of scientific engagement to improve performance and reduce the risk of injuries [2]. Among the most used methods to alter movement in order to reduce the risk of injuries and improve performance is video analysis [3]. Modern systems enable three-dimensional assessment of human movement but are limited to laboratory environments or are prohibitively expensive, which limits their use in most clinics and recreational sports settings. Some systems utilise wearable sensors (e.g. LEOMO®) or a dual-camera setting which integrates data in three-dimensional space (i.e. RETUL®). However, RETUL® does not disclose the cost of the system to those who do not take part in its training module, and it extracts pre-determined variables that have not been based on scientific data. The LEOMO® has been shown to produce moderate levels of agreement (ICC = 0.52–0.71) for some outcomes [4], which are also not supported by robust scientific evidence.

Traditionally, analysis of the movement of cyclists has been undertaken on a stationary ergometer/trainer using two-dimensional (2D) video footage [5,6,7]. Markers are attached to the cyclist’s body and should be visible in the video frame to enable tracking in real time or after the video is recorded. This method enables clinicians and coaches to explore the sensitivity of joint angles to changes in exercise intensity, pedalling cadence, fatigue, and body position on the bicycle [8,9,10]. However, tracking markers is time consuming and depends on the skill level of the practitioner palpating the appropriate bony landmarks [11]. This limits the large-scale use of quantitative movement analysis in clinical settings.

The rapid development of trained neural networks to identify key human joint locations has provided an opportunity to streamline the analysis of videos (e.g. marker tracking). Neural network approaches to 2D human pose estimation are based around training a large model with input/output pairs, where the input is an RGB image and the output is a complete set of 2D joint locations. After training is complete, the model approximates a function for mapping images to joint locations. Even though studies have explored the validity of marker-less methods in determining joint angles [12, 13], only two studies have explored the validity of pre-trained neural networks for cycling movement [14, 15]. Data from these studies suggest that a popular convolutional neural network (CNN) method for pose estimation proposed by Microsoft Research Asia [16] results in errors between 3 and 12° whilst OpenPose [17] led to errors of 4–22° in relation to a criterion measure [14, 15]. These errors would be potentially larger than the range proposed to determine body position on the bicycle (i.e. 10°; [18, 19]). In addition, data from Bini et al. [15] demonstrated that a statistical parametric mapping method (i.e. SPM; [20]) provides a temporal comparison between the marker-less and marker-based datasets, which identifies the sections of the crank cycle where a given method is less accurate. This method should be implemented when assessing the validity of other marker-less methods.

With this in mind, this study examined the validity of two neural networks pre-trained to track key human joint locations in images (i.e. TransPose and MediaPipe) with potential ability to improve tracking of body segments. TransPose-R-A4 [21] was selected as a model representative of state-of-the-art accuracy in 2D human pose estimation. This model architecture incorporates a ResNet backbone [22] with a Transformer encoder [23] and requires a computer equipped with a GPU device for timely inference. MediaPipe BlazePose GHUM Heavy [24] was selected as a model representative of state-of-the-art efficiency in human pose estimation, since its optimised architecture enables inference in a range of computational environments including on smartphones and within web browsers. For validation of joint angles calculated using data from these networks, tracked reflective markers (reference) were utilised with the hypothesis that both networks would provide acceptable agreement in relation to the reference data.

Materials and methods

Twenty-six cyclists (four females and twenty-two males) with 37 ± 10 years of age, 178 ± 9 cm of stature and 80 ± 11 kg of body mass ranging from recreational to competitive were assessed in a single session using their own bicycles. Before data collection, all cyclists signed an informed consent to participate in the study, which was approved by the University Human Ethics Committee (AUTEC09/178). The sample size was calculated utilising a correlational model aiming for an effect size of ρ > 0.55 (large effect) with α < 0.05 and power of 0.80 using G*Power statistical package [25]. We based our calculations on the test–retest reliability of joint angles in cycling indicating that a coefficient of determination of 0.30 (i.e. effect size of 0.55) would be detectable when 21 samples are utilised [26]. The rationale for adding five cyclists was to ensure that any issues with processing video files would not result in less than 21 cyclists with all data available for statistical analysis.

After measurements of stature and body mass, cyclists performed 2 min of cycling on their own bicycles attached to a cycle trainer (Kingcycle, Buckinghamshire, UK) at self-selected cadence using their cycling shoes and cleats. Participants were instructed to sustain an intensity equivalent to long duration flat cycling. A digital camera (Samsung ES15, Seoul, South Korea), positioned at the height of the saddle and 4 m away from the bicycle, recorded movement in the sagittal plane. Reflective markers were positioned at the greater trochanter, lateral femoral epicondyle, lateral malleolus, and pedal spindle (Fig. 1). Videos were recorded for 20 s at the end of the 2 min of exercise at 30 fps (frame resolution of 640 × 480) using automated quick shutter and anti-shake settings to minimise blur. Standard video rather than high-speed video was selected to simulate the specifications of most smartphone video cameras, which are widely used in clinics and sports settings.

Fig. 1

Illustration of the kinematic model used to calculate hip (θH), knee (θK) and ankle (θA) angles. Inset illustrates the model used for TransPose (TP) and MediaPipe (MP)

Comparison between the TransPose and the MediaPipe methods in relation to reference data (marker tracking) was performed. Pre-trained model weights were obtained from each method’s respective public code release and incorporated into a customised evaluation framework. In this context, the term “pre-trained model weights” refers to the fact that the neural network was previously trained on a separate dataset (as opposed to training the models on our cycling data). Since the existing TransPose model weights were not trained with detailed foot keypoints, this model was further fine-tuned using the Human Foot Keypoint Dataset [17]. The cycling video files were then imported to a customised programme which first located the cyclist using an object detection model (YOLOv5) and then inferred joint centres. An object detection model predicts bounding boxes for objects in an image (these are also referred to as “detections”). Whereas a pose estimation model maps an image to joint keypoints, an object detection model maps an image to object bounding boxes. In the context of this work, we use an object detector to locate the cyclist within the broader image. Separate fine-tuning and detection steps were not necessary for the MediaPipe model. Predicted joint centres (i.e. keypoints) were obtained from both methods and utilised to calculate hip, knee and ankle angles, as shown in Fig. 1. Keypoints were gap filled using a median filter and a moving average was utilised to reduce noise from the automated digitisation prior to angular calculations.
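The angle calculation and smoothing steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the five-frame windows, and the use of the included angle between adjacent segments are assumptions, because the exact angle conventions follow Fig. 1 and the filter settings are not reported.

```python
import math
from statistics import median


def joint_angle(proximal, joint, distal):
    """Included angle (degrees) at `joint` between the segments
    joint->proximal and joint->distal, from 2D (x, y) keypoints."""
    v1 = (proximal[0] - joint[0], proximal[1] - joint[1])
    v2 = (distal[0] - joint[0], distal[1] - joint[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    # atan2 of |cross| and dot is numerically stable near 0 and 180 degrees
    return math.degrees(math.atan2(abs(cross), dot))


def median_filter(values, window=5):
    """Sliding median; the window shrinks at both ends of the series.
    The window length is an assumed value."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def moving_average(values, window=5):
    """Centred moving average; the window shrinks at both ends.
    The window length is an assumed value."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical keypoints (pixels) for one frame: hip, knee, ankle
knee_angle = joint_angle((0, 0), (0, 1), (1, 1))
```

In this sketch each keypoint coordinate series would be passed through `median_filter` and then `moving_average` before `joint_angle` is evaluated frame by frame, which mirrors the order of operations in the text.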

As a criterion measure, hip, knee and ankle angles were also calculated using reflective markers digitised from each frame. Semi-automatic digitisation was performed using motion analysis software (Skill Spector, Video4Coach, Denmark). The same median filter and moving average were applied to the digitised joint centres so that differences in filtering would not affect comparisons with the marker-less methods. An offset was applied to ankle angles from the marker-less outputs because these angles were measured differently from the criterion method, in which the ankle angle was determined using the pedal axle (see Fig. 1). Data from the two methods and the criterion were sectioned into ten consecutive crank cycles, and the mean time series for each cyclist was obtained for further analysis.
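Sectioning an angle series into crank cycles and averaging them could look like the sketch below. How crank cycles were identified is not reported here, so the example assumes that the frame index at the start of each cycle is already known (the hypothetical `cycle_starts` list). The 101-point time normalisation and the function names are also assumptions.

```python
def resample_cycle(samples, n_points=101):
    """Linearly interpolate one crank cycle onto n_points evenly
    spaced samples (time normalisation). Needs at least 2 samples."""
    last = len(samples) - 1
    resampled = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return resampled


def mean_cycle(angle_series, cycle_starts, n_points=101):
    """Average the cycles delimited by consecutive frame indices in
    cycle_starts (assumed to be known, e.g. from the pedal marker)."""
    cycles = [
        resample_cycle(angle_series[start:end + 1], n_points)
        for start, end in zip(cycle_starts, cycle_starts[1:])
    ]
    return [sum(column) / len(column) for column in zip(*cycles)]


# Toy example: two short "cycles" with different peak angles
example = mean_cycle([0, 10, 0, 20, 0], cycle_starts=[0, 2, 4], n_points=3)
```

With eleven start indices, `mean_cycle` would average ten consecutive cycles, which matches the number of cycles analysed per cyclist.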

Comparisons of temporal patterns between methods were performed using statistical parametric mapping within the spm1d statistical package (www.spm1d.org) in MATLAB. Paired samples t-tests were conducted to compare each marker-less method with the reference data. Typical errors were calculated for the whole crank cycle for comparisons between methods as the standard deviation of the differences divided by the square root of 2 [27]. Correlation coefficients (with 95% confidence intervals) were also calculated between waveforms in MATLAB. R values were ranked as poor (0–0.5), moderate (0.5–0.75), good (0.75–0.90), and excellent (> 0.9) [28].
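The typical error and the correlation ranking can be illustrated with a short Python sketch. The study ran these analyses in MATLAB, so this is only an equivalent illustration under stated assumptions: the function names are hypothetical, the handling of values that fall exactly on a category boundary is assumed, and the paired inputs are taken to be angles from one marker-less method and the reference at the same point of the crank cycle across cyclists.

```python
import math
from statistics import stdev


def typical_error(method, reference):
    """Sample standard deviation of the paired differences divided by sqrt(2)."""
    differences = [m - r for m, r in zip(method, reference)]
    return stdev(differences) / math.sqrt(2)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def rank_correlation(r):
    """Label r using the thresholds quoted in the text.
    Boundary handling (e.g. exactly 0.75) is an assumption."""
    if r > 0.9:
        return "excellent"
    if r >= 0.75:
        return "good"
    if r >= 0.5:
        return "moderate"
    return "poor"


# Hypothetical paired angles (degrees) at one point of the crank cycle
te = typical_error([2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
label = rank_correlation(0.97)
```

Applying `typical_error` at each normalised point of the crank cycle would give an error curve across the cycle, which is one way of expressing the error through the crank cycle.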

Results

Correlation coefficients for hip angles between the TransPose method and the reference were 0.97 [excellent, 0.97–0.98, p < 0.01]. For knee angles, correlation coefficients between the TransPose method and the reference were 0.98 [excellent, 0.98–0.99, p < 0.01]. For the ankle angle, correlation coefficients between the TransPose method and the reference were 0.47 [poor, 0.46–0.49, p < 0.01]. For the hip angle, significantly less flexion was observed for TransPose than the reference between 90 and 129° and more flexion was observed between 304 and 331° of the crank cycle (Fig. 2). Significantly less knee flexion was also observed for TransPose compared to the criterion between 50 and 110° of the crank cycle (Fig. 3). For the ankle, significantly more plantar flexion was observed for TransPose between 44 and 59° of the crank cycle (Fig. 4).

Fig. 2

Top panel: Hip angle temporal comparison between criterion (Tracked—blue), TransPose (TP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Fig. 3

Top panel: Knee angle temporal comparison between criterion (Tracked—blue), TransPose (TP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Fig. 4

Top panel: Ankle angle temporal comparison between criterion (Tracked—blue), TransPose (TP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Correlation coefficients for hip angles between the MediaPipe method and the reference were 0.91 [excellent, 0.90–0.91, p < 0.01]. For knee angles, correlation coefficients between the MediaPipe method and the reference were 0.96 [excellent, 0.95–0.96, p < 0.01]. For the ankle angle, correlation coefficients between the MediaPipe method and the reference were 0.25 [poor, 0.23–0.27, p < 0.01]. For the hip angle, significantly less flexion was observed for MediaPipe than the reference between 90 and 129° and more flexion was observed between 304 and 331° of the crank cycle (Fig. 5). Significantly more flexion was observed for MediaPipe between 0 and 36° and between 175 and 360° of the crank cycle (Fig. 5). The knee was also more flexed for MediaPipe between 147 and 272° of the crank cycle (Fig. 6). For the ankle, significantly more plantar flexion was observed for MediaPipe throughout the crank cycle (Fig. 7).

Fig. 5

Top panel: Hip angle temporal comparison between criterion (Tracked—blue), MediaPipe (MP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Fig. 6

Top panel: Knee angle temporal comparison between criterion (Tracked—blue), MediaPipe (MP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Fig. 7

Top panel: Ankle angle temporal comparison between criterion (Tracked—blue), MediaPipe (MP—red). Bottom panel presents SPM 1-d statistics. Solid lines in the top panel are the mean values whilst dashed areas represent standard deviations. Solid lines in the bottom panel present the t statistic outputs whilst the dashed red lines show the critical t value for significant differences (colour figure online)

Typical errors are presented in Figs. 8 (TransPose) and 9 (MediaPipe). For the hip angle, typical errors ranged between 1 and 3° for the TransPose method in relation to the criterion, whilst the MediaPipe method presented errors of 3–6°. For the knee angle, typical errors ranged between 2 and 3° for the TransPose method and 3–6° for the MediaPipe method, as illustrated in Fig. 5. For the ankle angle, typical errors between the TransPose method and the criterion ranged between 3 and 10°, whilst those for the MediaPipe method ranged between 5 and 9°.

Fig. 8

Typical errors for the hip (HA), knee (KA) and ankle (AA) angles through the crank cycle between TransPose (TP) and the reference data (Tracked)

Fig. 9

Typical errors for the hip (HA), knee (KA) and ankle (AA) angles through the crank cycle between MediaPipe (MP) and the reference data (Tracked)

Discussion

This study demonstrated that TransPose presented stronger agreement with, and smaller differences from, the reference method than MediaPipe, which partially supports our hypothesis. The major differences between both marker-less methods and the reference data were at the ankle joint. This information is important because MediaPipe is gaining popularity as a result of its versatility, which enables deployment on smartphones and in web browsers. However, the magnitude of errors from MediaPipe should be taken into consideration depending on the application.

In prior studies involving walking gait, marker-less methods presented differences between < 1° [12] and 6° [13], which is comparable to findings from the current study for the hip and knee joints. For cycling, Bini et al. observed 3–12° of difference between the MSRA and the reference data [14], which suggests that TransPose and MediaPipe may perform better than the MSRA. It is also important to highlight that these methods seem to perform well when tracking the hip and knee joints but struggle to track foot markers. This is why both TransPose and MediaPipe produced poor agreement in determining the ankle angle. Visual inspection of the videos generated by these methods suggests that this was potentially due to increased blur at the foot caused by the lower shutter speed, which made it difficult for the marker-less methods to detect the toes accurately. A faster shutter speed and a higher quality image sensor should improve the accuracy of these methods in future applications.

There are multiple factors that influence joint angles during cycling, including exercise intensity, cadence, and fatigue. Prior research observed that the ankle range of motion increases by ~ 4° and mean ankle angle reduces by ~ 3° when intensity is increased during cycling [29], which suggests that neither of the marker-less methods tested in the current study would be sensitive enough to detect these changes. Another application of joint kinematics is as an input in musculoskeletal modelling. Simulating changes in knee angle of 3–6° in terms of the moment-arm of the vastus lateralis in a publicly available model [30] would result in errors of 0.14–0.30 cm, which could be deemed small. Therefore, it seems possible that both marker-less methods could offer an open-source alternative to subscription-based marker-less software, but further research is required to fully determine the magnitude of these errors in terms of internal loads. There are potential implications for bicycle fitting because most studies recommend a range of knee angles to optimise saddle position (e.g. 30–40°; [31]), which should be detectable by TransPose and MediaPipe. In addition, knee forces do not seem to be sensitive to changes in knee angles of ~ 10–14° [32], which suggests that large changes in cycling kinematics could be detectable by both methods.

The choice of pre-trained neural networks required amendments to TransPose, as this model was not initially prepared to identify foot keypoints. In addition, neither of the marker-less methods has been extensively exposed to cycling images or poses taken purely from the sagittal plane [33]. It is probable that fine-tuning TransPose and MediaPipe with sagittal plane cycling images would further improve their accuracy, particularly when tracking the foot.

The use of two-dimensional video analysis is a limitation of this study because of possible parallax errors. The choice of a two-dimensional model was based on the wider use of this method in most clinical settings and bike fitting practices, due to the low cost of video recording devices. Even though data from walking gait demonstrated good agreement between two-dimensional marker-less and three-dimensional marker-based methods [34], it should be assumed that there would be ~ 2.2–10° of error in relation to the true movement of cyclists detected using three-dimensional data [35, 36]. Our choice of a standard frame rate (i.e. 30 fps) and standard video resolution (640 × 480 pixels) was also in line with the fact that most commercial cameras will be limited in terms of frame rate. Results may improve if the frame rate and image resolution are higher than those used in this study.

Conclusions

In summary, the TransPose method presented stronger agreement with a criterion method in determining joint angles than the MediaPipe method. However, poor correlation was observed at the ankle joint for both marker-less methods, which limits their accuracy in tracking this joint.