1 Introduction

The measurement of the range of motion (ROM) is critical for health care and clinical treatment professionals. The importance of ROM is also increasing in the field of rehabilitation exercises (RE) after surgery, disability, accidents, and so on. Measurement devices of ROM (MDROM) can yield ROM information with relatively high precision only if the values are measured by well-trained and experienced persons. However, such person-to-person (P2P) measurement with MDROM has several drawbacks. First, P2P measurement with MDROM requires resources such as education and training for professional knowledge, which entails cost and a certain period of time. Second, it is constrained by the fact that at least two persons must meet at the same time in the same physical space. Third, even when the measurer is well trained, the results are not always consistent, since they depend on the person.

With the recent rapid progress of artificial intelligence (AI) in image processing, there is a possibility that these shortcomings can be compensated for. AI covers various areas; one AI-based image processing task is detecting pose landmarks in an image or video stream. MediaPipe provides an open application programming interface (API) [3], called 'ML kit', to software developers for the purpose of building applications easily and quickly. The right figure of Fig. 1 is an example produced by the 'ML kit pose detection API' (we will write 'ML-kit' for brevity hereinafter). ML-kit provides coordinates of 33 body landmarks covering the entire body (face, arms, legs, etc.). Each coordinate consists of three values representing the x-, y-, and z-axes of a landmark.

Image processing powered by machine learning algorithms shows a powerful ability to replace P2P MDROM measurement. Researchers have also demonstrated practical implementations in various areas such as underwater running [4], markerless systems [5], pose matching [6], measuring single-leg squat kinematics [7], and so on.

Fig. 1. Two examples of measuring the human body: P2P-based ROM measurement (left) [1] and machine-learning-based pose (landmark) detection (right) [2].

Besides medical and rehabilitation applications, RE machines are already on the commercial market. For example, a multi-purpose device covering rehabilitation, therapy, and fitness is introduced in [8], along with mobility-enabled rehabilitation exercise equipment [9], a rehabilitation machine for people with disabilities [10], a home healthcare device with IoT [11], and so on. However, almost no RE machines have ROM functionality based on computer vision systems. We argue that the main reason is that the output data from computer vision are not reliable enough, and are hard to implement on RE machines in terms of accuracy compared to P2P approaches.

In this paper, we focus on machines to be used in RE, using machine learning to facilitate ROM measurement with vision systems. During our research project, we developed new RE machines, mainly focused on upper-body rehabilitation exercise/training. To exploit computer vision systems, we installed camera devices and developed analysis software for various motions in rehabilitation exercise. One challenging problem is that, since the images from rehabilitation training provide 2D coordinates, it is not trivial to estimate body angles in the actual 3D coordinates of the real world. The easiest approach is to apply multiple cameras to measure ROM on the RE machine. However, mounting several cameras is disadvantageous in terms of RE operation, maintenance, data processing on the computing devices inside the RE machine, and movability. We address this problem with a machine learning approach to overcome the insufficient 2D spatial information in the 3D world.

2 Problem Definition

2.1 Problems of 2D Information in 3D World

To provide a better experience and systematic rehabilitation exercise, we developed an RE machine which can monitor and analyze 13 types of motions based on a computer vision system. We report the rehabilitation actions supported by our RE machine in Table 1. The RE machine consists of exercise devices, an information screen, computing devices, and a camera. The detailed configuration is shown in Fig. 2.

Table 1. Supported rehabilitation motions in our RE.

We utilized ML-kit to detect landmarks of the human body; it supports extracting the 3D coordinates of 33 landmark points in real time. For more detailed information, refer to [12]. The ML-kit produces 3D coordinates as tuples \((x, y, z)\), where \(x\) and \(y\) are the landmark coordinates normalized to [0, 1] and \(z\) is the depth relative to the midpoint of the hips.
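As a concrete illustration (not part of our original pipeline), the following minimal Python sketch extracts the 33 landmarks from a single frame using the MediaPipe Pose solution, which underlies ML-kit pose detection; the file name frame.jpg is a placeholder.

```python
# Minimal sketch: extract the 33 pose landmarks from one frame with
# MediaPipe Pose (the model behind ML-kit pose detection).
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

with mp_pose.Pose(static_image_mode=True) as pose:
    image = cv2.imread("frame.jpg")  # placeholder file name
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        # lm.x and lm.y are normalized to [0, 1]; lm.z is the depth
        # relative to the midpoint of the hips.
        print(idx, lm.x, lm.y, lm.z)
```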

Fig. 2. The configuration of the RE machine we developed. A camera is installed on the head area of the RE.

A problem arises here: if we consider only the 2D image, the estimation of ROM has errors by nature. For example, if we measure the shoulder-elbow-wrist angle of ROM in 'left elbow anterior up and down' (motion #5 in Table 1) using only one camera as shown in Fig. 1, the angle equals 185°. However, if we measure the angle after moving the camera to the left side of the person, the value equals 118°, which shows unacceptable errors. For an intuitive understanding, we portray the problematic situation in Fig. 3.

Fig. 3. The landmarks from the front camera (right) and those from the left side of the person (left).

The measured angles are 185° and 118° in the left- and right-hand figures of Fig. 3, respectively. Note that we applied the well-known mathematical equation for the angle formed by three points in 2D space, stated in Eq. (1).

$$angle=\left|radians\times \frac{180}{\pi }\right|,$$
$$radians=\mathrm{atan2}\left({c}_{y}-{b}_{y},\,{c}_{x}-{b}_{x}\right)-\mathrm{atan2}\left({a}_{y}-{b}_{y},\,{a}_{x}-{b}_{x}\right)$$
(1)

where \(a, b, c\) are the 2D coordinates such that \(a=\left({a}_{x},{a}_{y}\right)\), \(b=\left({b}_{x},{b}_{y}\right)\), \(c=({c}_{x},{c}_{y})\), and \(\left|\cdot \right|\) is the absolute value function, respectively.
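For reference, Eq. (1) translates directly into a few lines of Python; the helper name angle_2d is our own, and atan2 is used so that quadrant information is preserved (which is why angles above 180°, such as the 185° above, can occur).

```python
import math

def angle_2d(a, b, c):
    """Angle at vertex b formed by points a-b-c in 2D, per Eq. (1).

    a, b, c are (x, y) tuples. atan2 keeps quadrant information, so the
    absolute difference converted to degrees may exceed 180.
    """
    radians = math.atan2(c[1] - b[1], c[0] - b[0]) \
            - math.atan2(a[1] - b[1], a[0] - b[0])
    return abs(radians * 180.0 / math.pi)

# Example: a right angle at b.
# angle_2d((0.0, 0.0), (0.0, 1.0), (1.0, 1.0)) -> 90.0
```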

This example is just one situation among the 13 rehabilitation motions. To mitigate the errors caused by the information loss between 2D and 3D space, we introduce a machine learning approach in this paper.

2.2 Problem Formulation

We restrict the problem to a single situation among the 13 rehabilitation motions. Given a dataset \(D\), we want to learn how to predict more precise coordinate information for estimating ROMs. In our notation, the observed dataset is \(D={\left\{\left({o}_{i}, {y}_{i}\right)\right\}}_{i=1}^{N}\), where \({o}_{i}\) is the observed information expressed as a vector, \({y}_{i}\) is the label (true value), \(\left(\cdot \right)\) is a tuple containing a pair of data, and \(N\) is the total number of data tuples. The maximum dimension of our data is \({\mathbb{R}}^{1\times 24}\), since \(24=8\,points\,(landmarks)\times 3\,(x, y, z)\). As a first step toward this problem, we apply a simple but fast machine learning algorithm, multiple linear regression (MLR). We use the root mean squared error (RMSE) to evaluate our approach, stated as Eq. (2):

$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}},$$
(2)

where \({\widehat{y}}_{i}\) is the value predicted by MLR.
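Eq. (2) likewise maps to a short NumPy helper; the function name rmse is ours:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over N samples, per Eq. (2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```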

3 Experimental Setup

To gather \(D\), we conducted experiments for each rehabilitation motion stated in Table 1. The experiments were conducted with 13 test subjects (see Table 2). Each test subject performed each rehabilitation motion 5 times.

Fig. 4. Camera positions in the experiments: the numbers on the solid black lines are the distances between a camera and the test subject, the red numbers beside the cameras are the camera heights from the ground, and the numbers inside the yellow circles are the camera indexes.

Table 2. Anonymized test subjects

To gather various possible combinations of camera positions, we installed four additional cameras at different distances, heights, and angles relative to the test subject. In total, we have five camera configurations, including the camera on the rehabilitation measuring equipment (marked as ME in Fig. 4). The five cameras are placed at the 9, 10, 12, 2, and 3 o'clock directions relative to the test subject's line of sight, as portrayed in Fig. 4. All cameras record simultaneously and save the videos to remote storage (a server) to build the dataset \(D\) for machine learning. After finishing all experiments, we used ML-kit to extract all obtainable coordinates among the 33 landmarks. Since each camera yields 60 FPS (frames per second), we extract 60 images per second. In total, we secured 1,816,704 images from the experiments and extracted landmarks from all of them.

ML-kit extracted 1,773,234 landmark coordinate sets after dropping NONE values (frames for which ML-kit failed to produce coordinates), i.e., \(N=\mathrm{1,773,234}\) in \(D={\left\{\left({o}_{i}, {y}_{i}\right)\right\}}_{i=1}^{N}\). Each \({o}_{i}\) is a vector in \({\mathbb{R}}^{1\times 24}\) consisting of the 3D coordinates \((x, y, z)\) of landmarks #11, #12, #13, #14, #15, #16, #23, and #24 in the ML-kit landmark indexing. To acquire \({y}_{i}\) (the label, or equivalently ground truth, values), we built a mapping table for each action in Table 3.

Table 3. Camera mapping to generate \({y}_{i}\)
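To make the shape of \({o}_{i}\) concrete, the sketch below assembles the \(1\times 24\) feature vector from one frame's ML-kit landmarks; the constant and function names are our own illustrations, not part of ML-kit.

```python
import numpy as np

# ML-kit indices used for o_i: shoulders, elbows, wrists, hips.
FEATURE_LANDMARKS = [11, 12, 13, 14, 15, 16, 23, 24]

def to_feature_vector(landmarks):
    """Flatten 8 landmarks x (x, y, z) into a 1x24 vector o_i.

    `landmarks` is the 33-element list returned by ML-kit for one
    frame; frames with missing landmarks are dropped (the NONE values
    mentioned above).
    """
    if landmarks is None:
        return None
    feats = []
    for idx in FEATURE_LANDMARKS:
        lm = landmarks[idx]
        feats.extend([lm.x, lm.y, lm.z])
    return np.asarray(feats).reshape(1, 24)
```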

We set \({y}_{i}\) to be in \({\mathbb{R}}^{1\times 12}\), since the final prediction task in this paper is the right and left elbow angles. To compute the right and left angles, we apply Eq. (1) to landmarks #12, #14, and #16 for the right elbow and to landmarks #11, #13, and #15 for the left elbow, respectively. We divided \(D\) into training and test sets with a ratio of 80% to 20%, respectively. Finally, we trained an MLR model on the training set and predicted \({\widehat{y}}_{i}\) after training (i.e., after all training steps).
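A minimal sketch of this training pipeline, assuming scikit-learn and substituting random placeholder arrays for the real dataset \(D\):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# O: (N, 24) observed landmark features o_i; Y: (N, 12) label vectors y_i.
# Random placeholders stand in for the real dataset D (N = 1,773,234).
N = 1000
O = np.random.rand(N, 24)
Y = np.random.rand(N, 12)

# 80% / 20% train/test split, as described above.
O_train, O_test, Y_train, Y_test = train_test_split(
    O, Y, test_size=0.2, random_state=0)

mlr = LinearRegression()     # multiple linear regression (MLR)
mlr.fit(O_train, Y_train)    # fits one linear model per output column
Y_hat = mlr.predict(O_test)  # predicted coordinates, scored via Eq. (2)
```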

4 Result Analysis

4.1 Prediction Performance in RMSE

We predict \({y}_{i}\) of the test set (landmark #11, #12, #13, #14, #15, #16, #23 values) from \({o}_{i}\) of the test set via the trained MLR model. Let the predicted value be \({\widehat{y}}_{i}\). We then compute the RMSE between \({y}_{i}\) and \({\widehat{y}}_{i}\) using Eq. (2). The overall error is 0.07064; the smallest error, 0.02726, occurs in the right-wrist \(x\)-coordinate, and the largest error, 0.11332, in the left-elbow \(y\)-coordinate. The RMSE comparison is shown in Table 4 and visualized in Fig. 5.

Fig. 5. Prediction RMSE errors between \({y}_{i}\) and \({\widehat{y}}_{i}\).

Table 4. Prediction RMSE errors between \({y}_{i}\) and \({\widehat{y}}_{i}\).

4.2 Prediction Errors Between Left- and Right-Hand Sides

To compare the prediction behavior between the right and left sides, we report scatter plots of the shoulder and elbow in Fig. 6. We omit the other plots due to the page limit.

Fig. 6. Comparison of prediction errors between the left- and right-hand sides.

4.3 Performance of ROM Measurement

Since only the elbow angles are considered among the ROMs in this study, we averaged all RMSEs using the observed coordinates and the MLR-predicted coordinates, respectively. Note that the ground truth is \({y}_{angle\_from\_groundTruth}\), computed from \(y\) in Sect. 3, and we have two types of \(\widehat{y}\), namely \({\widehat{y}}_{angle\_from\_observe}\) and \({\widehat{y}}_{angle\_from\_MLR}\), for the angle comparison. The comparison between the two angle estimations is reported in Table 5.

Table 5. Prediction RMSE errors between elbow angle estimation approaches.
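To illustrate how the two estimates compared in Table 5 could be produced for a single frame, the sketch below reuses the angle_2d helper from Sect. 2.1; all coordinate values are made-up placeholders.

```python
# Landmarks #12/#14/#16 are right shoulder/elbow/wrist; the (x, y)
# values below are made-up placeholders, not measured data.
observed = {12: (0.55, 0.30), 14: (0.58, 0.45), 16: (0.54, 0.60)}
predicted = {12: (0.56, 0.31), 14: (0.57, 0.44), 16: (0.55, 0.59)}  # MLR output

angle_from_observe = angle_2d(observed[12], observed[14], observed[16])
angle_from_mlr = angle_2d(predicted[12], predicted[14], predicted[16])
# Each estimate is then scored against the ground-truth angle with Eq. (2).
```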

5 Conclusions

Estimating ROMs as accurately as possible is an important task in rehabilitation exercise. Since computer vision technology can plausibly be applied to detecting human landmarks, it is a promising direction for motion analysis in terms of cost effectiveness and time savings. However, detecting and measuring the landmarks accurately is not yet trivial. We addressed a way of measuring ROMs facilitated by a machine learning approach. In this paper, a simple MLR model is used to learn human elbow angles from data. We described how machine learning approaches can be applied and how to generate the training and test datasets. Even though the test cases and test subjects are limited, we believe the feasibility of our approach has been demonstrated sufficiently. Through our implementation and experiments, we showed that machine-learning-based ROM estimation is a viable approach for enhancing accuracy. If more advanced algorithms are tested and evaluated in our domain, further improvements can be achieved. We leave this as future research.