Introduction

Human activity recognition (HAR) aims to identify significant patterns from gestures and activities of the human body, which is extremely valuable in many applications. Recent advances in various technologies have made HAR research more dynamic, thus increasing demand in various industries [1]. HAR typically relies on multiple sensors, such as depth sensors, cameras, and wearable devices. However, data collection for HAR is a cost-intensive and challenging task because of privacy issues and data diversity problems [2]. In particular, it is exceedingly difficult to collect action data for people with disabilities. Nevertheless, disability data should be collected for HAR research because the number of people with disabilities continues to increase, and they take part in various activities in society together with non-disabled people [3]. Owing to the difficulty of collecting data on general actions in daily life, HAR research for people with disabilities needs to be conducted using data on exercises. Additionally, in HAR, people with disabilities should be considered separately from non-disabled people because the type of disability restricts which actions can be performed.

When collecting data for people with disabilities, the accuracy of the HAR task is particularly vulnerable to the class imbalance problem. The class imbalance problem occurs when the distribution of classes within a dataset is extremely skewed, that is, there are many more examples in one class (majority class) than in other classes (minority classes) [4]. This is an inherent problem that exists in a wide range of fields, such as facial and image recognition [5], medical diagnosis [6], and fraud detection [7]. Most studies assume that the data are balanced among different classes. However, in the case of data on people with disabilities, it is difficult to collect a perfectly balanced dataset. At the same time, identifying the minority disability classes is critical for assessing risks that may occur while people with disabilities exercise. Improving the overall accuracy of the classifier is therefore not sufficient, because in supervised learning the classifier may overfit to the majority classes [8].

Many studies have presented solutions to address the class imbalance problem. The most conventional approaches are the undersampling and oversampling methods. These methods balance the class ratio of imbalanced data by sampling from the original data. Another approach is the cost-sensitive learning method, in which machine learning algorithms, such as neural networks, are trained with a cost function that accounts for the class imbalance [9]. Ghorbani et al. [9] proposed modifying the cost function of neural networks with respect to relative weights based on class imbalance ratios. Recently, several augmentation studies have focused on capturing spatial and temporal information for different tasks, such as action recognition [10, 11]. However, data augmentation for HAR is difficult because recognizing human action is a high-dimensional task that requires the extraction of spatio-temporal information from human skeleton sequences [12, 13]. Few research articles have explored data augmentation techniques for the HAR task. Moreover, progress in related research has slowed because of the lack of data on people with disabilities.

Thus, we focus on alleviating the class imbalance problem by proposing a data augmentation method to improve the accuracy of disability classification. This study makes the following major contributions:

  1. We propose the state transition-oriented conditional variational autoencoder (STO-CVAE), a data augmentation method specialized for human action recognition per type of disability, to resolve the class imbalance problem for the HAR task.

  2. We provide a data-transformation method for capturing action from human pose estimation (HPE) to obtain action-centric data based on state transitions.

  3. We examine the effectiveness and superiority of the proposed approach by conducting various experimental studies on real-world datasets.

The remainder of this paper is structured as follows. “Related works” reviews the related literature. The next section describes the proposed method. In “Experiments and results”, we present the experimental results for a real-world dataset. The final section summarizes the conclusions of the study and explores directions for future research.

Related works

This section presents an overview of the existing literature on four related topics: the data imbalance problem, HAR, HPE, and classification of tabular datasets. This study focuses on the data imbalance problem in classifying the disability type using video datasets. Additionally, HAR and HPE techniques are reviewed because they are used to recognize the actions of disabled people from videos. Finally, we examine the classification algorithms, especially for action recognition.

Data imbalance

Many studies have been conducted in various fields to solve the data imbalance problem. Traditionally, there are two main approaches to dealing with the imbalance problem: data-based and algorithm-based methods [14, 15].

The data-based methods adjust the class ratio of the training dataset through sampling, which can be divided into undersampling and oversampling. Undersampling techniques remove examples from the majority class to balance the class ratio of the dataset [16]; examples include random undersampling, the condensed nearest neighbor rule, and Tomek links [17, 18]. These methods reduce training time, but informative samples from the majority class may be lost and unpredictable bias may be introduced, potentially degrading accuracy [19].

On the other hand, oversampling techniques add examples to the minority class until its ratio matches that of the majority class. The oversampling approach includes random and synthetic sampling methods. Random oversampling (ROS) randomly replicates minority samples until the minority class has the same ratio as the majority class. However, it may cause overfitting, because ROS simply duplicates samples from the original dataset, which is not conducive to the generalization performance of the classifier [20]. Meanwhile, as a representative example of synthetic sampling methods, the synthetic minority oversampling technique (SMOTE) generates data from the minority classes using the k-nearest neighbor (k-NN) algorithm [21]. However, generating synthetic data based on the k-NN algorithm is inadequate when the data are high-dimensional, nonlinear, or complex. Research on the data imbalance problem has also become more active for small datasets than for big datasets. For small datasets, data augmentation methods have been developed that increase prediction accuracy by adjusting the ratio of synthetic data to solve the data imbalance problem [22].
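The two oversampling ideas above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the cited works: the function names `random_oversample` and `smote_like` are hypothetical, and the second function is a simplified SMOTE-style interpolation between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Replicate randomly chosen samples until every class matches the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)

def smote_like(X_min, n_new, k, rng):
    """Create n_new points by interpolating toward one of the k nearest minority neighbors."""
    # Pairwise Euclidean distances inside the minority class (requires k < len(X_min)).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])
```

Because `random_oversample` only copies rows, it adds no new information, whereas `smote_like` produces new points on line segments between existing minority samples, which is exactly where it struggles when the data are high-dimensional or nonlinear.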

Generative models

The most representative models for generating synthetic data fall into three families: generative adversarial networks (GANs), diffusion models, and variational autoencoders (VAEs). GANs are a type of generative model in which two neural networks, a generator and a discriminator, compete against each other. Because GANs generate samples quickly and accurately, they have shown remarkable success in applications such as images, videos, and text [23]. Although GANs have attracted the attention of researchers owing to their remarkable performance and wide applicability, this success has been largely limited to image data. For tabular data, which accounts for a large proportion of real-world data, GANs are vulnerable to multi-modal distribution problems [25, 26]. To overcome this, TGAN and CTGAN, which specialize in tabular data, have been developed [27]. CTGAN addresses continuous variables that follow non-Gaussian, multi-modal distributions and imbalanced categorical variables by proposing mode-specific normalization and training-by-sampling methods. In the training phase, the generator and discriminator are trained with conditional vectors and the WGAN-GP loss function to generate tabular data similar to real data [28]. However, the purpose of GAN-based models is not to learn the distribution of the data, but to generate synthetic data that are as similar to the real data as possible.

Diffusion models are a class of generative models that generate samples directly from Gaussian noise by leveraging the concept of diffusion processes. They have shown promising results in generating realistic high-dimensional data, but their suitability for tabular data is still an open area of research [24]. This study does not deal with unstructured high-dimensional data, so diffusion models are out of scope.

VAEs are another type of generative model, consisting of encoder and decoder networks and a latent space. VAEs are well known for their ability to generate diverse synthetic data in terms of distribution, so they are regarded as a suitable technique for generating tabular datasets [24]. Because a VAE generates synthetic data from the distribution of the real data, learned through variational inference, it can build higher-quality datasets. The VAE generates synthetic data from random noise following a Gaussian distribution. Its variants include the \(\beta\)-VAE and the conditional VAE (CVAE) [29, 30]. \(\beta\)-VAE introduces a relative weight \(\beta\) on the Kullback–Leibler (KL) divergence loss against the reconstruction loss in the VAE loss function [31]. Meanwhile, the CVAE adds conditional information, such as label values, to improve the embedding space learning. Wang et al. [32] demonstrated that CVAE outperforms \(\beta\)-VAE in terms of the reconstruction loss for the MNIST and Fashion MNIST datasets. As such, algorithm-based methods adjust the loss function to concentrate on improving performance on the minority class.

Human action recognition (HAR)

Action recognition is an important task that has received considerable attention for decades owing to its role in health care monitoring systems and other real-world applications [33, 34]. Existing studies are largely divided into RGB frame-based and human skeleton-based studies. Recently, skeleton-based HAR, which focuses on the positions of human joints instead of the RGB representation, has received considerable attention [34]. This is because joint positions make action recognition robust to environmental noise (e.g., placement constraints, a variety of costumes) [35, 36]. However, since skeleton-based HAR methods extract skeleton information per frame, temporal information is difficult to take into account. Moreover, the same action can span different time intervals. In our study, rehabilitation exercise videos are used for extracting action information according to the disability type.

In general, the existing literature offers three main approaches to skeleton-based action recognition: handcrafted feature-based [37], deep learning [38, 39], and pose estimation methods [40]. Handcrafted feature-based methods have limitations in capturing joint relationships with spatial features. Meanwhile, spatial and temporal information can be obtained with graph neural network-based methods [41, 42]. However, these methods have difficulty recognizing repetitive actions performed at irregular time intervals, and they do not consider the differences in accuracy among the x-, y-, and z-axis values in three-dimensional (3D) space. These limitations show that accurately estimating human poses is difficult because the human posture has many degrees of freedom and suffers from occlusion problems [43]. Various HPE methods have been proposed to overcome these problems [44, 45]. HPE can be more accurate by predicting only the positions of the joints in each frame. Some estimators have also been developed to predict joint positions more accurately in fields where joints require careful handling, such as physical medicine and rehabilitation [46, 47]. In our study, we use HPE with the proposed algorithm for rehabilitation exercise recognition.

Human pose estimation (HPE)

Recently, HPE has become one of the most important tasks in computer vision for estimating specific joint positions in the human body for HAR [48]. HPE is divided into two approaches: 2D and 3D pose estimation methods [49]. With recent advances in deep learning-based methods, 2D pose estimation techniques have shown high performance. For example, PoseNet and OpenPose are representative 2D HPE models [50, 51]. However, many real-world actions, such as fitness and yoga exercises, can be recognized more accurately by capturing them in 3D space than in 2D space. 3D pose estimation focuses on the occlusion problem caused by the camera direction or environmental constraints. Recent studies in deep learning architectures have led to significant progress in 3D pose estimation, especially in lifting 3D poses from a single camera [52]. These models require large-scale datasets to achieve the generalization capability needed to accurately estimate 3D poses [53]. Further, recent work favors 3D HPE from a single image because it is a more accessible and convenient solution for real-world applications. 3D HPE contains two major approaches: the direct 3D estimation approach and the 2D-to-3D lifting approach. The former obtains 3D poses from 2D images or videos in an end-to-end manner, while the latter extracts 2D keypoints and lifts them to 3D [67,68,69]. Several studies focus on input image normalization and camera calibration with semi-supervised learning to make this transformation more accurate [70, 71].

Meanwhile, three key points for HPE research are highlighted: (1) accurate pose estimation, (2) real-time processing, and (3) lightweight model architecture [72]. First, many studies focus on precisely predicting the positions and angles of keypoints. Second, HPE must be efficient even in real-time environments: an optimized model structure and efficient processing allow human poses to be estimated quickly from live video streams. Third, a lightweight model architecture requires few resources, making it suitable for deployment, portability, and usability in real-world scenarios. Therefore, regressor-based models like BlazePose are more suitable for on-device environments than heatmap-based models [54].

BlazePose is a 3D pose estimation model developed by Google. It uses a detector-tracker machine learning pipeline. The detector determines the region of interest within the frame, which is a human object, by first detecting the face of the person; that is, the face detector is used as a proxy for the human detector [55]. It also predicts three additional alignment features: the midpoint of a person’s hip, the radius of the circle surrounding the entire person, and the inclination angle of the line connecting the shoulder and midpoint of the hip.

The BlazePose tracker predicts the presence of a person and the \((x, y, z)\) coordinates of 33 points on the human body using a new topology, shown in Fig. 1, which is a superset of the Common Objects in Context, BlazeFace, and BlazePalm topologies [56]. Unlike conventional approaches that use compute-intensive heatmap prediction, the pose estimation tracker of BlazePose uses a regression approach that is supervised by combined heatmap and offset predictions for all keypoints. During training, BlazePose uses a heatmap and the offset loss to train a heatmap-based network, as shown in Fig. 2. After learning the left and center towers, BlazePose removes the heatmap output and learns the regression-based network. Consequently, the model remains lightweight while still benefiting from the heatmap result.

Fig. 1

Estimated 33 keypoints topology in BlazePose

Fig. 2

Training BlazePose architecture: BlazePose uses a heatmap and the offset loss to learn the center and left towers. In the right towers, the outputs of the heatmap-based network are combined and used to train the regression encoder

Classification for tabular dataset

In this subsection, we review algorithms for classifying disability types from keypoint-based actions, which are typically represented as tabular datasets. The existing research points out that machine learning algorithms, such as ensemble methods, have outperformed deep learning models when dealing with tabular datasets [57]. Thus, we describe four competitive machine learning algorithms together with TabNet, a deep learning model specialized for tabular datasets [58].

The extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) are representative algorithms for tabular datasets. XGBoost is a model that is sequentially learned and ensembled by weighting the errors of weak decision trees. It performs robustly on classification and regression tasks [59]. Meanwhile, LightGBM copes with high-dimensional data through efficient learning by adopting the gradient-based one-side sampling and the exclusive feature bundling algorithms [60]. The random forest algorithm is a bootstrap aggregating (bagging)-based model that is used in tasks dealing with multivariate tabular datasets [61]. It deals with missing values and avoids overfitting by combining multiple decision trees with the random subspace method. Further, it finally obtains the predictions through a voting mechanism [62]. The support vector machine (SVM) algorithm is a popular discriminative model that utilizes decision boundaries using support vectors in the data space. SVM is effective in dealing with high-dimensional structured data because it uses the kernel trick [63].

TabNet, a tabular data-oriented deep learning algorithm, was developed by Google [58]. It has an autoencoder structure that combines an encoder and a decoder to enable supervised and unsupervised learning. As shown in Fig. 3, TabNet consists of three components in the encoder section, namely the feature transformer, attentive transformer, and mask, in one step, which offers the advantage of decision tree-based gradient boosting. The feature transformer, which performs the encoding for each step, consists of four layers, and each layer consists of three blocks in sequence: a fully connected (FC) layer, batch normalization, and a gated linear unit (GLU). Using these sequential attentions, TabNet improves accuracy.

Fig. 3

TabNet encoder architecture: (a) the TabNet encoder consists of three components: feature transformer, attentive transformer, and mask; (b) the attentive transformer consists of prior scales and sparsemax for mask; (c) the feature transformer has four layers, of which two are shared across all decision steps and two are decision step-dependent
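To make the FC, batch normalization, and GLU sequence of one feature transformer layer more concrete, the following is a minimal NumPy sketch of a single such block. It is an illustration under simplifying assumptions, not the TabNet implementation: the name `glu_block` is hypothetical, batch normalization is computed from the current batch without learnable scale and shift, and the residual connections between layers are omitted.

```python
import numpy as np

def glu_block(x, W, b, eps=1e-5):
    """One FC -> batch normalization -> GLU block.

    x: (batch, d_in) inputs; W: (d_in, 2 * d_out) weights; b: (2 * d_out,) bias.
    Returns an array of shape (batch, d_out).
    """
    h = x @ W + b                                      # fully connected layer
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)  # batch normalization
    value, gate = np.split(h, 2, axis=1)               # GLU splits the features in half
    return value * (1.0 / (1.0 + np.exp(-gate)))       # value * sigmoid(gate)
```

The gate lets the block suppress individual features per sample, which is one reason the FC layer produces twice as many outputs as the block returns.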

Proposed method

This section presents an overview of the proposed method, which handles the data imbalance problem for rehabilitation exercise datasets. First, we extract 33 keypoints from rehabilitation exercise videos of a rowing exercise using HPE. Second, we perform data augmentation using the proposed model, STO-CVAE: rather than simply learning the keypoints, it transforms them into samples that reflect state transition-based human action and then generates action-based data (Fig. 4).

Fig. 4

Overall procedure of the proposed STO-CVAE data augmentation method. (a) The input contains the estimated keypoints from the frame image of the exercise video. (b) State transition step. (c) CVAE structure. (d) The final output is the augmented data generated by STO-CVAE using the state transitions of the exercise positions

Data collection using HPE

Human exercise posture depends on the flow of combinations of keypoints over time. Therefore, it is vital to evaluate a person’s athletic ability using a combination of keypoints.

Until recently, public datasets for keypoints did not exist in the field of exercises for disabled people. Therefore, we collected the data for this research directly, on-device. This study can serve as a reference for research on action recognition by disability type. We collected videos of rehabilitation exercises from the National Rehabilitation Center, which is the primary division supporting and providing welfare programs for disabled people in South Korea. This study focused on rowing exercises among the ten rehabilitation exercises performed at the center. In the rowing exercise, the patients extend their arms to the right and left sequentially, at a constant speed. We constructed a keypoint-based tabular dataset from the video data. Each frame yields 99 feature values, namely the 3D coordinates of 33 keypoints. Figure 5 shows the process of converting a video into a tabular dataset using HPE, which estimates 33 human keypoints at 30 frames/s.
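As a small illustration of how per-frame keypoints become rows of a tabular dataset, the sketch below flattens an array of 33 keypoints with three coordinates each into 99 features per frame. The function name `frames_to_table` and the column naming scheme (`kp0_x`, `kp0_y`, ...) are assumptions made for this example; the source does not specify the column order.

```python
import numpy as np

N_KEYPOINTS = 33
AXES = ("x", "y", "z")

def frames_to_table(frames):
    """Flatten keypoints of shape (n_frames, 33, 3) into a (n_frames, 99) table.

    Columns are ordered keypoint by keypoint: kp0_x, kp0_y, kp0_z, kp1_x, ...
    (an assumed layout for illustration).
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 3 or frames.shape[1:] != (N_KEYPOINTS, len(AXES)):
        raise ValueError("expected an array of shape (n_frames, 33, 3)")
    table = frames.reshape(len(frames), -1)
    columns = [f"kp{k}_{axis}" for k in range(N_KEYPOINTS) for axis in AXES]
    return table, columns
```

At 30 frames/s, one second of video would therefore contribute 30 rows of 99 values each.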

Fig. 5

Process of collecting keypoint data using the HPE model. (Left) We use BlazePose to extract 33 human keypoints from several exercise videos performed by disabled patients. (Right) The output is a tabular dataset of keypoint coordinate values per sample

The data for 19 people were collected and classified into seven classes based on the disability types: non-disabled (ND), cerebral palsy (CP), cerebral lesion (CL), spinal cord injury (SCI), muscular dystrophy (MD), intellectual disability (ID), and autism spectrum disorder (ASD). The class ratios were extremely imbalanced, as shown in Fig. 6. Specifically, the CL, ID, and ASD classes, each with a ratio of no more than 0.1, were considered minority classes, while the remaining classes were treated as majority classes.

Fig. 6

Total data examples per class, revealing that our dataset has a class imbalance problem

Human action state-centric transformation

Figure 7 shows the process of generating action samples through the proposed state transition algorithm. The proposed algorithm can capture temporal patterns in continuous video frames regardless of the number of frames. Figure 4b shows the state transition steps from the data to the state transition-oriented transformation. First, a frame (image) extracted from the video is used to estimate the topological skeleton of the human body via the HPE model. Then, the estimated skeleton is represented as several keypoints for major joints in the body, and each keypoint contains \(x, y,\) and \(z\) values. We recognize human action from the estimated keypoints over multiple sequential frames because it is hard to recognize the action from a single frame.

Fig. 7

The process of generating action samples using a state transition algorithm. On the left side, frames with skeletons are generated by BlazePose. On the right side, action samples with temporal information spanning multiple frames are generated using the state transition algorithm

Second, for robust action recognition, the sequential frames should be divided into frame groups according to a specific position state, which is an action element. For the rowing exercise, the wrist keypoints of both hands were used to split the positional states. As the rowing exercise involves moving both hands up and down, we considered two states: 0 and 1. We identified each state by comparing the keypoints of both wrists with the threshold \(\gamma\), which is the average of the keypoint \(y\)-values of the shoulders and hips. The formula for \(\gamma\) is defined as follows:

$$\gamma =\frac{{y}_{{\text{left-shoulder}}}+{y}_{{\text{right-shoulder}}}+{y}_{{\text{left-hip}}}+{y}_{{\text{right-hip}}}}{4}.$$
(1)

In state 0, the average of the \(y\)-values of the two wrists is less than the threshold \(\gamma\); in state 1, it is greater than or equal to \(\gamma\). Figure 8 illustrates frames of position states 0 and 1.

Fig. 8

The position state takes one of two values. If the average \({\varvec{y}}\) of both wrists is below a threshold \({\boldsymbol{\gamma}}\), then the frame is labeled position state 0 (left); otherwise, it is labeled position state 1 (right)
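The threshold in Eq. (1) and the resulting state labeling can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `position_state` is hypothetical, and the landmark indices assume the ordering of the 33-point BlazePose topology as exposed by MediaPipe (11/12 shoulders, 15/16 wrists, 23/24 hips).

```python
import numpy as np

# Assumed landmark indices (MediaPipe ordering of the 33 BlazePose keypoints).
L_SHOULDER, R_SHOULDER = 11, 12
L_WRIST, R_WRIST = 15, 16
L_HIP, R_HIP = 23, 24
Y_AXIS = 1  # column of the y coordinate in an (x, y, z) triple

def position_state(frame):
    """Label one frame of shape (33, 3) with position state 0 or 1.

    gamma is the mean y of both shoulders and both hips (Eq. 1); the frame is
    state 0 if the mean wrist y is below gamma and state 1 otherwise.
    """
    frame = np.asarray(frame, dtype=float)
    gamma = frame[[L_SHOULDER, R_SHOULDER, L_HIP, R_HIP], Y_AXIS].mean()
    wrist_y = frame[[L_WRIST, R_WRIST], Y_AXIS].mean()
    return 0 if wrist_y < gamma else 1
```

Because the threshold is computed from the same frame's shoulders and hips, the labeling does not depend on where the person stands in the image.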

Third, we recognize changes in the position state in a way that is robust to the different motion transition times of each exerciser. Here, a state transition is proposed to transform keypoints into positional-state-based actions. These action-based states can reflect clearly different patterns according to the disability type. Figure 4b shows the process of extracting the representative frame for each frame group. If \(k\) sequential frames form one group \({g}_{i}\) = {\({f}_{1}, \ldots , {f}_{k}\)} and \(l\) sequential frames form the next group \({g}_{i+1}\)=\(\left\{{f}_{k+1}, \ldots , {f}_{k+l}\right\}\), then \({r}_{{g}_{i}}= {f}_{\lfloor\frac{k}{2}\rfloor}\) and \({r}_{{g}_{i+1}}= {f}_{\lfloor k+\frac{l-1}{2}\rfloor}\) are the representative frames of \({g}_{i}\) and \({g}_{i+1}\), respectively.

Fourth, we calculate the difference between the representative frames of \({g}_{i+1}\) and \({g}_{i}\) to reflect the state transition-oriented topological information. Since the representative frames \({r}_{{g}_{i}}\) and \({r}_{{g}_{i+1}}\) of the frame groups are 99-dimensional vectors of keypoints, the new sample \({x}_{i}\) contains state transition-based keypoint values. The formula for \({x}_{i}\) is as follows:

$${x}_{i}= {r}_{{g}_{i+1}}-{r}_{{g}_{i}},$$
(2)
$${r}_{{g}_{i}}=\left({r}_{{g}_{i}, 1}, {r}_{{g}_{i}, 2}, \ldots ,{r}_{{g}_{i}, 99}\right)\in {\mathbb{R}}^{99},$$
(3)

where \({x}_{i}\) is the state transition between \({r}_{{g}_{i}}\) and \({r}_{{g}_{i+1}}\), which are the representative frame keypoint coordinates of the groups \({g}_{i}\) and \({g}_{i+1}\), respectively, and \(i\) is the group index. In addition, a frame count variable is added to the sample \({x}_{i}\); it is defined as the sum of the numbers of frames in frame groups \({g}_{i}\) and \({g}_{i+1}\). Therefore, we can obtain a transformed sample \({x}_{i}\) that reflects the sequential information of exercises through state transition. The samples are divided into two actions: (i) arms from the head to the side of the hip and (ii) arms from the side of the hip to the head. The pseudocode of the state transition is presented in Algorithm 1.

Algorithm 1:

State transition estimation
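The state transition estimation described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' Algorithm 1: the function name `state_transition_samples` is hypothetical, the per-frame states are assumed to be given already, and the representative frame is taken as the middle frame of each group, which is one plausible reading of the rounding rule in the text.

```python
import numpy as np

def state_transition_samples(features, states):
    """Turn per-frame keypoints into state transition samples.

    features: (n_frames, 99) keypoint coordinates per frame.
    states:   (n_frames,) position state (0 or 1) per frame.
    Returns an array of shape (n_groups - 1, 100): the 99 differences between
    consecutive representative frames (Eq. 2) plus the frame count variable.
    """
    features = np.asarray(features, dtype=float)
    states = np.asarray(states)
    # A new frame group starts wherever the position state changes.
    change = np.flatnonzero(np.diff(states) != 0) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(states)]))
    lengths = ends - starts
    # Representative frame: the middle frame of each group (assumed rounding).
    reps = features[starts + (lengths - 1) // 2]
    diffs = reps[1:] - reps[:-1]                       # x_i = r_{g_{i+1}} - r_{g_i}
    counts = (lengths[1:] + lengths[:-1]).astype(float)  # frames in g_i and g_{i+1}
    return np.column_stack([diffs, counts])
```

Since each sample is built from two whole frame groups, slow and fast exercisers both yield one sample per transition, and the difference in speed is retained only through the frame count feature.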

State transition-oriented conditional variational autoencoder (STO-CVAE)

CVAE, which is an extension of VAE, utilizes class labels as a condition to more accurately reflect the class-dependent properties in generating synthetic data [21]. It differs from VAE in that it trains in consideration of certain conditions in encoding and decoding. CVAE aims to maximize the value of the marginal likelihood for the distribution model \(p\) under the given conditions. The marginal likelihood is expressed as follows:

$${\text{log}}\,p(x)={D}_{KL}\left({q}_{\varnothing }\left(z|x, c\right)\,\Vert\,p\left(z|x, c\right)\right)+{\mathcal{L}}\left(\theta , \varnothing ; x, c\right),$$
(4)

where \({q}_{\varnothing }\) is the approximate posterior, \(p(z|x, c)\) is the true posterior distribution of the latent variable \(z\) given the input data \(x\) and condition \(c\), and \(c\) is the class label of \(x\). The second term \({\mathcal{L}}\) is the evidence lower bound (ELBO). Since the KL divergence is non-negative, the ELBO is a lower bound on the marginal log-likelihood of \(p\). Then, (4) changes as follows:

$${\text{log}}\,p(x)\ge {\mathcal{L}}\left(\theta , \varnothing ;x, c\right),$$
(5)
$$\begin{aligned} {\mathcal{L}}\left(\theta , \varnothing ;x,c\right)&= {\mathbb{E}}_{{q}_{\varnothing }\left(z|x,c\right)}\left[{\text{log}}\,{p}_{\theta }\left(x|z,c\right)\right]\\ & \quad - {D}_{KL}\left({q}_{\varnothing }\left(z|x,c\right)\,\Vert\,{p}_{\theta }\left(z|c\right)\right),\end{aligned}$$
(6)

where \({p}_{\theta }\left(x|z,c\right)\) is the likelihood given by the decoder and \({p}_{\theta }\left(z|c\right)\) is the prior distribution of \(z\). From (5), the problem of maximizing the marginal likelihood of \(p\) is replaced by the problem of maximizing the ELBO or, equivalently, minimizing its negative. In Eq. (6), the expectation in the first term would have to be estimated using the Monte Carlo gradient method during backpropagation for CVAE training. However, a naive Monte Carlo gradient estimator is unsuitable because of its high variance. The CVAE overcomes this problem via the reparameterization trick, which uses random variables \(\varepsilon \sim {\mathcal{N}}(0,{\sigma }_{\varepsilon })\) drawn from a Gaussian distribution instead of sampling \(z\sim {q}_{\varnothing }\left(z|x, c\right)\) directly. As a result, the negative of the first term in (6) is the negative log-likelihood, that is, the reconstruction error, and the second term in (6) is the KL regularization toward the prior distribution for sampling \(z\). Consequently, the training loss can be written as follows:

$${L}_{{\text{total}}}= {L}_{{\text{recon}}}+ {L}_{{\text{KL}}},$$
(7)
$${L}_{{\text{recon}}}=\sum_{j=1}^{n}{({x}_{i,j}-{\widehat{x}}_{i,j})}^{2},$$
(8)
$${L}_{{\text{KL}}}= \frac{1}{2}\sum_{j=1}^{l}\left({\mu }_{i,j}^{2}+ {\sigma }_{i,j}^{2}-{\text{ln}}\left({\sigma }_{i,j}^{2}\right)-1\right),$$
(9)

where \(n\) is the number of features of \({x}_{i}\), \({\widehat{x}}_{i}\) is the reconstructed (synthetic) sample of \({x}_{i}\), \({\mu }_{i,j}\) and \({\sigma }_{i,j}\) are the mean and standard deviation of the \(j\)-th latent dimension produced by the encoder, and \(l\) is the size of the latent vector \(z\), which is sampled using \(\varepsilon \sim {\mathcal{N}}(0,{\sigma }_{\varepsilon })\). Here, we use the mean squared error instead of the binary cross entropy of the original CVAE for multiclass classification, as shown in Eq. (8) [64, 65].
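The reparameterization trick and the loss terms can be sketched numerically as below. This is a minimal NumPy illustration under stated assumptions, not the training code of the paper: the function names are hypothetical, the encoder is assumed to output the log-variance \(\ln \sigma^2\), the noise is drawn from a standard normal distribution, and the reconstruction term is taken as the positive summed squared error so that it is minimized together with the KL term, as in Eq. (11).

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (assumed standard normal)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_loss(mu, log_var):
    """KL term of Eq. (9), with sigma^2 = exp(log_var), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def recon_loss(x, x_hat):
    """Squared reconstruction error summed over the features."""
    return np.sum((x - x_hat) ** 2, axis=-1)

def cvae_loss(x, x_hat, mu, log_var):
    """Total loss of Eq. (7): reconstruction term plus KL term."""
    return recon_loss(x, x_hat) + kl_loss(mu, log_var)
```

The sampling step keeps the randomness in `eps`, so gradients can flow through `mu` and `log_var` in a framework with automatic differentiation; the KL term is zero exactly when the encoder outputs a zero mean and unit variance.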

Based on the state transition estimation, we converted the continuous frames estimated by BlazePose into action samples with temporal information. However, the estimates are uncertain owing to varying resolution, self-occlusion, and the complexity of the action. Figure 9 presents boxplot results showing that the uncertainty of the z-axis difference value is significantly greater than those of the other axes. To mitigate this issue, we propose STO-CVAE, a data augmentation model that utilizes an action sample-based state transition algorithm. STO-CVAE incorporates relative weights based on the inter-axis uncertainty observed in 3D space. In STO-CVAE, the reconstruction loss term is reformulated for the state transition as follows. Here, we assign a smaller weight to the estimated z-axis values to reduce the side effect of the \(z\)-axis uncertainty of the keypoint estimations. The total loss of STO-CVAE is as follows:

$${L}_{{\text{total}}\_{\text{STO}}}= {L}_{{\text{recon}}\_{\text{STO}}}+ {L}_{{\text{KL}}},$$
(10)
$${L}_{{\text{recon}}\_{\text{STO}}}={w}_{z}\sum_{k=1}^{\left|{S}_{z}\right|}{({x}_{i,k} - {\widehat{x}}_{i,k})}^{2}+\left(1-{w}_{z}\right)\sum_{l=1}^{\left|{S}_{z-}\right|}{\left({x}_{i,l} - {\widehat{x}}_{i,l}\right)}^{2},$$
(11)

where \({x}_{i, k}=\Delta {r}_{i, k}\), \({S}_{z}=\left\{s\,|\,s \text{ is a keypoint value of the }z{\text{-axis}}\right\}\), and \({S}_{z-}=\left\{s\,|\,s\notin {S}_{z}\right\}\). The set of keypoint values \(S\) is the union of \({S}_{z}\), the keypoint values of the \(z\)-axis, and \({S}_{z-}\), the keypoint values of the \(x\)- and \(y\)-axes (\(S={S}_{z}\cup {S}_{z-}\), \(|S| = |{S}_{z}| + |{S}_{z-}| = 33 + 66 =99\)). \({w}_{z}\) is the weight for the differences in the \(z\)-axis keypoint values of \({x}_{i}\), and \({r}_{i, s}\) is the \(s\)-th (\(s\in S\)) element of the representative frame of the group \({g}_{i}\). In Eq. (11), \({L}_{{\text{recon}}\_{\text{STO}}}\) is the loss term that applies a relative weight to account for the uncertainty in estimating the \(z\)-axis values. We set \({w}_{z}\) to 0.3, a value determined experimentally.
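The weighted reconstruction term of Eq. (11) can be sketched as follows. This is an illustrative NumPy version under assumptions, not the authors' implementation: the function name `sto_recon_loss` is hypothetical, the 99 features are assumed to be stored keypoint by keypoint as (x, y, z) triples so that every third value is a z coordinate, and the frame count variable is assumed to be excluded from this term.

```python
import numpy as np

def sto_recon_loss(x, x_hat, w_z=0.3):
    """Axis-weighted squared error of Eq. (11) over 99 keypoint features.

    z-axis features (33 values) are weighted by w_z and the x- and y-axis
    features (66 values) by 1 - w_z. Feature layout is an assumption:
    index % 3 == 2 marks a z coordinate.
    """
    sq = (np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)) ** 2
    z_mask = np.arange(sq.shape[-1]) % 3 == 2
    loss_z = sq[..., z_mask].sum(axis=-1)
    loss_xy = sq[..., ~z_mask].sum(axis=-1)
    return w_z * loss_z + (1.0 - w_z) * loss_xy
```

With \({w}_{z}=0.3\), a unit error on a z coordinate costs less than half as much as the same error on an x or y coordinate, so the noisier depth estimates pull less on the decoder.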

Fig. 9

Boxplots indicating the variances of the estimated keypoints, specifically for the left ear, right shoulder, and left foot. It is evident that the uncertainty of z-axis value is large in all three keypoints

Class labels were used as condition \(c\), for the seven classes (ND, CP, CL, SCI, MD, ID, and ASD). We defined the condition by replacing the string with a numerical value as follows: \(\text{``ND''}\to 1/7\), \({\text{``CP''}}\to 2/7\), \({\text{``CL''}}\to 3/7\), \({\text{``SCI''}}\to 4/7\), \({\text{``MD''}}\to 5/7\), \({\text{``ID''}}\to 6/7\), and \({\text{``ASD''}}\to 7/7\).
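The label-to-condition mapping above is simple enough to write down directly. In the sketch below, the names `CONDITION`, `encode_condition`, and `with_condition` are hypothetical, and appending the condition as an extra input column is a common CVAE convention assumed here for illustration; the source does not state how the condition is fed to the networks.

```python
import numpy as np

CLASSES = ("ND", "CP", "CL", "SCI", "MD", "ID", "ASD")
# "ND" -> 1/7, "CP" -> 2/7, ..., "ASD" -> 7/7
CONDITION = {name: (i + 1) / len(CLASSES) for i, name in enumerate(CLASSES)}

def encode_condition(labels):
    """Map class label strings to their numerical condition values."""
    return np.array([CONDITION[label] for label in labels], dtype=float)

def with_condition(x, labels):
    """Append the condition value as one extra column to each sample (assumed convention)."""
    return np.column_stack([np.asarray(x, dtype=float), encode_condition(labels)])
```

At generation time, the same value can be appended to a sampled latent vector to request a synthetic sample of a given class, for example a minority class such as "ID".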

Four layers were stacked in the encoder and decoder to build the STO-CVAE. Each layer comprised three components: a dense layer, a batch normalization layer, and a dropout layer. The leaky ReLU activation function was used in each layer. However, we used a hyperbolic tangent only in the last output layer because our dataset values were continuous between − 1 and 1. The pseudocode of the STO-CVAE training is presented in Algorithm 2.

Algorithm 2:

STO-CVAE in the training phase

Experiments and results

In the experiments, we used only directly collected datasets because no public data are available for rehabilitation exercises according to the disability type. For data augmentation with state transitions, we first divided the dataset into four subsets based on actions to improve the data generation performance. To evaluate the performance of STO-CVAE, a total of 100 test samples, representing all seven disability type classes, were generated using the state transition algorithm. The quality of the synthetic samples was assessed based on the improvement in classification performance. Additionally, a sensitivity analysis of STO-CVAE training was conducted to verify the consistent convergence of data augmentation for the four actions. We used two state-of-the-art upsampling-based augmentation methods to determine the number of augmented samples per class. As classifiers, we used five major machine learning algorithms: random forest, SVM, XGBoost, LightGBM, and TabNet [57]. We compared the accuracy and F1-score before and after data augmentation for all combinations.

Dividing the dataset based on state transitions

Each sample representing the action state has different modalities in the feature space. The principal component analysis (PCA) demonstrates that all actions have four individual clusters in two principal components, as shown in Fig. 10. These clusters are divided into four actions in the rowing exercise: (i) arms down left; (ii) arms up left to right middle; (iii) arms down right; and (iv) arms up right to left middle. Thus, we divided the dataset according to the exercise direction and position state to enhance the performance of data generation: position state 0—right (P0_R), position state 0—left (P0_L), position state 1—right (P1_R), and position state 1—left (P1_L).

Fig. 10

Dimensionality reduction results with PCA and t-SNE for the original samples. The figure in the top row on the left illustrates the overlap that arises when the data are embedded without separating them by position state. We then divided the total samples into four subsets based on the position states. The figures in the 2nd row present the PCA results of the four actions, while the figures in the 3rd row present the t-SNE results of the four actions. We observed that the t-SNE embeddings represent the clustering by disability type better than PCA

Sensitivity analysis for STO-CVAE

We conducted a grid search to tune the hyperparameters. We set the learning rate to 0.0001, the batch size to 256, and the dropout ratio to 0.5, and used the Adam optimizer with \({\beta }_{1}=0.9\) and \({\beta }_{2}=0.99\). We then conducted a sensitivity analysis to identify the optimal sampling hyperparameter \(\varepsilon\) for the STO-CVAE. We set the sampling standard deviation \(\varepsilon\) according to the classification accuracy for minority classes with augmented data, because it is important both to lower the loss of the generative model and to generate useful synthetic data for the minority classes. As shown in Fig. 11 and Table 1, we built four STO-CVAE generative models, and all trained STO-CVAEs showed similar convergence in terms of the reconstruction error and KL loss when the same standard deviation was used.

Fig. 11

Results for STO-CVAE training reconstruction loss (first) and KL loss (second) for the four subsets with the sampling standard deviation \({\boldsymbol{\varepsilon}}={\bf 0.05}\)

Table 1 Sensitivity analysis results of STO-CVAE according to the sampling standard deviation \({\boldsymbol{\varepsilon}}\)

Data augmentation with STO-CVAE

Next, two recent upsampling methods were considered to compare the data augmentation performance. The first and second settings were the multiclass (\(M\)) and balanced multiclass (\({\text{BM}}\)) methods, respectively [22]. \(M\) generates data for each class in proportion to that class's share of the original samples. In contrast, \({\text{BM}}\) generates more data for a class the smaller its share of the original samples is, so that minority classes receive more synthetic samples. If the class of the sample to be generated is \(i\) and the number of samples to be generated is \({L}_{i}\), then each method is constructed as follows:

$${\text{Setting }} 1.\ {L}_{i}=m\cdot {N}_{t}\quad {\text{for }} M,$$
$${\text{Setting }} 2.\ {L}_{i}=\left[\frac{1-m}{n-1}\right]\cdot {N}_{t}\quad\text{for BM},$$

where \(m\) is the proportion of class \(i\) in the original data, \(n\) is the number of classes, and \({N}_{t}\) is the total number of generated samples.
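The two settings can be sketched as a short allocation routine. This is a minimal illustration under stated assumptions: the function name `allocate_samples`, the example class counts, and the use of Python's `round` to obtain integer sample counts are choices made for this sketch and are not taken from the original study.

```python
def allocate_samples(class_counts, n_total, method="BM"):
    """Number of synthetic samples L_i per class under setting 1 (M) or 2 (BM).

    class_counts : dict mapping class label -> number of original samples
    n_total      : total number of samples to generate (N_t)
    method       : "M"  -> L_i = m * N_t
                   "BM" -> L_i = (1 - m) / (n - 1) * N_t
    """
    if method not in ("M", "BM"):
        raise ValueError("method must be 'M' or 'BM'")
    total = sum(class_counts.values())
    n = len(class_counts)
    allocation = {}
    for label, count in class_counts.items():
        if method == "M":
            share = count * n_total / total  # m * N_t
        else:
            # (1 - m) / (n - 1) * N_t, written with integer counts
            share = (total - count) * n_total / (total * (n - 1))
        # Rounding to an integer count is an assumption of this sketch.
        allocation[label] = round(share)
    return allocation


# Hypothetical imbalanced counts, for illustration only
example_counts = {"ND": 60, "CP": 30, "CL": 10}
m_allocation = allocate_samples(example_counts, 100, method="M")
bm_allocation = allocate_samples(example_counts, 100, method="BM")
```

Because the \({\text{BM}}\) shares \((1-m)/(n-1)\) sum to one over the \(n\) classes, both settings distribute the same total \({N}_{t}\) (up to rounding); they differ only in whether the majority or the minority classes receive the larger portion.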

The base sample of our dataset had a severe class imbalance problem. Table 2 lists the sample number results per class obtained by applying settings 1 and 2 of the data augmentation methods. In setting 1, the augmented sample of each class was generated at the same ratio according to the class ratio of the base sample using the \(M\) method. Consequently, a large number of majority class samples were generated, whereas a small number of minority class samples were generated, and there were even classes for which no samples were generated. In setting 2, a relatively large number of samples of the minority class were generated using the \({\text{BM}}\) method. As a result, a relatively large ratio of minority class samples was generated, whereas a small ratio of majority class samples was generated.

Table 2 Sample number results per class after applying data augmentation methods: settings 1 and 2

Metrics

We used the following four evaluation indicators to evaluate the classification model’s performance:

accuracy, which is defined as the number of correctly classified data examples divided by the total number of data examples in the dataset;

precision, which is defined for a class as the number of true positives divided by the total number of model predictions that belong to the positive class;

recall, which is defined for a class as the number of true positives divided by the total number of elements labeled as belonging to the positive class;

F1-score, which is defined as the harmonic mean of the precision and recall metrics; the F1-score is high only when both the precision and recall are high.
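The four indicators above can be computed directly from the true and predicted labels. The following is a minimal sketch with hypothetical function names and toy labels; returning 0 when a denominator is zero is a convention chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # A zero denominator is mapped to 0.0 (a convention of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy labels, for illustration only
y_true = ["ND", "ND", "CL", "CL"]
y_pred = ["ND", "CL", "CL", "CL"]
acc = accuracy(y_true, y_pred)
cl_precision, cl_recall, cl_f1 = precision_recall_f1(y_true, y_pred, "CL")
```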

Classifiers

Five classifiers were used to evaluate the improvement in classification performance before and after data augmentation using the proposed method. We conducted hyperparameter tuning again before and after data augmentation because the training dataset was changed by data augmentation. Regarding hyperparameter tuning, cross-validation-based Bayesian optimization was used for all classification models [66].

Hyperparameter tuning for each classifier is described below. The number and depth of trees were adjusted for the random forest algorithm. The floating point, kernel coefficient, and kernel type were adjusted for the SVM algorithm. For XGBoost, the maximum depth, the minimum sum of the instance weights required for the child, sub-sample ratio, and learning rate were adjusted. For LightGBM, the maximum number of leaves of the tree, depth of the tree, learning rate, and normalization conditions were adjusted. For TabNet, hyperparameters such as the dimension of the prediction layer, dimension of the attention layer, number of decision steps, number of shared GLU layers of the feature transformer, and number of unshared GLU layers were adjusted.

For model performance evaluation, we trained all classifiers with tenfold cross-validation (CV). In training, the performance was evaluated by an average of ten accuracy values and ten F1-scores from ten validation sets. In addition, the generalization performance of the classifiers was estimated by the standard deviations of F1-score and accuracy.
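The evaluation protocol above can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the folds are built without shuffling or stratification, and the population standard deviation is used, none of which is specified in the text.

```python
import statistics


def kfold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k folds.

    Folds are formed by striding over the sample indices; no shuffling
    or stratification is applied in this sketch.
    """
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def summarize_cv(scores):
    """Mean and standard deviation of per-fold scores.

    The mean summarizes performance; the standard deviation indicates
    how stable the classifier is across validation sets.
    """
    return statistics.mean(scores), statistics.pstdev(scores)
```

In use, a classifier would be trained on each `train` split and scored on the matching `validation` split, and the ten accuracy values and ten F1-scores would each be passed to `summarize_cv`.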

Comparison of data augmentation methods

We conducted comparative studies with other generative models to evaluate the data generation performance of STO-CVAE. The compared models included VAE, CVAE, and CTGAN. The number of augmented samples per class was determined using BM sampling for all generative models. Table 3 demonstrates significant performance improvement across five classifiers when data augmentation was performed using STO-CVAE.

Table 3 F1-score results of classification models before and after data augmentation methods

Augmented action-centric sample using STO-CVAE

To confirm the quality of the synthetic samples in terms of statistical characteristics, we conducted a Wilcoxon test to evaluate whether the rank of the population mean differed between the original and synthetic samples [73]. Based on Table 4, we concluded that the differences between the original and synthetic samples were not statistically significant. In the Wilcoxon tests, a p value \(\ge 0.01\) indicated that the synthetic samples generated by STO-CVAE were not significantly different from the original samples in the minority classes (CL, ID, and ASD). Additionally, STO-CVAE, a generative model that accounts for z-axis uncertainty, is robust in estimating the z-axis keypoint distribution.

Table 4 Results of the Wilcoxon test verifying the significance of the synthetic samples generated by STO-CVAE

The proposed STO-CVAE can generate action-centric samples. Figure 12 shows synthetic samples for the right wrist \(x\)-values and left index \(y\)-values among the 99 keypoint values. STO-CVAE generated samples according to the distribution of each state transition per class. For the left index \(y\)-value of class CL, there was only one original sample in position state 1 (left); thus, the synthetic samples were generated from that single sample using the \({\text{BM}}\) method. These varied samples improved classification performance.

Fig. 12

Right wrist \({\varvec{x}}\)-values and left index \({\varvec{y}}\)-values of augmented samples through STO-CVAE

Figure 13 shows the distribution of the average \(y\)-axis values of the left and right wrists for both real-world and augmented samples. The distribution is unimodal, with high density on the left side next to the hip. Moreover, as shown in Fig. 11, both the actual and augmented samples demonstrate that the distribution of the same action differs according to the disability type. We verified the distribution similarity for all seven disability classes and all actions.

Fig. 13

(Left) position state 0 of cerebral lesion (CL) class. (Right) Distribution of the average \(y\) values of both wrists moved by adding a synthetic sample augmented through STO-CVAE in position state 0

Classification results after data augmentation with STO-CVAE

Table 5 presents the training results for each classification model before and after data augmentation using STO-CVAE. In training, all classifiers trained with augmentation outperformed the models trained without augmentation in terms of the F1-score and accuracy. A comparison of the two sampling methods showed that the synthetic samples generated under setting 1 (\(M\)) and setting 2 (\({\text{BM}}\)) led to similar overall performance in terms of the F1-score and accuracy.

Table 5 Training results for classification models before and after data augmentation methods setting 1 (\(M\)) and setting 2 (\({\text{BM}}\))

Table 6 demonstrates that the augmented samples improve all evaluation indicators, including the F1-score, for all sampling methods. SVM shows the largest F1-score improvement, from 0.25 to 0.571. TabNet did not have the highest F1-score before data augmentation, but it had the highest F1-score among all classifiers after data augmentation (setting 2). Figures 14, 15, 16, 17 and 18 show the confusion matrices of the five classifiers (random forest, SVM, XGBoost, LightGBM, and TabNet) before and after data augmentation. Overall, the prediction accuracy of most classes was improved; in particular, the prediction accuracy of the minority classes was greatly improved.

Table 6 Results for classification models before and after employing the data augmentation sampling methods
Fig. 14

Confusion matrix of random forest before and after data augmentation through STO-CVAE with \({\text{BM}}\) sampling

Fig. 15

Confusion matrix of SVM before and after data augmentation through STO-CVAE with \({\text{BM}}\) sampling

Fig. 16

Confusion matrix of XGBoost before and after data augmentation through STO-CVAE with \({\text{BM}}\) sampling

In Fig. 14, the accuracy of the CL and ID classes was 0 before data augmentation. After data augmentation, it increased to 0.61 and 0.23, respectively. The accuracy of the minority classes improved notably, particularly for ASD, which increased from 0.23 to 0.83. As can be seen in Fig. 16, XGBoost showed a slight overall improvement in accuracy for all classes compared to random forest. In addition, Figs. 15, 17 and 18 show that, for the SVM, LightGBM, and TabNet models, data augmentation improved the accuracy of the minority classes more than that of the majority class. Naturally, it is quite challenging to improve accuracy by relying solely on the addition of the synthetic samples proposed in this study. The main focus of this study is to generate robust synthetic samples from the small number of observed samples in the minority classes. Importantly, the proposed approach consistently demonstrated improved accuracy across all combinations of experiments.

Fig. 17

Confusion matrix of LightGBM before and after data augmentation through STO-CVAE with \({\text{BM}}\) sampling

Fig. 18

Confusion matrix of TabNet before and after data augmentation through STO-CVAE with \({\text{BM}}\) sampling

Conclusion

We proposed a data augmentation method called STO-CVAE for disability classification based on HPE keypoints. Our model uses state transitions to generate synthetic data to alleviate the class imbalance problem. In this method, we transform multiple frames that include human skeleton keypoints into an action according to the state transition. The sampling based on the proposed state transitions reduces the side effects of the uncertainty of the keypoint estimation in HPE and increases the learning efficiency of the generative model. Through this transformation, we avoided using a complex backbone for representation learning. Further, we examined several state-of-the-art data sampling approaches for the action-oriented samples by varying the ratios for each class and demonstrated the effectiveness of data augmentation using comparative experiments.

Regarding the implications of this research, the proposed STO-CVAE can be used to improve the accuracy of disability classification. An accurate disability classifier can help in providing suitable personal training and rehabilitation exercise programs. We expect that a fully customized AI trainer based on our approach can guide and recommend optimized exercises to individuals with disabilities. Furthermore, it can quickly recognize emergency stop situations according to the disability type during exercise. For example, a significant increase in heart rate may be dangerous for some people with a specific disability type.

In future work, we plan to develop a model that learns keypoints by reflecting the structure of the human body. This is necessary because when a person exercises, the direction of movement is different for each keypoint, and keypoints in the same area have a high probability of moving in the same direction. In addition, we will extend our method from a single exercise to all rehabilitation exercises. In this study, the heuristic data transformation was limited to a specific exercise. Therefore, a state transition based on the keypoint distribution of repetitive actions can further increase the model scalability.