1 Introduction

Physical rehabilitation exercises are crucial for individuals to recover body function and capabilities following an injury, illness, or surgery. The exercises are critical components of recovery programs intended to help patients regain their lost physical abilities, such as flexibility and strength, and to facilitate a safe return to their normal daily activities. This research explores how computer vision and the latest deep-learning techniques can be used to evaluate physical rehabilitation exercises. The findings will help promote healing, prevent future injuries, and improve overall health and well-being.

Global health trends reported by the World Health Organization (WHO) underscore the importance of physical therapy research [1]. An estimated 2.4 billion individuals worldwide currently require rehabilitation for their health conditions. Notably, over 50% of individuals who need essential rehabilitation services lack access to them. This lack of access is worsened by insufficient funding and undervaluation of rehabilitation services, particularly in countries with inadequate healthcare systems. Low- and middle-income countries have fewer than ten skilled rehabilitation practitioners per 1 million people, which is critically insufficient. Furthermore, with the number of people aged over 60 projected to double by 2050, there is an urgent need to address the increasing demand for rehabilitation services, particularly in developing countries where the strain on healthcare systems is becoming more pronounced. This context highlights the importance of advancing physical therapy research to fill gaps and improve the effectiveness of rehabilitation services globally.

The integration of digital technologies and Artificial Intelligence (AI) into home-based rehabilitation, particularly via exercise classification, showcases its groundbreaking potential. The application of auxiliary systems and AI is critical for tailoring exercises and monitoring progress in shoulder rehabilitation treatments [2, 3]. The incorporation of physical activity into rehabilitation routines, as Loellgen et al. [4] proposed, emphasizes the critical role of structured exercises in improving recovery outcomes.

Furthermore, innovative strategies like motor imagery and action observation [5], coupled with the introduction of digital tools in orthopedic recovery [6], and the adoption of virtual reality and digital biofeedback systems for post-operative rehabilitation [7, 8], collectively signal a move towards more personalized and accessible rehabilitation options. These advancements not only confirm the utility of exercise classification in home-based rehabilitation systems but also underscore a transition towards solutions that prioritize patient-centric care, efficiency, and accessibility. Additionally, these technological advancements align with the Sustainable Development Goals (SDGs), particularly SDG 3, which seeks to ensure health and well-being for all at every age. This approach not only improves patient outcomes but also aligns with the World Health Organization’s (WHO) goals for universal health coverage by enhancing the accessibility and efficiency of rehabilitation services.

Recently, there has been a growing emphasis on Human Activity Recognition (HAR) research, particularly within physical rehabilitation. For instance, in [9], the authors present a comprehensive review of Indoor-HAR, an emerging technology leveraging cameras and vision-based sensors to assess human actions within indoor settings. The review encompasses contemporary approaches, existing challenges, practical implementations, dataset resources, and a proposed taxonomy for Indoor-HAR. The study also highlights the potential of this technology in various industries, such as healthcare, surveillance, and human-computer interaction. It has multiple applications in smart homes, elderly care, and assisted living. Debnath et al., [10] investigated different computer vision approaches for physical rehabilitation and assessment. This has facilitated the creation of novel procedures that aid in the monitoring and evaluation of physical rehabilitation exercises, providing benefits to patients receiving rehabilitation for injuries or disabilities. Recent advancements in deep learning algorithms have enhanced the accuracy and efficiency of HAR systems. The current and prospective applications of Artificial Intelligence (AI) in physiotherapy and rehabilitation for advancing medical digitization are explored in [11]. On the other hand, [12] examines the application of a subset of machine learning approaches, namely, Reinforcement Learning. Virtual Reality (VR) approaches are also applied in this domain for physiotherapists and physicians [13]. VR approaches construct an intelligent space using multi-camera systems to capture gait data, thus creating a 3D virtual environment. Several studies in this discipline are skeleton-based [14] or pose-guided [15].

This study focuses on Pose-guided Activity Recognition, which involves the analysis of human body poses to understand and classify various activities, uncovering their potential benefits for patients and healthcare providers. In Fig. 1, circles depict body joints and lines represent intra-body edges, capturing the natural connections in the human body. This visual representation, reminiscent of Kinect pose frames, illustrates the temporal sequence of human poses. Repeating the pose in each frame allows activities to be recognized and analyzed from the changing body positions. This approach is a crucial aspect of the broader field of human activity recognition.

Fig. 1 Skeleton-based activity recognition. Circles represent body joints, and lines define intra-body edges based on natural connections in the human body. The duplicated poses represent the time sequence across frames captured over time

Popular deep learning algorithms have been applied to Human Activity Recognition (HAR). Numerous studies have used Long Short-Term Memory (LSTM) networks for HAR [16,17,18,19]. Bidirectional LSTM (BiLSTM) has also been used for HAR with solid performance [20,21,22]. Significant contributions based on Convolutional Neural Network (CNN) architectures can be found in [23,24,25], where CNN models proved effective at identifying and classifying human activities. Wang et al. [26] and Yadav et al. [27] applied CNN-LSTM to HAR and reported superior performance. Other deep learning techniques, such as transfer learning, have also been used. Transfer learning is a machine learning technique in which knowledge acquired from one dataset or task is applied to another closely related dataset or task. It typically entails fine-tuning a pre-trained model on the new dataset or task and is frequently used when the new task has limited data or computational resources, making it difficult to train a new model from scratch. Numerous researchers have applied transfer learning in human activity recognition, particularly for rehabilitation exercises [14, 15, 28].

Additionally, transform learning, a feature extraction or dimensionality reduction technique that converts high-dimensional data into a lower-dimensional space, has been used in this domain. Transformer models are useful for reducing the computational complexity of working with high-dimensional data and for extracting significant features for a specific task. For example, Réby et al. applied transform learning in this domain using a Graph Transformer approach [29].

Real-time healthcare applications pose a significant challenge because they demand substantial computational resources, a concern that has received increasing attention in recent healthcare application research. Work addressing this challenge has reported a 55% reduction in delay, a 72% improvement in response time, and a 70% reduction in energy consumption [30].

Our proposed model can potentially be deployed within a mobile application, allowing patients to perform rehabilitative exercises at home while a Kinect camera captures the relevant data and feeds it to the model. The model determines the accuracy of the exercise execution, a pivotal aspect for expediting the recovery process. This at-home exercise monitoring removes the need for physical visits to physiotherapy clinics. This paradigm has gained heightened significance during the ongoing global pandemic, where restrictions on outdoor activities are prevalent. It prioritizes patient safety and alleviates the strain on physiotherapists, a profession already facing a global shortage of practitioners.

Video-based techniques for feature extraction are resource-intensive [31,32,33], while manual feature extraction from videos for 2D image conversion is less demanding but often less robust [34]. Our methodology modifies the 1D approach from [35], integrating a novel feature extraction method that balances performance with resource efficiency. This hybrid strategy is particularly cost-effective for at-home rehabilitation, enabling deployment on low-power devices like mobile apps for real-time patient feedback.

The main contributions of this paper are:

  1. Feature Engineering: The study introduces a novel approach to represent exercises as 1D vectors through comprehensive feature engineering, employing various statistical techniques.

  2. Comparative Study: A detailed comparative analysis is conducted, assessing the performance of four distinct models (LSTM, Bi-LSTM, CNN, and CNN-LSTM) across two diverse datasets. This comparative study provides insights into the strengths and weaknesses of each model in the context of exercise classification.

  3. Disease Classification: The models developed in the study are further utilized for disease classification based on five specific exercises from the KIMORE dataset. This application demonstrates the proposed models’ versatility and potential clinical relevance beyond exercise classification.

This manuscript is structured into eight sections: Introduction, Related Work, Dataset, Methodology, Experiments, Results, Discussion, Conclusion and Future Work.

2 Literature review

This section is organized into four subsections: Deep Learning-Based, Ensemble-Based, Transformer-Based, and Transfer-Based Approaches. Following the approach taken by Akkem et al., [36], a comparative analysis of specific methods within these approaches has been conducted. Table 1 provides a comprehensive overview of various models applied in HAR along with their associated evaluation metrics, results, datasets, and identified limitations. One research gap evident from the table is the common challenge of real-time applicability, particularly in resource-constrained environments. Several models, such as Graph Convolutional Siamese Network [37], Ensemble CNN-RNN [38], Ensemble-based Graph Convolutional Network [39], and Graph Attention Network [40], exhibit high computational costs, potentially limiting their suitability for real-time deployment on devices with constrained resources. Additionally, some models, including CNN-LSTM [34], LSTM-1D CNN [35], 3D CNN [41], and VGG-16 [28], highlight the need for further optimization to improve accuracy, indicating an avenue for future research in model refinement.

Table 1 Limitations and state-of-the-art status

The state-of-the-art models in human activity recognition and physical rehabilitation, such as CNN-LSTM, Graph Convolutional Siamese Networks, LSTM-1D CNN, 3D CNN, Ensemble CNN-RNN, Ensemble-based Graph Convolutional Network, Graph Attention Network, and VGG-16, demonstrate significant advancements in accuracy and performance across various datasets. However, they also reveal critical limitations concerning generalizability, computational efficiency, and scalability. These challenges highlight the ongoing need for optimization to balance model complexity with practical applicability, ensuring that advanced deep learning models can be deployed effectively in real-world settings without being hindered by computational constraints or loss of relevant information due to preprocessing choices.

Another noteworthy research gap involves transforming time-series data into 2D images or RGB formats, as Boukhennoufa et al., [28] proposed. This process introduces feature engineering choices that may encode biases or lead to information loss, impacting the model’s generalizability. Additionally, it is observed that several studies lack experimentation on multiple datasets, limiting the generalizability of their findings. Addressing these gaps can contribute to developing more efficient and accurate HAR models with improved real-time capabilities and enhanced generalizability.

Our methodology aims to mitigate the limitations outlined in Table 1. Model architectures were refined through a random search process, prioritizing accuracy to ensure broad generalization. The models were tested on multiple datasets using a 5-fold cross-validation strategy to validate their performance. Leveraging deep learning models was instrumental in addressing the inference-time challenges associated with attention-based and ensemble models. We opted for a 1D approach over 2D, sidestepping the complexities associated with image representation choices and with 2D models.
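As a minimal sketch of how random search can be combined with 5-fold cross-validation, the snippet below samples configurations at random and scores each one by its mean validation score over five folds. The search space, the `toy_evaluate` scorer, and all function names are hypothetical placeholders rather than the settings used in this study; a real scorer would train a model on the training indices and return its validation accuracy. Whether the folds in the study were shuffled or stratified is not stated, so plain shuffled folds are assumed.

```python
import random
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def random_search(search_space, evaluate, n_samples, n_trials=10, k=5, seed=0):
    """Sample configurations at random and keep the best mean fold score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        config = {name: rng.choice(values) for name, values in search_space.items()}
        scores = [evaluate(config, tr, va)
                  for tr, va in k_fold_indices(n_samples, k, seed)]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score

# Hypothetical search space (not the one used in the paper).
SPACE = {"units": [32, 64, 128], "dropout": [0.1, 0.3, 0.5], "lr": [1e-2, 1e-3]}

def toy_evaluate(config, train_idx, val_idx):
    # Stand-in for "train on train_idx, return accuracy on val_idx".
    return 1.0 - abs(config["dropout"] - 0.3) - abs(config["units"] - 64) / 1000

best, score = random_search(SPACE, toy_evaluate, n_samples=100, n_trials=20)
print(best, round(score, 3))
```

Reusing the same fold split for every sampled configuration keeps the comparison between configurations fair, since each one is scored on identical validation sets.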

2.1 Deep learning-based approaches

Kumar et al. [38] introduced Deep-HAR, a model that fuses Convolutional Neural Networks (CNNs) to extract features with Recurrent Neural Networks (RNNs) to capture temporal patterns within sequential data. Their evaluation encompassed three publicly accessible datasets (WISDM [44], PAMAP2 [45], and KU-HAR [46]) and showed strong performance across all activity types: the model achieved accuracies of 99.98% for simple activities, 99.64% for complex activities, and 99.98% for heterogeneous activities. Similarly, our study employs various CNN and RNN models to address a specific healthcare concern, rehabilitation, and evaluates their efficacy on two dedicated rehabilitation datasets.

Tasmin et al., [48] focused on complex and diverse human activities rather than simple activities like walking or sitting. Using skeletal data, they introduced a Dynamic Edge Convolutional Neural Network (DECNN). DECNN operates on a graph representation of skeletons, treating joints as nodes and modeling their connections as edges. The authors evaluated their approach on UTD-MHAD [49] and MSR-Action3D [50], achieving high accuracy.

The integration of a CNN-LSTM hybrid model is a prevalent approach in both human activity recognition [26, 27, 34, 51, 52, 53] and disease classification [54]. Lakhan et al. employed a CNN-LSTM architecture in proposing a framework for detecting Autism Spectrum Disorder. Liao et al. [34] also introduced a deep learning architecture that combines CNN and LSTM networks to evaluate physical rehabilitation exercises. Their framework classifies exercises in the UI-PRMD dataset with low absolute deviation per exercise, demonstrating its practicality.

2.2 Ensemble-based approaches

Ensemble learning, as investigated by Zhang et al. [55] extensively, combines data fusion, modeling, and mining into a unified framework. Effective ensemble methods carefully combine members to enhance performance, avoiding random fusion issues. These methods are categorized (in classification tasks) into data-level, feature-level, decision-level, and model-level approaches. Chenguang et al., [56] utilize ensemble learning for hand function assessment, while Chihiro et al., [57] apply it for predicting functional outcomes after spinal cord injury. Additionally, Wenchuan et al., [58] leverage ensemble learning for personalized remote training in Parkinson’s disease patients. In the realm of skeleton-based rehabilitation exercises, EGCN, an ensemble-based framework by Yu et al., [39], demonstrates robust performance on both UI-PRMD and KIMORE datasets.

2.3 Transformer-based approaches

In [59], the authors present a Spatial-Temporal Graph Convolutional Network (ST-GCN) model for dynamic skeletons. ST-GCN is designed to autonomously learn spatial and temporal patterns from skeleton data to enhance human action recognition. The authors employ graph convolutional networks (GCNs) with spatial and temporal convolution operations to achieve this. These operations capture the dynamic aspects of human body skeletons, represented as spatial and temporal graphs. These graphs are constructed based on the inherent connections between human body parts and the temporal progression of skeleton sequences. Furthermore, the authors introduce a partitioning strategy to segment the spatial graph into distinct components, aligning with the physical significance and functions of different body parts. The effectiveness of the proposed approach is extensively evaluated on two large datasets, Kinetics and NTU-RGBD, resulting in state-of-the-art performance. Notably, the method achieves an accuracy of 81.5% on NTU-RGBD (CV) and 30.7% on Kinetics-Skeleton. The authors conclude that this approach can be readily extended to other tasks involving skeleton data, including gesture recognition, pose estimation, and human-object interaction.

Rahevar et al. proposed a Spatial-Temporal Dynamic Graph Attention Network (ST-DGAT) [40] for extracting valuable features from human body skeleton data. In contrast to previous methods like Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), which struggled with capturing long-range joint dependencies, ST-DGAT introduces dynamic graph attention. By reordering the weighted vector operations in GAT, ST-DGAT significantly enhances the attention mechanism, allowing it to learn spatial-temporal patterns from skeleton sequences effectively. The model outperforms existing approaches and achieves state-of-the-art accuracy on large-scale datasets, including NTU60, NTU120, and Kinetics-400, with accuracy levels of 96.4%, 88.2%, and 61.0%, respectively.

Reby et al. [29] proposed a graph transformer model for assessing the effectiveness of physical rehabilitation on the UI-PRMD dataset [42]. Their model combines self-attention mechanisms with spatiotemporal graph networks, achieving state-of-the-art binary classification and quality score prediction results. It should be emphasized that these outcomes are specific to a single dataset, and the performance of their model (G2PRE) may differ on other datasets.

Peng et al. [14] proposed a personalized rehabilitation training recognition framework using transfer learning based on human skeleton data. It employs pre-trained CNN models for feature extraction and hierarchical classification for exercise recognition. The approach achieved significant accuracy improvements for personalized rehabilitation training.

Generally, Transformers are capable of modeling long-range dependencies in sequential data. However, they incur significant computational costs due to self-attention mechanisms, which can hinder their real-time applicability on the resource-constrained devices often used for HAR. CNNs, in contrast, offer efficient spatial feature extraction, making them well-suited for HAR tasks with fixed-size sensor inputs. However, their limitations in handling long temporal dependencies might necessitate careful design and segmentation strategies. RNNs, specifically LSTMs, provide a balance between sequential modeling and computational efficiency, but their performance can still be impacted by sequence length and model complexity.

2.4 Transfer-based approaches

Boukhennoufa et al. [28] introduced a strategy to enhance activity recognition accuracy in post-stroke rehabilitation assessments, incorporating the Geometric Mean of Absolute and Relative Frequencies (GMAF) [60]. The proposed methodology encompasses two primary phases: feature extraction and classification. In the feature extraction stage, GMAF is applied to extract features from raw accelerometer data. The classification phase then experiments with three different approaches: 1D CNN, 2D CNN, and a transfer learning approach using VGG-16 [61]. Implementing a basic 1D CNN classifier on these images raises test accuracy from 94% with traditional segmentation to 97.06%. Additionally, transforming the 2D images into RGB format and using a 2D CNN classifier increases accuracy further to 97.52%.

However, certain limitations accompany this approach. It demands additional preprocessing to convert data into images. Although the 1D CNN achieves satisfactory results, it requires significantly fewer computational resources than VGG models. Using transfer learning with pre-trained models such as VGG-16, which were designed for different tasks like image recognition, may not fully capture the specific features and dynamics of post-stroke activity data. This potential domain mismatch could limit the benefits derived from transfer learning.

In summary, the choice of deep learning architecture for HAR necessitates a delicate balance between performance and computational constraints. Transformers, adept at capturing long-range dependencies in sequential data, come with significant computational costs attributed to self-attention mechanisms. These costs can impede their real-time applicability on resource-constrained devices commonly utilized for HAR. On the other hand, CNNs excel in efficient spatial feature extraction, rendering them suitable for HAR tasks with fixed-size sensor inputs. However, their limitations in handling long temporal dependencies may require meticulous design and segmentation strategies. RNNs, particularly LSTMs, balance sequential modeling and computational efficiency, but their performance may still be influenced by sequence length and model complexity. Ensemble models, while offering a potential solution to these challenges, come at the expense of increased inference time because at least two models are run for the same classification task. Transfer learning, although promising, can face obstacles due to domain dissimilarity, necessitating task-specific fine-tuning. In conclusion, selecting the optimal HAR architecture demands a comprehensive understanding of the trade-offs between accuracy, computational resources, and the temporal characteristics of sensor data.

3 Datasets characteristics

This study employs two separate datasets to validate our action recognition methodology: the UI-PRMD dataset [42] and the KIMORE dataset [43].

3.1 UI-PRMD

The UI-PRMD dataset, introduced by Vakanski et al. [42], is a comprehensive resource dedicated to a physical rehabilitation experiment. Ten healthy individuals each performed ten repetitions of ten rehabilitation exercises, using both correct and incorrect technique for each exercise, resulting in 20 classes and 2000 records. Two motion-capture systems, a Vicon optical tracker and a Kinect camera, were used throughout the study. Including both techniques enhances the dataset’s utility for training and evaluating models tailored to physical rehabilitation. By providing insights into correct form and identifying common errors, the UI-PRMD dataset is crucial in advancing research in human activity recognition and rehabilitation analysis. Within this paper, the classification of correct and erroneous techniques is of paramount importance, aligning with the primary objective of facilitating accurate assessments during at-home exercises.

The investigation specifically targeted ten distinct exercises: Deep Squat, Side Lunge, Standing Shoulder Abduction, Hurdle Step, Standing Shoulder Scaption, Standing Shoulder Extension, Inline Lunge, Sit Stand, Standing Active Straight Leg Raise, and Standing Shoulder Internal-External Rotation. Notably, each exercise was executed with correct and incorrect techniques, providing a nuanced dataset for analysis. The complete listing of the ten exercise classes can be found in Table 2.

Table 2 Compilation of exercises and corresponding labels in the UI-PRMD Dataset

The UI-PRMD dataset records data in the YXZ coordinate system, where Y denotes height, X represents width, and Z indicates depth. Various body parts, including Chest, Waist, Head, Left Collar, Right Collar, Left Upper Arm, Right Upper Arm, Left Forearm, Right Forearm, Head tip, Neck, Left Upper Leg, Right Upper Leg, Left Hand, Right Hand, Left Foot, Right Foot, Spine, Left Leg Toes, and Right Leg Toes, were utilized to extract joint information and corresponding angles from the sensor data.

3.2 KIMORE

Capecci et al. [43] conducted a study titled "Kinematic Analysis of Movements and Clinical Assessments for Remote Supervision of Physical Rehabilitation (KIMORE)" at the University of Pisa, Italy. Employing RGB-D sensors, specifically the Kinect v2, the study recorded RGB and depth videos and skeletal joint positions during five targeted exercises addressing lower back pain. Data collection occurred in a controlled environment using Kinect cameras and involved a diverse set of participants, including both healthy individuals and patients with various motor dysfunctions. This diversity is critical for developing algorithms that generalize robustly across different clinical populations. The dataset comprises 78 participants: 44 individuals without identified health issues (17 Expert Professionals and 27 Not Experts) and 34 participants with motor dysfunctions, representing common neurological and musculoskeletal conditions such as stroke, Parkinson’s disease, or low back pain. This diverse representation ensures the dataset’s applicability to various rehabilitation settings.

The primary purpose of the dataset is exercise classification, with an additional experiment demonstrating the feasibility of disease classification. The five exercises in the dataset are listed in Table 3.

Table 3 Compilation of exercises and corresponding labels in the KIMORE Dataset

Similarly to Reby et al. [29], we prioritize skeleton-based methods because of their resilience to variations in body scale, motion speed, camera perspective, and environmental interference compared to RGB image-based methods. In line with this preference, we primarily employ the Kinect-extracted data, which performs better than directly extracting data from individual joints, as evidenced by previous research [35]. Furthermore, the dataset includes a set of meticulously defined features for each exercise, which medical experts established to characterize the scope of each exercise. These features, validated against a stereophotogrammetric system, can be analyzed to compute a performance score for each participant. The dataset is structured into three sub-folders: Raw data, Script, and Label. The participant cohort is further categorized into two main macro-groups: the Pain and Posture disorders Group (GPP) and the Control Group (CG). The CG is subdivided into two subgroups, namely CG with expertise in physiotherapy exercises (CG-E) and CG without such expertise (CG-NE). Similarly, the GPP is divided into three subgroups based on specific diagnoses, namely Parkinson’s disease, Stroke, and Low Back Pain.

4 Methodology

Figure 2 illustrates a system overview, capturing the entire workflow from data acquisition to deployment. The diagram depicts the key stages: preprocessing, hyperparameter tuning, model training, metric-based evaluation, visualization of results, and model deployment. Each stage is explained in its corresponding section. Section 4.1 covers data acquisition and alternative sensors, and Section 4.2 covers data preparation and preprocessing. Section 4.3 details hyperparameter tuning, Section 4.4 describes the classification models, and Section 4.5 explains the deployment choices that were considered.

Fig. 2 Overview of the methodology

4.1 Data input

The patient performs the exercise in front of the sensor device, either a Kinect or a conventional RGB camera. Our investigation primarily focuses on skeletal data acquired through Kinect camera technology; however, our model does not rely exclusively on Kinect. Because our models are built on joint coordinates, an RGB camera can also be employed, although additional processing is needed to extract the skeletal joints. Two viable approaches for using an RGB camera as the primary sensor are OpenPose [62] and BlazePose [63], both widely adopted solutions for extracting skeletal joints from the human body. BlazePose offers a significantly faster runtime than OpenPose [64], making it suitable for diverse applications such as movement pre-screening and activity classification. BlazePose is a lightweight and efficient CNN model that excels in real-time pose estimation by predicting the 3D pose of an individual from a single image or video frame. The model employs CNNs to extract features from input images, followed by regression layers to predict the location of body key points. MediaPipe [65], a framework based on the BlazePose model, is in turn better suited for constructing real-time machine-learning pipelines.

Despite both MediaPipe and Kinect providing joint coordinates in three dimensions (XYZ), there exists a disparity in the number of joints. MediaPipe extracts 33 joints, while the Kinect’s joint count varies depending on the version. For instance, the UI-PRMD dataset utilizes a Kinect, resulting in 22 joints, whereas KIMORE employs a Kinect-v2, yielding 25 joints. Consequently, slight modifications in the input shape are necessary depending on the specific sensor utilized. Subsequently, these joint data are forwarded to the next phase for preprocessing. Figure 2 visualizes the data acquisition phase.
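To make the effect of the sensor on the input shape concrete, the following sketch computes the number of values in one skeleton frame from the joint counts stated above, with three coordinates per joint. The dictionary keys and the `frame_width` helper are illustrative names, not identifiers from the study's code.

```python
# Joint counts as stated in the text; three coordinates (X, Y, Z) per joint.
JOINTS_PER_SENSOR = {
    "ui_prmd_kinect": 22,     # Kinect recordings in UI-PRMD
    "kimore_kinect_v2": 25,   # Kinect v2 recordings in KIMORE
    "mediapipe": 33,          # landmarks extracted by MediaPipe
}
COORDS_PER_JOINT = 3

def frame_width(sensor: str) -> int:
    """Number of coordinate values in one skeleton frame for the given sensor."""
    try:
        return JOINTS_PER_SENSOR[sensor] * COORDS_PER_JOINT
    except KeyError:
        raise ValueError(f"unknown sensor: {sensor!r}") from None

for name in JOINTS_PER_SENSOR:
    print(name, frame_width(name))
# ui_prmd_kinect 66
# kimore_kinect_v2 75
# mediapipe 99
```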

4.2 Data preparation & preprocessing

Data processing played a vital role in this phase. Each dataset’s unique structure necessitated different preparatory procedures due to variations in format, features (especially the number of joints), and data volume between the two sets. Nevertheless, uniform preprocessing techniques were applied to ensure unbiased comparisons across all algorithms. The pre-processing phase is visualized in Fig. 2.

In the case of the UI-PRMD dataset, the Kinect camera records the data of 22 body joints, storing this information in a vector denoted as V. At each time instance t, each joint \(J_n\) is represented by three-dimensional coordinates: \(X_t\), \(Y_t\), and \(Z_t\). Feature extraction techniques are then applied to each joint coordinate. These techniques include mean, median, minimum, maximum, and standard deviation. Applying them yields 330 features (22 body joints x 3 coordinates x 5 statistical techniques for each joint coordinate).

The original KIMORE dataset comprises XYZ coordinates for 25 distinct body joints, each accompanied by a confidence score. This score quantifies the degree of confidence in the 3D joint position as recorded by the Kinect sensor, with values ranging from 0 (low confidence) to 1 (high confidence). Consequently, the dataset contains a total of 100 features. Salwa et al. [66] surveyed the most common feature extraction techniques used in human activity recognition. Instead of using a single feature extraction method, we applied several statistical measures (mean, median, minimum, maximum, and standard deviation) to each of the 100 features. This process resulted in an expanded total of 500 features [25 body joints x 4 (3 coordinates + 1 confidence score) x 5 statistical techniques]. Unlike the UI-PRMD dataset, the KIMORE dataset does not exhibit a balanced class distribution. Nevertheless, the disparities between class distributions are marginal, making oversampling or down-sampling unnecessary for the exercise classification experiments. The feature vector (V) is formulated in Equation 1, where n represents the number of body joints (25 for the KIMORE dataset and 22 for the UI-PRMD dataset), i denotes the joint index, and x, y, and z correspond to the horizontal, vertical, and depth coordinates, respectively. For the KIMORE dataset, an additional value per joint was incorporated as a confidence score.

$$\begin{aligned} \text {Feature Vector}\, (V) ={} & \sum _{i=1}^{n} \left( \text {median}(j_{i_x}) + \text {median}(j_{i_y}) + \text {median}(j_{i_z}) \right) \\ & + \sum _{i=1}^{n} \left( \text {mean}(j_{i_x}) + \text {mean}(j_{i_y}) + \text {mean}(j_{i_z}) \right) \\ & + \sum _{i=1}^{n} \left( \max (j_{i_x}) + \max (j_{i_y}) + \max (j_{i_z}) \right) \\ & + \sum _{i=1}^{n} \left( \min (j_{i_x}) + \min (j_{i_y}) + \min (j_{i_z}) \right) \\ & + \sum _{i=1}^{n} \left( \text {std}(j_{i_x}) + \text {std}(j_{i_y}) + \text {std}(j_{i_z}) \right) \end{aligned}$$
(1)
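The statistical feature extraction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes each recording is an array of shape (frames, joints, channels), and it interprets the "+" in Equation 1 as concatenation of the per-coordinate statistics, which is what the stated totals of 330 and 500 features imply. The function name is hypothetical.

```python
import numpy as np


def extract_features(sequence):
    """Collapse a (frames, joints, channels) recording into one feature vector.

    Five statistics are computed over time for every joint channel and
    concatenated in the order used in Equation 1: median, mean, max, min, std.
    """
    seq = np.asarray(sequence, dtype=float)
    flat = seq.reshape(seq.shape[0], -1)  # (frames, joints * channels)
    stats = [
        np.median(flat, axis=0),
        np.mean(flat, axis=0),
        np.max(flat, axis=0),
        np.min(flat, axis=0),
        np.std(flat, axis=0),  # population standard deviation (assumption)
    ]
    return np.concatenate(stats)
```

With 22 joints and 3 coordinates the result has 22 x 3 x 5 = 330 entries; with 25 joints and 4 channels (XYZ plus confidence) it has 25 x 4 x 5 = 500 entries.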

For the disease classification experiment (the experiment’s objectives and constraints are discussed in Section 5.2), we used diseases rather than the five exercises as class labels. We combined the ’Expert’ and ’Not Expert’ classes into one class, ’Normal,’ representing healthy individuals. The remaining classes are Parkinson’s disease, Stroke, and Back Pain. The dataset for disease classification exhibited a significant class imbalance, with notable variations in sample sizes among the classes. To address this imbalance, we employed the Synthetic Minority Over-sampling Technique (SMOTE) [67], a data augmentation method specifically designed to mitigate the challenges associated with class imbalance.

The core principle of SMOTE involves generating synthetic instances within the minority class by interpolating between existing data points. This augmentation technique effectively counteracts skewed class distributions, improving model performance, especially in scenarios characterized by substantial class imbalances. A visual representation of the class distribution before and after applying SMOTE can be seen in Fig. 3.

Fig. 3

Distribution of the five disease records in the KIMORE dataset before and after applying SMOTE
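The interpolation idea behind SMOTE can be illustrated with a short NumPy sketch. This is a simplified stand-in written for illustration, not the implementation used in the study: the function name, the default k, and the brute-force neighbour search are assumptions. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would typically be used.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(X) - 1)

    # Brute-force pairwise distances; a sample is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k nearest neighbours, and a random gap.
    base_idx = rng.integers(0, len(X), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))

    # Each synthetic point lies on the segment between base and neighbour.
    return X[base_idx] + gap * (X[nb_idx] - X[base_idx])
```

Because every synthetic point is a convex combination of two real minority samples, the new data stays inside the region already occupied by that class.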

Subsequently, all features underwent normalization, and One-Hot Encoding was employed to encode the labels of both datasets. Finally, a reshaping process was implemented for training and testing data subsets to ensure their formats conform to the consistent data structure required by all utilized models. One-Hot Encoding is a common approach for multi-class datasets [68,69,70,71].
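A minimal sketch of these three steps follows. The text does not state which normalization was used, so min-max scaling is an assumption here, as is the target shape (samples, 1, features), which treats each record as a single time step. All function names are illustrative.

```python
import numpy as np


def min_max_normalize(X, eps=1e-12):
    """Scale every feature column to the range [0, 1] (assumed scheme)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / np.maximum(hi - lo, eps)


def one_hot(labels, n_classes):
    """Encode integer class labels as one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, n_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


def to_sequence_input(X):
    """Reshape (samples, features) to (samples, 1, features)."""
    X = np.asarray(X)
    return X.reshape(X.shape[0], 1, X.shape[1])
```

When a held-out test set is used, the minimum and maximum would normally be computed on the training split only and reused for the test split, so that no test information leaks into preprocessing.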

4.3 Hyper-parameters tuning

Model optimization is a significant challenge in deep learning. Despite the widespread success of deep networks across various domains, several factors can hinder their effectiveness, including the selection of appropriate model parameters, the choice of architecture and feature representation, and the identification of optimal weight and bias values [72]. In this research, we used four deep learning algorithms: LSTM, BiLSTM, CNN-LSTM, and CNN. Our central concern was identifying the most suitable Hyper-parameters for these models. The effort went beyond finding the best architecture and parameters for each of the four models; it also sought Hyper-parameters capable of consistently delivering strong results across both datasets. Manually exploring diverse configurations and evaluating each one to find the optimal settings is onerous and time-consuming, especially in high-dimensional search spaces. Furthermore, finely tuned Hyper-parameter configurations established by an experienced user are closely tied to the specific application. Consequently, automated Hyper-parameter optimization is needed to reduce computational demands and minimize user intervention. Steepest gradient descent is unsuitable for optimizing Hyper-parameter configurations [72]. Instead, techniques such as Grid Search and Random Search are commonly used. Grid Search is a brute-force approach that evaluates all possible combinations of Hyper-parameter values [73], whereas Random Search takes a stochastic approach and explores random subsets of Hyper-parameter combinations [74]. The choice between the two depends largely on the number of Hyper-parameters under consideration: Grid Search is more convenient when the Hyper-parameter count is limited, whereas Random Search excels with a larger number of Hyper-parameters [74]. In our approach, we adopted Random Search to configure the models' architectures and determine the Hyper-parameters that consistently yield the best results across both datasets. Figure 2 [Model Tuning] visualizes the architecture design and Hyper-parameter tuning phase.
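The difference between the two strategies can be made concrete with a small sketch. The search space below is hypothetical and chosen only for illustration; it is not the space explored in this study.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "units": [64, 128, 256, 320],
    "dropout": [0.1, 0.2, 0.3],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "regularizer": ["l1", "l2", None],
}


def grid_configs(space):
    """Grid Search: enumerate every combination of candidate values."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]


def random_configs(space, n_trials, seed=0):
    """Random Search: draw a fixed number of configurations at random."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_trials)
    ]
```

For this space the grid contains 4 x 3 x 3 x 3 = 108 configurations and grows multiplicatively with every added Hyper-parameter, whereas the cost of Random Search is fixed by the chosen number of trials.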

The Random Search technique, as discussed in [74], was employed to select the model architecture and fine-tune the Hyper-parameters of the four models. Due to computational resource limitations, 150 trials were conducted for all models, with the objective of minimizing validation loss during training. To this end, 80% of the data was allocated for training and validation. Hyper-parameters were tuned on one dataset, and the resulting configuration was then trained and tested on the other dataset. We explored four combinations: (1) tuning on the KIMORE dataset and training on the UI-PRMD dataset; (2) tuning on the UI-PRMD dataset and training on the KIMORE dataset; (3) initial tuning on the KIMORE dataset followed by a second round of Hyper-parameter optimization on the UI-PRMD dataset; and (4) initial tuning on the UI-PRMD dataset followed by a second round of Hyper-parameter optimization on the KIMORE dataset. The most favorable results were obtained with the second combination, in which the model was tuned on the UI-PRMD dataset and subsequently trained and tested on the KIMORE dataset.

The remainder of this section describes the parameter values investigated for each model. For the (LSTM) model, the Random Search tuner was used to determine the number of (LSTM) layers, the number of (LSTM) units, the dropout rate, the learning rate (ranging from 0.0001 to 0.01), the type of regularizer (l1, l2, or none), and its associated strength [75]. The regularization strength, often denoted as lambda (\(\lambda \)), controls the amount of regularization applied to the model. A higher value of \(\lambda \) corresponds to stronger regularization, which penalizes larger weights more heavily and, in the case of l1, tends to produce a sparser model. Conversely, a lower value of \(\lambda \) reduces the regularization effect, allowing the model to have larger weights.
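The role of the regularization strength can be shown with a short numerical sketch. It uses the common definitions of the penalties (the sum of absolute weights for l1 and the sum of squared weights for l2, each scaled by \(\lambda \)); the function name and the example weights are illustrative assumptions.

```python
import numpy as np


def regularization_penalty(weights, kind, lam):
    """Return the penalty that is added to the training loss."""
    w = np.asarray(weights, dtype=float)
    if kind == "l1":
        return lam * np.abs(w).sum()     # pushes small weights toward zero
    if kind == "l2":
        return lam * np.square(w).sum()  # shrinks large weights most strongly
    if kind is None:
        return 0.0                       # the "none" option: no penalty
    raise ValueError(f"unknown regularizer: {kind!r}")
```

For the weights [1, -2, 3] and a strength of 0.1, the l1 penalty is 0.1 x 6 = 0.6 and the l2 penalty is 0.1 x 14 = 1.4. Doubling the strength doubles either penalty, which is why larger values constrain the weights more.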

A comprehensive set of Hyper-parameters was considered in the context of the (CNN-LSTM) model. These parameters encompassed several key aspects, including the count of convolutional filters, the kernel size, the quantity of (LSTM) units, the dropout rate, the learning rate, and the choice of regularization techniques. These techniques are applied to convolutional, dense, and (LSTM) layers, providing flexibility with options such as l1, l2, or none.

Similarly, for the (BiLSTM) model, an array of Hyper-parameters came into play. These parameters involved factors like the number of Bidirectional LSTM layers, the quantity of (LSTM) units, the dropout rate, the learning rate, and the selection of regularization techniques. Among the choices for regularization were options like l1, l2, or none.

Shifting the focus to the (CNN) model, the Hyper-parameters under consideration included the count of convolutional layers, the number of dense layers, convolutional units, dense units, the dropout rate, the application of regularization techniques to both convolutional and dense layers (with choices of l1, l2, or none), and the learning rate. These parameters were pivotal in shaping the model’s architecture and behavior during training and evaluation.

4.4 Action classification

LSTM variants stand out in action classification owing to their robust memory retention capabilities, which facilitate the effective capture and interpretation of temporal dependencies within sequential data, an essential aspect of action classification tasks. This advantage is well established in natural language processing and time-series analysis, where LSTM architectures consistently outperform conventional models in processing sequential information. The study investigates three LSTM variants: LSTM, Bidirectional LSTM (BiLSTM), and Convolutional Neural Network - LSTM (CNN-LSTM). All models process a 500-dimensional feature vector for the KIMORE dataset and a 330-dimensional feature vector for the UI-PRMD dataset. The output layer includes 20 classes for the UI-PRMD dataset and five classes for the KIMORE dataset. These model selections are motivated by previous studies that showcase the widespread adoption of LSTM networks in various domains, such as human activity recognition [16,17,18], chronic pain detection [19], and exercise classification [76, 77]. Similarly, Bidirectional LSTM (BiLSTM) has found extensive application in studies related to human activity recognition [20,21,22], exercise classification [78, 79], and other relevant domains. Convolutional Neural Networks (CNN) have demonstrated remarkable results in human activity recognition [23,24,25] and exercise classification [80,81,82,83,84]. The hybrid CNN-LSTM model has also been utilized in both human activity recognition [26, 27, 51] and physical rehabilitation domains [34, 52, 53].


4.4.1 Long short-term memory (LSTM)

Long Short-Term Memory (LSTM) [85] networks have emerged as a prominent and efficacious class of recurrent neural networks (RNNs) that can effectively model and process sequential data. In contrast to conventional RNNs, LSTMs mitigate the vanishing gradient problem through a gating mechanism that preserves long-range dependencies. This is achieved by regulating the flow of information through a network of memory cells, each equipped with input, output, and forget gates. The (LSTM) architecture’s adaptive gating mechanism enables it to capture and retain essential information over extended sequences, making it particularly well-suited for tasks involving temporal dynamics and sequential patterns. Moreover, variants such as Bidirectional LSTM (BiLSTM) and Convolutional Neural Network - LSTM (CNN-LSTM) have further extended the (LSTM) framework’s capabilities, allowing for the exploitation of bidirectional context and spatial features, respectively.
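The gating mechanism can be written out as a single time step in NumPy. This is a textbook-style sketch of a standard LSTM cell, not the trained model from this study; the stacked weight layout and the gate order (input, forget, candidate, output) are assumptions made for the illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,).
    The four blocks are stacked as: input, forget, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[:units])               # input gate: how much new info to write
    f = sigmoid(z[units:2 * units])      # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the memory cell
    o = sigmoid(z[3 * units:])           # output gate: how much memory to expose
    c_t = f * c_prev + i * g             # updated memory cell
    h_t = o * np.tanh(c_t)               # updated hidden state
    return h_t, c_t
```

The additive update of the memory cell (`f * c_prev + i * g`) is what lets information, and gradients, persist across many time steps.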

The Hyper-parameter tuning process determined that the model comprises a single (LSTM) layer with 320 units. A learning rate of 0.000501 was identified as the most effective value. To mitigate over-fitting, we introduced a dropout rate of 0.263337 and an L2 regularizer. Although dropout can increase training time, it contributes significantly to the model’s generalization capability. Batch normalization can also improve generalization, although dropout offers finer control over regularization strength [86]. Two dense layers, each comprising 940 units, complemented this architecture. Finally, the output layer used Softmax [87] activation, which produces a probability distribution over the classes and is therefore well suited to multi-class classification.

Fig. 4

The architecture of (LSTM) Model after applying Random Search for Hyper-tuning

Figure 4 visually illustrates the proposed model’s architecture, which was arrived at through a comprehensive exploration of various (LSTM) configurations and fine-tuning numerous model parameters via random search.

4.4.2 Bidirectional Long Short-Term Memory (BiLSTMs)

BiLSTMs, short for Bidirectional Long Short-Term Memory networks, introduce bidirectional processing by integrating two distinct (LSTM) layers that analyze input sequences in both forward and reverse directions [88]. This unique design allows the network to capture past and future context concurrently, significantly enhancing its ability to discern intricate temporal relationships within sequential data. Integrating information from both directions makes BiLSTMs exceptionally proficient in tasks demanding a profound understanding of context, such as natural language comprehension and sentiment analysis. This bidirectional processing effectively mitigates the limitations associated with unidirectional RNNs, which often struggle with capturing long-range dependencies and mitigating the vanishing gradient problem. Consequently, BiLSTMs exhibit superior performance on tasks requiring nuanced sequential information processing.

Following Hyper-parameter tuning, the parameters and architecture were determined to consist of two (BiLSTM) layers, each comprising 271 units. Additionally, a learning rate of 0.001014 was identified as the value for the learning process. A dropout rate of 0.3 was used to mitigate the over-fitting problem further. The model also incorporated four dense layers, each containing 927 units with ReLU activation, except for the output layer, which utilized softmax activation. The architectural layout presented in Fig. 5 was devised after evaluating various (BiLSTM) implementations and fine-tuning multiple parameters using a random search approach.

Fig. 5

The architecture of (BiLSTM) Model after applying Random Search for Hyper-tuning

4.4.3 Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN), initially pioneered by Yann LeCun et al., [89], represent a groundbreaking innovation in deep learning and computer vision. Designed initially for image recognition tasks, (CNNs) are inspired by the hierarchical feature extraction mechanisms of the human visual system, replicating the autonomous process through which the human brain distills relevant features from input data. Comprising convolutional layers for feature extraction followed by pooling layers for spatial dimension reduction, CNNs exhibit a hierarchical feature learning capability that enables them to discern intricate patterns and features within datasets.

In the context of exercise classification, the superior performance of the CNN model can be attributed to its adeptness in extracting hierarchical spatial features from skeletal time-series data, such as that captured by Kinect cameras during exercise execution. The model’s robustness to spatial transformations and its capacity for hierarchical feature abstraction contribute to its remarkable accuracy rates in discerning various exercises. Integrating these architectural features within the CNN framework enhances its efficacy for exercise classification tasks, showcasing its prowess in leveraging spatial relationships for accurate and robust recognition of distinct movement patterns.

Moreover, CNNs have demonstrated remarkable adaptability and versatility beyond image analysis, expanding their application to diverse domains. Notably, CNNs are extensively utilized in disease classification within physical rehabilitation [90,91,92,93]. This highlights the broad impact of CNNs in leveraging sensor data for tasks ranging from human activity recognition to disease classification.

Through Hyper-parameter tuning, the parameters and architecture were determined to comprise two convolutional layers, each housing 48 convolutional units with a 3x3 filter size. A Max-Pooling Layer and a Flatten Layer followed this. Additionally, a dense layer consisting of 544 units with ReLU activation was incorporated, followed by a dropout layer with a rate of 0.2 to address potential over-fitting. A learning rate of 0.0025284 was identified as the value for the learning process, and the model concluded with an output layer featuring softmax activation. Figure 6 visually represents the architecture of the proposed model. This architectural design was adopted after experimenting with various (CNN) implementations and fine-tuning multiple model parameters using a random search approach.

Fig. 6

The architecture of (CNN) Model after applying a Random Search for Hyper-tuning

4.4.4 CNN-LSTM

The (CNN-LSTM) architecture seamlessly integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, harnessing both spatial and temporal information synchronously [94]. This fusion of CNNs for spatial feature extraction and LSTMs for temporal pattern capture excels in video analysis and action recognition tasks. Its significance is particularly pronounced in addressing challenges involving visual content and its temporal evolution. This architecture finds versatile applications in fields like video surveillance, medical imaging, and natural language processing, enriching insights from sequential data by considering spatiotemporal relationships. In essence, (CNN-LSTM) offers an efficient amalgamation of CNNs and LSTMs, providing a robust approach for deciphering intricate data.

This study introduces a hybrid (CNN-LSTM) model for classification. Here, the (CNN) serves as the feature extractor, while the (LSTM) handles classification, as illustrated in Fig. 7. The architecture begins with a convolutional layer comprising 128 filters with a kernel size of 8, along with L2 and bias regularizers set at 0.002276. A max-pooling layer and a dropout layer with a rate of 0.2 follow. Subsequently, an (LSTM) layer with 256 units is employed, followed by a flatten layer. Two dense layers are then added, each with 525 units and ReLU activation, complemented by a dropout rate of 0.2. The learning rate identified by the tuning process is 0.000777. Finally, the output layer uses softmax activation. This architecture was adopted after experimenting with various (CNN-LSTM) implementations and fine-tuning multiple model parameters using a random search approach.

Fig. 7

The architecture of (CNN-LSTM) Model after applying a Random Search for Hyper-tuning

4.5 Model deployment and monitoring

Deploying a model in a production environment is a pivotal phase of its operational integration. Consequently, adopting model deployment with MLOps, as outlined in [95], becomes necessary. To facilitate model consumption, we considered designing an API with a Python framework, notably Flask, drawing on articles [96, 97] that focus on disease classification with Flask and its integration with machine learning models. Python was selected for its readily available packages, such as TensorFlow and Pandas [95]. The model can be deployed to production on various cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. In this MLOps-driven deployment, an automated pipeline handles continuous monitoring, incorporating automated testing, validation services, and responsive actions to maintain model accuracy. The MLOps architecture supports the full machine learning life-cycle, ensuring adaptive retraining of the model and consistent alignment with predefined performance standards. Automated retraining can run on a predefined schedule, with thorough documentation and logging of the entire process. This includes logging the inputs, outputs, and predictions of each inference to facilitate auditing and debugging.
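The inference-and-logging step can be sketched independently of any web framework. The handler below is a hypothetical illustration: the function name, payload format, logger name, and response fields are all assumptions, and `predict_fn` stands in for the trained model. In a Flask application, a function like this could be called from a route that parses the request body.

```python
import json
import logging

logger = logging.getLogger("rehab_model_service")  # hypothetical logger name


def handle_predict(payload, predict_fn, class_names):
    """Validate a request, run the model, and log input and output.

    payload: dict parsed from the request body, e.g. {"features": [...]}.
    predict_fn: callable returning one score per class (stand-in for the model).
    Returns a (response_dict, http_status) pair.
    """
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        return {"error": "payload must contain a non-empty 'features' list"}, 400

    scores = predict_fn(features)
    best = max(range(len(scores)), key=scores.__getitem__)
    response = {"label": class_names[best], "score": float(scores[best])}

    # Log every inference so predictions can be audited and debugged later.
    logger.info("inference %s", json.dumps({"input": features, "output": response}))
    return response, 200
```

Keeping the handler free of framework code makes it straightforward to unit-test and to reuse if the serving layer changes.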

5 Experiments

Two experiments were conducted in this research. The first experiment sought the best exercise classification algorithm across both datasets. The second experiment classified different diseases among patients performing the same five exercises in the KIMORE dataset.

5.1 Experiment 1: finding the best algorithm for exercise classification

The experiment assessed and compared the performance of distinct (LSTM) variants as well as (CNN) models. The dataset was partitioned into training and testing subsets at an 80-20 ratio. The training subset was further divided at an 80-20 ratio into training and validation data. A 5-fold cross-validation approach was adopted to mitigate over-fitting, using the same number of folds as [35], with a consistent batch size of 32 for all models; this batch size aligns with the methodology employed in [29]. All models used categorical cross-entropy as the loss function and were trained for up to 450 epochs. Early stopping, a regularization technique that halts training when the model starts over-fitting the training data [98], was applied by monitoring the validation loss with a patience of 30 epochs. All models were trained simultaneously on the same machine, which had 15 GB of GPU memory, using the same computational resources. Adam [99] (short for Adaptive Moment Estimation) was used as the optimization algorithm. It is an extension of stochastic gradient descent (SGD) that adapts the learning rates of individual model parameters. Figure 2 [Data Splitting] visualizes the data splitting phase, while [Training] visualizes the training process and Hyper-parameter optimization.
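The early-stopping rule can be sketched in plain Python. The patience of 30 and the 450-epoch budget are taken from the text; the class and function names, the `min_delta` parameter, and the use of a precomputed list of validation losses in place of real training are illustrative assumptions.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=30, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # new best: reset the waiting counter
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def train_with_early_stopping(val_losses, patience=30, max_epochs=450):
    """Replay per-epoch validation losses; return (epochs_run, best_loss)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch, stopper.best
    return min(len(val_losses), max_epochs), stopper.best
```

With a patience of 2, the loss sequence 1.0, 0.8, 0.9, 0.95, 0.85 stops after the fourth epoch, because epochs three and four both fail to improve on the best value of 0.8.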

5.2 Experiment 2: finding the best algorithm for disease classification

This experiment assessed and compared the efficacy of various (LSTM) variants and (CNN) models in disease classification using the KIMORE dataset. We followed the experimental protocol established in Experiment 1: a consistent 5-fold cross-validation approach, an identical early stopping callback, training for up to 450 epochs, and the same categorical cross-entropy loss function with a batch size of 32.

To rectify class imbalance, we implemented the SMOTE technique for oversampling.

5.3 Evaluation metrics

The selection of specific evaluation metrics is a crucial aspect of assessing the performance of machine learning models, and it involves a thoughtful consideration of the dataset’s characteristics and the study’s goals. In this work, we opted for a set of well-established metrics, namely loss, accuracy, precision, recall, and F1-score, to comprehensively evaluate the proposed models.

Our choice is motivated by the balanced nature of the datasets. While accuracy is a commonly used metric that provides a general overview of overall model correctness, its effectiveness can be influenced by the class distribution. In a balanced dataset, accuracy is a suitable metric, and its inclusion ensures a broad assessment of model performance.

On the other hand, precision, recall, and F1-score are particularly relevant in scenarios where a balanced consideration of false positives and false negatives is imperative. Precision assesses the accuracy of positive predictions, recall measures the ability to capture all relevant instances, and F1 score balances precision and recall. These metrics become especially significant when the consequences of false positives and false negatives differ in importance.
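These metrics can be computed from a confusion matrix, as in the following sketch. The text does not state how per-class scores were averaged, so macro averaging is an assumption here, and the function names are illustrative.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # TP + FP per class
    actual = cm.sum(axis=1)     # TP + FN per class

    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }
```

Classes that are never predicted (or never occur) are given a score of zero rather than raising a division error, which is one common convention.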

Furthermore, incorporating loss as an evaluation metric offers insights into the convergence and optimization process during training. Minimizing the loss is fundamental to achieving accurate predictions, making it a valuable metric for assessing the training dynamics.
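For reference, the categorical cross-entropy loss on softmax outputs can be sketched as follows. The function names and the clipping constant are illustrative assumptions; deep learning frameworks provide their own implementations.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def categorical_cross_entropy(y_true_one_hot, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true_one_hot * np.log(probs), axis=1)))
```

As a point of reference, a model that assigns equal probability to each of five classes has a loss of ln 5, approximately 1.609, so loss values well below that indicate that the model has learned to concentrate probability on the correct class.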

In alignment with related work that reports accuracy and F1-score, our choice of metrics aims to provide a robust evaluation framework that considers several aspects of model performance. This approach gives a nuanced understanding of the proposed models’ effectiveness in terms of both overall correctness and the balance between precision and recall, contributing to the rigor and completeness of our evaluation.

6 Results

The Results section is organized into four main parts: Exercise Classification (Section 6.1), a Comparison of State-of-the-art Methods (Section 6.2), Disease Classification (Section 6.3), and an overview of Real-world Applications (Section 6.4).

6.1 Exercise classification

The (CNN) model demonstrates the most efficient convergence in training, requiring the fewest iterations to achieve lower loss values during the training process on both datasets. The exercise classification results are further divided into two subsections, one for each dataset.

6.1.1 KIMORE

In our experimental investigation of the KIMORE dataset, the Hyper-tuned (CNN) model emerged as the top-performing candidate, outperforming the other models across all five evaluation metrics. It achieved a mean accuracy of 93.08%, precision of 93.07%, recall of 93.96%, an F1-score of 91.79%, and a loss value of 0.2860. Following closely, the (CNN-LSTM) and (BiLSTM) models also delivered commendable results, with accuracy rates of 91.79% and 87.18%, respectively. Notably, the (BiLSTM) model exhibited a shorter mean training duration than the (CNN-LSTM) model, which consumed the longest duration of 7.88 minutes on average per fold. In contrast, the (CNN) model incurred the shortest training time, averaging 2.3 minutes. Although the (CNN) model is computationally expensive and resource-intensive due to its large number of trainable parameters (6,505,349), it benefits from early stopping, requiring the fewest iterations to converge, as demonstrated in Fig. 8.

Fig. 8

Iterations results on the KIMORE dataset, featuring five subgraphs illustrating the performance comparison of four models across different evaluation metrics: [a] Accuracy, [b] Precision, [c] F1-score, [d] Recall, and [e] Loss

However, the (LSTM) model displayed a comparatively less impressive performance in this experimental context, yielding an average accuracy rate of 82.32%. This discrepancy may be attributed to the intrinsic nature of (LSTM) models, which thrive when provided with a substantial number of time steps. In our experiment, we deliberately constrained the time steps to a single record.

Simultaneously, Fig. 8 presents a graphical representation of the precision metrics in [b], the F1-score in [c], and the recall metric in [d]. These visual presentations offer valuable insights into the progressive convergence patterns exhibited by the models during the training phase when utilizing the validation dataset.

Additionally, as depicted in Fig. 8 [e], we present graphical representations of the loss values for all four models during the final fold, which is the lowest-scoring fold among the five, on the KIMORE dataset. The corresponding number of training iterations is also included in the visualization. Furthermore, Fig. 8 [a] provides a visual overview of the recorded accuracy metrics.

Table 4 presents a comprehensive comparison of mean results obtained by each of the four models, employing diverse evaluation metrics during the rigorous testing phase. Correspondingly, Fig. 9 supplements this analysis by providing a whisker box chart, which highlights the distribution of performance metrics, including minimum, maximum, mean, and interquartile range (IQR) values for the four models across the four evaluation metrics. This visual representation offers a detailed overview of the performance variations and statistical distribution of results for each model across all 5 folds.

Fig. 9

Key performance metrics, encompassing accuracy, precision, recall, and F1 score are presented through whisker boxes for the 4 models. This visual representation underscores optimal outcomes on the KIMORE dataset for exercise classification following a thorough 5-fold cross-validation

Table 4 The Mean Performance Metrics of Various Models on the KIMORE dataset, Alongside the Duration Time for Each Fold (Measured in Minutes)

6.1.2 UI-PRMD

For the second dataset, UI-PRMD, the (CNN) model showcased exceptional performance across a comprehensive spectrum of evaluation metrics, encompassing Accuracy, Precision, Recall, F1-Score, and Loss. The (CNN) model also exhibited remarkable efficiency, with an average training duration of just 2.15 minutes across five-folds, thanks to its reduced iteration requirements. This model achieved outstanding accuracy (99.70%), precision (99.70%), recall (99.95%), and F1-score (99.70%), along with a minimized loss value of 0.0122.

In the second position, the (CNN-LSTM) model achieved commendable results, with an accuracy of 98.85%, precision of 98.85%, recall of 98.90%, F1-score of 98.90%, and a loss value of 0.0676. However, it necessitated a relatively extended training duration, with an average training time of 8.45 minutes. The (CNN-LSTM) model, despite its complexity with 16,799,497 parameters, demonstrated competitive performance and required slightly fewer iterations compared to other algorithms, as shown in Fig. 8.

Conversely, both the Bi-LSTM and (LSTM) models delivered acceptable results, with accuracy rates of 96.65% and 98.75%, respectively. Notably, the (LSTM) model exhibited significantly shorter training times than Bi-LSTM, with an average of 4.18 minutes versus 13.45 minutes, making it nearly three times faster. These outcomes emphasize the superiority of the (CNN) model, the competitive performance of the (CNN-LSTM) model, and the limitations observed in the (LSTM) model within this specific experimental context.

Furthermore, Table 5 compares the mean results attained by each of the four models on the evaluation metrics during the testing phase. In addition, Fig. 10 presents a box-and-whisker chart of the models' results on the UI-PRMD dataset. As depicted in the figure, (CNN) surpassed all other models in performance and exhibited the least variance across all five folds.
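The per-fold aggregation behind such mean results can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `kfold_indices` and `summarize` are hypothetical, the folds are contiguous and unshuffled for simplicity, and the accuracy values are made up for demonstration.

```python
from statistics import mean, stdev


def kfold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def summarize(per_fold_scores):
    """Return (mean, sample standard deviation) of one metric over the folds."""
    return mean(per_fold_scores), stdev(per_fold_scores)


# Hypothetical per-fold accuracies (%), for illustration only.
fold_accuracy = [99.5, 99.8, 99.7, 99.6, 99.9]
avg, spread = summarize(fold_accuracy)
print(f"accuracy: {avg:.2f} +/- {spread:.2f}")
```

Reporting the spread alongside the mean makes it easier to compare the stability of models across folds, which is what the box-and-whisker chart visualizes.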

Table 5 The Results of the different models on the UI-PRMD dataset
Fig. 10

Key performance metrics (accuracy, precision, recall, and F1 score) are presented as box-and-whisker plots for the four models. The plots show the results on the UI-PRMD dataset after 5-fold cross-validation.

6.2 Comparison of state-of-the-art

This section provides a comparison of the current state-of-the-art methodologies applied to the KIMORE and UI-PRMD datasets for the classification of exercises.

The Tuned-CNN model performs strongly against previous studies on the same dataset (KIMORE), achieving an accuracy of 93.08%. To the best of our knowledge, this improves by 0.75% on the state-of-the-art methods reported by Abedi et al., who employed LSTM, and a 1D-CNN followed by a fully connected neural network. It also surpasses the 3D-CNN used by Barzegar et al. [35] by 2.51%. Table 6 compares the results of prior research with the mean value of our highest-scoring model.

Table 6 Comparison of our Tuned-CNN model with state-of-the-art methods on the KIMORE dataset

Table 7 compares our proposed models with other relevant works on the UI-PRMD dataset. Notably, our Tuned-CNN model achieves an accuracy of 99.7%, outperforming existing state-of-the-art methods such as Ensemble-based Graph Convolutional Network (EGCN) [39], Graph Convolutional Siamese Network [37] by 0.5%, and Extra Tree Classifier after incorporating the FCBF feature ranking technique [100] by 0.1%. This comparison underscores the superior performance of our tuned CNN model in accurately classifying exercises on the UI-PRMD dataset.

Table 7 Comparison of our Tuned-CNN model with state-of-the-art methods on the UI-PRMD dataset

6.3 Disease Classification

In this experiment, our primary objective shifted from exercise classification to identifying the patient's disease from the performed exercise. Rather than conducting hyper-parameter tuning to create entirely new models, we retained the original models' architectures and hyper-parameters. The results were less promising, primarily because these models were initially designed and optimized for exercise classification, not disease identification. Although we used the same dataset, the fundamental goal of this experiment differed significantly from the previous one. Consequently, while we expected tuned models to demonstrate excellent performance on this specific dataset, their performance on other datasets remained uncertain.

Therefore, we decided not to perform model tuning on the new dataset to avoid over-fitting. Furthermore, it is important to note that using the same architecture on other available datasets for disease classification is not feasible, as such datasets are scarce.

Fig. 11

Key performance metrics (accuracy, precision, recall, and F1 score) are presented as box-and-whisker plots for the four models. The plots show the results on the KIMORE dataset for disease classification after 5-fold cross-validation.

Table 8 The Average Performance Metrics of Various Models on the KIMORE dataset for Disease Classification

We present the results as a box-and-whisker chart to give an overview of the models' performance consistency across folds in disease classification. This chart, shown in Fig. 11, displays the distribution of accuracy values across the five folds. Each box represents the interquartile range (IQR) of accuracy, with the central line denoting the median accuracy. Outliers, if any, are also highlighted, offering insight into the variability of model performance. This graphical representation clarifies the stability and reliability of the proposed models over multiple folds, which is valuable for assessing the robustness of the disease classification system. Statistical measures such as mean, standard deviation, and confidence intervals complement the visual interpretation of the models' accuracy across folds.
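The quantities a box-and-whisker plot encodes can be computed directly. The sketch below is illustrative only: `box_stats` is a hypothetical helper, the 1.5 × IQR outlier rule is the common plotting convention and is assumed here rather than taken from the paper, and the sample values are made up.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Return quartiles, IQR, and outliers under the whisker * IQR rule."""
    # 'inclusive' treats the data as the full population of fold scores.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical per-fold accuracies (%), for illustration only.
print(box_stats([88.0, 89.0, 90.0, 91.0, 92.0]))
```

A narrow IQR with no outliers indicates that a model behaves consistently from fold to fold.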

Once again, the (CNN) model outperformed all other algorithms, followed closely by CNN-LSTM, with a margin of only 1.26 percentage points. Specifically, the (CNN) model achieved an accuracy of 89.87%, while (CNN-LSTM) scored 88.61%. The (CNN) model also trained significantly faster, approximately 2.8 times faster than CNN-LSTM. In contrast, the Bi-LSTM model scored an average accuracy of 61.52%, while the (LSTM) model struggled with just 55.06%. Table 8 lists the mean results across all metrics over the five folds.

It is essential to emphasize that while the results may not be outstanding, this represents an initial experiment. No model or hyper-parameter tuning was performed, owing to the limited availability of datasets encompassing multiple diseases. Within the scope of our research, other datasets for disease classification typically comprise only binary classes, namely Normal and Affected.

For disease classification, the performance evaluation of our proposed models includes a detailed analysis based on the confusion matrix and an ANOVA test. The confusion matrix after the 5th fold, which is considered the best fold, visualizes the model's classification performance across four distinct classes: normal, Parkinson's disease, back pain, and stroke. This information is shown in Fig. 12, which illustrates the model's classification outcomes for each class. The visualization helps in interpreting how effectively the models categorize patients into the respective disease classes based on the performed exercises. It also gives further insight into the precision, recall, and F1-score metrics, contributing to a fuller understanding of the models' disease classification capabilities.
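To show how precision, recall, and F1-score follow from a confusion matrix, here is a small sketch. The function names and the matrix entries are hypothetical and do not reproduce the values in Fig. 12. Rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(labels)
    metrics = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(cm[k]) - tp                       # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Hypothetical 4-class matrix, for illustration only.
labels = ["normal", "Parkinson's disease", "back pain", "stroke"]
cm = [[10, 0, 1, 0],
      [0, 3, 0, 0],
      [1, 0, 4, 0],
      [0, 0, 0, 2]]
for name, (p, r, f1) in per_class_metrics(cm, labels).items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(cm):.2f}")
```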

Fig. 12

Confusion Matrix illustrating disease classification results obtained from the (CNN) model after the 5th fold

Table 9 ANOVA results comparing four model architectures on the KIMORE dataset for disease classification

ANOVA (Analysis of Variance) was conducted to assess the statistical differences in disease classification performance among the four models (Bi-LSTM, CNN, CNN-LSTM, and LSTM) applied to the KIMORE dataset. Table 9 presents the ANOVA results. The top section reports the count, sum, average, and variance for each model, while the lower section lists the source of variation, Sums of Squares (SS), Degrees of Freedom (df), Mean Squares (MS), F-ratio (F), P-value, and the critical F value (F crit). The F-ratio is the ratio of between-group variance to within-group variance. Here, an F-ratio of 7.61 with a low P-value of 0.0022 indicates statistically significant differences in disease classification performance among the models.
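The quantities in a one-way ANOVA table can be derived as in the sketch below. It is a plain-Python illustration with a hypothetical helper `one_way_anova` and made-up per-fold accuracies. It does not reproduce Table 9, and it omits the P-value, which requires the F distribution (available, for example, in SciPy).

```python
def one_way_anova(groups):
    """Return SS, df, MS, and the F-ratio for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group: scatter of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical accuracies (%) for four models over five folds each.
scores = {
    "Bi-LSTM": [60.0, 62.0, 61.0, 63.0, 59.0],
    "CNN": [89.0, 90.0, 91.0, 88.0, 90.0],
    "CNN-LSTM": [88.0, 89.0, 87.0, 90.0, 88.0],
    "LSTM": [55.0, 54.0, 56.0, 57.0, 53.0],
}
result = one_way_anova(list(scores.values()))
print(result["df_between"], result["df_within"], round(result["F"], 2))
```

With four models and five values per model, the degrees of freedom are 3 (between groups) and 16 (within groups).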

6.4 Real-world applications of the proposed model

The adoption of advanced machine learning technologies in physical rehabilitation not only marks a significant advance in healthcare but also contributes to sustainability in the medical sector. By leveraging models like 1D CNN, LSTM, BiLSTM, and CNN-LSTM, rehabilitation processes become more efficient and personalized, reducing the need for frequent physical visits and thus, the carbon footprint associated with healthcare delivery. This approach supports the Sustainable Development Goal (SDG) 3, aimed at ensuring health and well-being for all, by making rehabilitation more accessible and reducing inequalities in health services. Moreover, the emphasis on Human Activity Recognition (HAR) and the use of digital and virtual reality tools in rehabilitation align with the broader goals of sustainable healthcare by optimizing resource use and enhancing the quality of care. These innovations not only improve patient outcomes but also align with global efforts towards universal health coverage and sustainability in healthcare, showcasing a commitment to a future where healthcare is both advanced and environmentally responsible.

6.4.1 Case study 1: remote patient monitoring

The model enables healthcare providers to monitor patients' rehabilitation progress remotely, which is especially important for patients who cannot frequently visit healthcare facilities due to distance, cost, or mobility issues.

6.4.2 Case study 2: physical therapy and rehabilitation

The model can automate the assessment of patients' performance during physical therapy sessions, providing real-time feedback to both patients and therapists. This can enhance recovery.

6.4.3 Case study 3: home-based rehabilitation

The model offers a way for patients to perform guided exercises at home with virtual supervision, increasing the accessibility and frequency of rehabilitation sessions. With VR, patients can engage in gamified rehabilitation exercises, making home-based therapy more enjoyable and effective. AR can overlay instructional content on the patient's environment, guiding them through exercises correctly.

6.4.4 Case study 4: performance assessment in sports

The model can assist coaches and athletes in analyzing performance for training effectiveness, injury prevention, and technique improvement by providing detailed insights into the execution of each movement.

6.4.5 Case study 5: elderly care and fall prevention

The model can monitor the elderly for sudden movements or falls and assess their physical activity levels, in order to customize fall prevention exercises and improve overall mobility and independence.

6.4.6 Case study 6: interactive fitness applications

The model can be integrated with fitness apps and devices to give users feedback on their workout routines, ensuring exercises are done correctly to maximize benefits and reduce the risk of injury. Fitness apps can leverage AR to project digital trainers into the user's space, providing personalized workout guidance, while VR can immerse users in engaging virtual landscapes to enhance the workout experience.

6.4.7 Case study 7: educational tools for physiotherapy

The proposed model can be used for disease classification and diagnosis, especially for conditions like Parkinson’s, by analyzing movement data for early detection and personalized care plans.

6.4.8 Case study 8: rehabilitation research

This model can be particularly effective in identifying patterns and anomalies in physical movements associated with disorders such as Parkinson’s disease and various gait abnormalities. Through detailed analysis of exercise performance data, the model aids in the early detection of these conditions, offering a significant improvement in diagnostic accuracy and the potential for early intervention. This application underscores the model’s utility in not just rehabilitation but also in the proactive management and diagnosis of movement-related health issues, showcasing a versatile approach to healthcare.

6.4.9 Case study 9: disease classification and diagnosis through movement analysis

The proposed model can be used in disease classification and diagnosis, especially for conditions like Parkinson’s, by analyzing movement data for early detection and personalized care plans.

7 Discussion

Using the Random Search technique to design diverse network architectures and fine-tune hyper-parameters significantly impacted model performance. Our proposed Tuned-CNN model achieved state-of-the-art results on both the KIMORE and UI-PRMD datasets. Similarly, the Tuned-CNN-LSTM model demonstrated competitive performance on both datasets, highlighting its adaptability and effectiveness.
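The general idea of random search can be sketched as follows. This is not the authors' search code: the search space, the parameter names, and `toy_objective` (a stand-in for training a model and returning its validation accuracy) are all hypothetical and chosen only so the example runs on its own.

```python
import random

# Hypothetical search space; the paper's actual ranges are not reproduced here.
SEARCH_SPACE = {
    "filters": [32, 64, 128],
    "kernel_size": [3, 5, 7],
    "dropout": [0.1, 0.3, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}


def sample_config(space, rng):
    """Draw one value per hyper-parameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in space.items()}


def random_search(space, evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(space, rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg):
    """Stand-in for validation accuracy; a real run would train a model."""
    return -abs(cfg["filters"] - 64) - abs(cfg["kernel_size"] - 5) - cfg["dropout"]


best_cfg, best_score = random_search(SEARCH_SPACE, toy_objective)
print(best_cfg, best_score)
```

Unlike grid search, random search evaluates a fixed budget of configurations regardless of how many hyper-parameters are in the space.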

Interestingly, the (CNN) model, despite its complexity and larger number of trainable parameters, learned efficiently and had the shortest training times. This efficiency resulted from requiring the fewest iterations and benefiting from early stopping. In contrast, the Tuned-CNN-LSTM and Tuned-BiLSTM models demanded greater computational resources and longer training times. Notably, the (CNN) model achieved an accuracy of 99.70% on the UI-PRMD dataset and 93.08% on the KIMORE dataset for exercise classification. Finally, we recommend prioritizing our Tuned-BiLSTM architecture over the Tuned-LSTM model.

The (CNN) and (CNN-LSTM) models also demonstrated impressive performance in disease classification on the KIMORE dataset, achieving accuracy rates of 89.87% and 88.61%, respectively.

Conversely, the (LSTM) model demonstrated consistent but sub-optimal results across both datasets. This performance discrepancy may be attributed to the algorithm's inherent characteristics, particularly the LSTM's need for more timesteps. During our feature engineering phase, we generated only one timestep per record, potentially limiting the LSTM's capacity to excel.

The results highlighted the strong performance of the tuned (CNN) and (CNN-LSTM) models, which consistently outperformed the others across all evaluation metrics. The (CNN) model achieved accuracy scores of 93.08% and 99.7% on the KIMORE and UI-PRMD datasets, respectively, surpassing previous research by 0.75% on the KIMORE dataset and 0.1% on the UI-PRMD dataset. Our tuned (CNN) and (CNN-LSTM) models yielded very close results for disease classification on the KIMORE dataset, with accuracy rates of 89.87% and 88.61%, respectively. The (CNN) model outperformed the (CNN-LSTM) model by 1.26 percentage points while also training significantly faster.

Despite the promising outcomes of our study, it is essential to address certain limitations. The proposed (CNN) model, while achieving remarkable performance, comes with the requirement of an additional preprocessing step. It assumes that the video footage capturing the exercise is already cropped, precisely marking the start and end of the movement. This assumption implies a reliance on preprocessed data, potentially limiting the model’s adaptability to scenarios where such precise cropping is not readily available. Consequently, the generalizability of the proposed model may be influenced by the availability and accuracy of preprocessed datasets, necessitating careful consideration and potentially constraining its applicability in real-world situations where detailed annotations may be challenging to obtain.

8 Conclusion and future work

In conclusion, this research has significantly advanced the field of physical therapy and exercise classification through innovative methodologies. The introduction of a pioneering approach to feature engineering, representing exercises as 1D vectors using various statistical techniques, offers a novel perspective for exercise representation. Additionally, optimizing model performance by applying a random search methodology for designing the architectural framework of the proposed models contributes to increased efficiency and a deeper understanding of model behavior.
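As a rough illustration of representing an exercise as a 1D vector of statistics, the sketch below summarizes each joint-coordinate channel over time. The choice of statistics (mean, population standard deviation, minimum, maximum) and the helper name `exercise_to_vector` are assumptions made for this example. The set of statistics the authors used may differ.

```python
from statistics import mean, pstdev


def exercise_to_vector(frames):
    """Collapse a (time x channels) recording into one flat feature vector.

    frames: list of frames, each a list of joint coordinates of equal length.
    Returns 4 statistics per channel, concatenated channel by channel.
    """
    channels = zip(*frames)  # one series per joint coordinate over time
    vector = []
    for series in channels:
        vector.extend([mean(series), pstdev(series), min(series), max(series)])
    return vector


# Hypothetical recording: 3 frames, 2 coordinate channels.
recording = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
print(exercise_to_vector(recording))
```

Because the time axis is summarized away, recordings of different lengths map to vectors of the same size, which is also why each record ends up with a single timestep.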

A substantial contribution is evident in the detailed comparative analysis across four distinct models (LSTM, Bi-LSTM, CNN, and CNN-LSTM) using two diverse datasets. This comparative study provides valuable insights into the strengths and weaknesses of each model in exercise classification, offering a nuanced understanding of performance variations. Moreover, the study’s experimentation with multiple datasets, utilizing cross-validation techniques to derive results from testing data, ensures robust model generalization performance across diverse datasets, enhancing the credibility of the study’s findings.

The CNN model attained accuracy scores of 93.08% and 99.7% on the KIMORE and UI-PRMD datasets, respectively, surpassing the state-of-the-art on these datasets by 0.75% and 0.1%, respectively. Moreover, the model demonstrated notable proficiency in disease classification, enabling the detection of correct and incorrect exercise techniques and achieving a disease diagnosis accuracy of 89.87%.

Beyond its strong performance, the (CNN) model has a lightweight architecture, making it suitable for deployment on low-power devices, for example within mobile apps. Its efficient use of resources also makes it computationally cheaper than video-based and image-based approaches. The model generalizes well across diverse datasets, supporting its applicability to varied scenarios. This combination of performance, resource efficiency, and generalization positions the (CNN) model as a promising and versatile solution for exercise classification and disease identification in physical rehabilitation.

Adopting MLOps methods contributed to the smooth integration of machine learning into the software development life-cycle, improving collaboration, reproducibility, and scalability. This approach ensures deployed models’ long-term efficacy and durability in dynamic operational situations.

In summary, this research’s multifaceted contributions not only advance the understanding of physical rehabilitation and exercise classification but also extend the utility of the proposed models to disease identification. This marks a significant advancement in research and practical applications, contributing to the broader knowledge base and fostering future developments in the field.

Moreover, an area warranting further investigation is accuracy on the KIMORE dataset. Despite the strong results, there is room for in-depth study to uncover potential refinements and optimizations, such as refining the model architecture, improving the feature engineering techniques, or adding data augmentation strategies. Future research could also explore attention-based mechanisms to enhance model robustness. Specifically, implementing a fixed or overlapping window strategy might overcome the limitation of requiring a precise crop of the video data, facilitating real-time applications. This approach could improve the model's adaptability to scenarios where detailed annotations or precise cropping are difficult to obtain. Additionally, integrating Transfer Learning into our research paradigm is a promising, albeit challenging, avenue. The dissimilarities between the UI-PRMD and KIMORE datasets, including variations in anatomical joint data and exercise types, present hurdles that need creative solutions. Addressing these challenges will pave the way for more comprehensive and effective model transferability across diverse settings and scenarios.
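A fixed or overlapping window strategy of the kind mentioned above could look like the following sketch. The function name `sliding_windows` and its parameters are hypothetical. A stride equal to the window size gives fixed, non-overlapping windows, and a smaller stride gives overlapping ones.

```python
def sliding_windows(frames, size, stride):
    """Return consecutive windows of `size` frames, advancing by `stride`.

    Trailing frames that do not fill a complete window are dropped.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    last_start = len(frames) - size
    return [frames[s:s + size] for s in range(0, last_start + 1, stride)]


stream = list(range(10))  # stand-in for 10 frames of an uncropped recording
print(sliding_windows(stream, size=4, stride=2))  # overlapping windows
print(sliding_windows(stream, size=5, stride=5))  # fixed windows
```

Each window could then be classified independently, so the start and end of the movement would not have to be marked in advance.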