Introduction

Advances in sensor technology have led to a surge of interest in recognizing human activities based on sensor data, owing to its wide-ranging applications in everyday life, such as medical care, movement analysis, intelligent monitoring systems, and smart homes. Human activity recognition (HAR) aims to study the details of human behavior in order to understand and predict specific actions. Behavior data can be obtained through various means, including accelerometers, infrared sensors, RFID, and video recordings. Currently, HAR can be categorized into two main classes: video-based and sensor-based HAR. Video-based HAR systems primarily rely on cameras to capture videos and images, utilizing computer vision technology to identify human actions and behaviors. While these techniques often yield satisfactory results, they are susceptible to environmental factors such as lighting conditions, occlusion, and privacy concerns. In contrast, sensor-based systems employ environmental or wearable sensors to determine human actions1. These sensors are commonly embedded in smart devices like smartphones and smartwatches. Given the ubiquity and indispensability of such devices in our daily lives, sensor-based approaches offer an immediate solution for HAR research2,3. The adoption of sensor-based HAR holds immense potential across numerous practical domains. For instance, it can be employed to develop advanced movement tracking systems in healthcare settings4, benefiting elderly individuals and disabled persons. Furthermore, it can facilitate automatic interpretation of player actions in sports5, enabling more streamlined analysis. Additionally, sensor-based HAR enables user identification and verification in surveillance systems by analyzing gait characteristics6. Lastly, it contributes to human-robot interactions through gesture recognition7. Harnessing the power of sensor-based HAR can bring significant advantages to these diverse sectors.

The utilization of wearable sensors in HAR has traditionally presented a complex challenge, since it requires classifying multivariate time-series data. A crucial aspect of overcoming this challenge lies in the extraction of relevant features, which can be achieved by applying mathematical methods in both the temporal and spectral domains8. While conventional machine learning (ML) algorithms such as Naive Bayes, decision trees, and support vector machines have been successful in categorizing various human behaviors9, manual feature extraction requires specialized knowledge or expertise, limiting its practicality. Moreover, such hand-crafted features often fail to capture the distinctive characteristics needed to differentiate complex actions. Fortunately, the introduction of convolutional layers in deep learning (DL) models has revolutionized the field by automating the feature extraction process2. This breakthrough empowers HAR with the capabilities of DL techniques, opening new possibilities for advancement in the field.

The convolutional neural network (CNN) model is known for its local connectivity and weight distribution mechanisms, resulting in a reduced number of parameters and faster training. Consequently, numerous studies have been published on sensor-based HAR utilizing CNN10,11. The effectiveness of CNN in extracting features and achieving accuracy is influenced by the depth and width of the network. A typical CNN comprises convolutional layers and pooling layers, which play a critical role in extracting feature maps essential for categorization. However, not all feature maps contribute significantly to accurately identifying targeted actions. CNN excels in capturing spatial representation from sensor data, while Recurrent Neural Networks (RNN) excel in capturing temporal representation. Therefore, combining CNN and RNN allows for a comprehensive representation of spatial and temporal features from sensor input. In a previous work by Ordonez et al.12, both CNN and RNN were employed for HAR. To further enhance the effectiveness of HAR, it is recommended to prioritize valuable feature maps while suppressing unreliable ones. This is addressed by the squeeze-and-excitation (SE) block13, which acts as a channel-attention mechanism. The SE block recalibrates each feature map by assigning a weight proportional to its significance in the identification process. Zhongkai et al.2 report the implementation of the SE block in CNN and/or RNN models, resulting in an improved efficacy of HAR.

The existing literature provides valuable inspiration for understanding how individual actions occur in spatial and temporal aspects. By leveraging this knowledge, we can analyze data from wearable sensors using abstract features to identify human behaviors. In this study, we propose a novel approach called ResNet-BiGRU-SE, which combines a hybrid CNN with a channel attention system, to recognize human activities based on sensor data. We conducted multiple experiments using different standard datasets for HAR to assess the effectiveness of our model. Our hybrid model surpasses previous DL models in terms of accuracy, as evidenced by its performance on evaluation metrics. Therefore, this study emphasizes the following key contributions:

  1.

    We developed a hybrid CNN embedded with a channel attention mechanism, called ResNet-BiGRU-SE, to extract deep spatio-temporal features hierarchically and distinguish human activities in daily living.

  2.

    Various CNN architectures have been employed as the underlying models for sensor-based HAR. To evaluate the performance of the ResNet-BiGRU-SE model, we compared its effectiveness with that of other CNN-based models on the HAR dataset. Additionally, we conducted a comparative analysis between our proposed approach and state-of-the-art models using three benchmark HAR datasets (UCI-HAR, WISDM, and IM-WSHA) for a fair assessment.

The remaining sections of the study are arranged as follows: Section “Related works” reviews research on sensor-based HAR based on DL and current frameworks; Section “Research methodology” describes the hybrid DL framework presented in this study for sensor-based HAR; Section “Experiments and results” describes the experimental setup, provides the experimental findings, and analyzes the outcomes; Section “Discussion” compares the proposed model with the state of the art; and Section “Conclusions” concludes the study and addresses future work.

Related works

HAR poses challenges as a time series classification problem, involving the prediction of an individual’s movements using sensory input. Typically, it necessitates extensive domain knowledge and signal processing techniques to extract appropriate features from raw data that align with a machine learning algorithm. DL methods, such as CNNs and Long Short-Term Memory Neural Networks (LSTMs), have demonstrated their effectiveness by automatically learning relevant features from raw sensory input, thereby achieving state-of-the-art performance14,15.

HAR aims to collect and recognize real-world actions performed by individuals or groups while considering the surrounding environmental factors. This field holds significant promise in the study of Human-Computer Interaction16,17 as it has the potential to revolutionize how humans interact with technology in the present era. The objectives of HAR can be categorized into five main areas: identifying fundamental movements, detecting everyday motions, recognizing unique events, forecasting caloric expenditure, and performing individual biometric recognition18. To achieve these goals, a variety of sensors can be utilized, including environmental sensors and wearable video cameras. In practice, wearable sensors often take the form of smartphones or sensors integrated into wearable devices.

While camera sensors can provide unique information not obtainable from other sensor types, including accurate data for human motion identification systems, they come with certain drawbacks. Camera-based systems require constant monitoring of individuals, resulting in the need for significant storage capacity and computational capability. Additionally, continuous surveillance by camera systems may cause discomfort or unease among individuals19. An example of a camera-based indoor human motion tracking system is presented by Zhou et al.20, showcasing continuous video monitoring and advanced video processing capabilities.

Ambient sensors offer the ability to monitor and record an individual’s interactions with their environment. In the experimental context of Zhan et al.’s study21, wireless Bluetooth acceleration and gyroscope sensors were employed to capture situational components and demonstrate their usage. Furthermore, room-side wired microphone arrays were utilized to detect ambient sound, while Reed switches were placed on doors, drawers, and shelves to detect their operation and generate contextual information. However, it should be noted that environmental sensors are limited to specific conditions and architectural configurations, rendering the HAR system non-universal. A well-designed and trained HAR system cannot be directly applied to a different environmental setting. Additionally, the implementation cost associated with these sensors tends to be relatively high.

Wearable technologies worn on the human body have the capability to recognize the physical aspects and characteristics of individuals’ everyday tasks. Inertial sensors such as accelerometers and gyroscopes, along with GPS and magnetic field sensors, are commonly used in applications for action identification. In specific studies, action identification is achieved by attaching one or more accelerometers to various regions of the human body. Dong and Biawas22 introduced a wearable sensor network designed for HAR, and Curone et al.23 utilized a tri-axial accelerometer worn on the body for action recognition.

Given the significant advancements made by DL across various ML applications, and considering the inherently multi-class nature of DL techniques, our review begins with a concise overview of DL for human activity detection. Wang et al.24 conducted a comprehensive review of 56 publications from 2011 onwards that utilized DL techniques, including deep neural networks, CNNs, RNNs, auto-encoders, and restricted Boltzmann machines, for sensor-based HAR. They found that no single model outperforms all others in every scenario, emphasizing the importance of selecting a model based on the specific application requirements. Additionally, they compared three benchmark datasets for HAR: the Opportunity dataset25, the Skoda dataset26, and the UCI-HAR dataset27 (collected using smartphones with multiple inertial measurement units). Among these datasets, they identified studies12,28,29,30 as representing the state of the art in DL for HAR utilizing inertial measurement units (IMUs).

Sophisticated HAR models benefit from complex and deeper structures, leading to improved accuracy compared to previous feature learning methods. These models utilize CNNs for automatic feature extraction. In the context of object identification, the CNN feature extractor is often referred to as the backbone. This term emphasizes that the architecture of the feature extractor and the overall model construction are evaluated separately and independently.

Instead of relying on basic models, researchers have developed sophisticated backbone models to enhance performance. Dong et al.31 introduced a combination of Hierarchical Cross-Filtering (HCF) and an inception module. Long et al.32 proposed a method of independently learning large-scale and small-scale networks and subsequently joining them. This approach incorporates two different sizes of residual blocks as crucial components. Tuncer et al.33 suggested utilizing a ResNet structure with multiple layers as feature extractors, with the extracted features cascaded to serve as the backbone. Ronald et al.34 presented the iSPLInception backbone, which is based on Inception-ResNet and utilizes a multichannel-residual hybrid architecture for HAR research. Mehmood et al.35 employed DenseNet as the backbone and leveraged dense connections for HAR purposes.

Research methodology

This research investigated sensor-based HARs using DL techniques to extract abstract characteristics from raw sensor data. As shown in Fig. 1, the explored HAR framework consists of four key process steps: data acquisition, data pre-processing, model training, and model assessment.

Figure 1. HAR workflow employed in this study.

Data acquisition

This section highlights the HAR datasets utilized in the evaluation of this study. For assessment purposes, three public datasets were included: UCI-HAR, WISDM, and IM-WSHA. These datasets consist of inertial data collected from smartphone sensors, with each dataset capturing information from a group of individuals as they performed their daily activities. Table 1 provides a comprehensive comparison of the three benchmark datasets used in this study.

UCI-HAR dataset

This paper utilizes the “UCI Human Activity Recognition Using Smartphone Dataset (UCI-HAR)”27 as the public activity dataset for the proposed approach. The UCI-HAR dataset comprises action data collected from a diverse group of 30 individuals with varying ages (19 to 48 years), genders, heights, and weights. Participants wore a smartphone at the waist while each performing six different everyday activities. The smartphone’s tri-axial accelerometer and gyroscope captured sensor data during the execution of these six predetermined tasks: triaxial linear acceleration and angular velocity were recorded at a constant rate of 50 Hz.

WISDM dataset

The WISDM dataset36 is a fundamental HAR dataset from the Wireless Sensor Data Mining Laboratory. It consists of 1,098,207 samples and captures the activity patterns of 36 individuals, encompassing activities such as walking and sitting. The data was collected by participants who carried an Android smartphone in their front leg pocket, using the device’s built-in accelerometer at a sampling rate of 20 Hz.

IM-WSHA dataset

The IM-Wearable Smart Home Activities (IM-WSHA) dataset37 is a comprehensive collection of signal data specifically designed to serve as a standard dataset for HAR. This dataset features three wearable Inertial Measurement Unit (IMU) sensors that capture three-axis accelerometer, gyroscope, and magnetometer data. The sampling frequency of the dataset is 100 Hz. To accurately capture individuals’ movement patterns during their daily activities, the IMU sensors were strategically positioned on different body parts, namely the thorax, femur, and wrist. The study involved ten participants, with an equal distribution of males and females, who performed a total of eleven distinct physical tasks within an indoor environment. These tasks encompassed various common activities such as walking, exercising, cooking, drinking, talking on the phone, doing laundry, watching television, studying, brushing hair, using a laptop, and vacuum-cleaning.

Table 1 A detailed comparison of three benchmark datasets used in this study.

Data pre-processing

The raw data acquired from sensors often contains measurement noise as well as additional unforeseen noise caused by the participant’s dynamic movements during data collection. The presence of noise distorts the usable information the signal carries, so it is crucial to reduce its influence and extract valuable information for further processing. Common filtering techniques used to address this issue include mean, low-pass, and wavelet filtering38,39. In our work, we employed a third-order low-pass Butterworth filter with a cutoff frequency of 20 Hz across all three axes of the accelerometer, gyroscope, and magnetometer for signal denoising. This choice of filter parameters is suitable for human motion signals, since components below 15 Hz account for 99.9% of the signal energy, making a 20 Hz cutoff an appropriate choice.
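For concreteness, this denoising step can be sketched in Python with SciPy as follows; the function name and the zero-phase filtfilt routine are illustrative assumptions, while the filter order (3) and the 20 Hz cutoff follow the description above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(signal: np.ndarray, fs: float, cutoff: float = 20.0, order: int = 3) -> np.ndarray:
    """Apply a third-order low-pass Butterworth filter along the time axis.

    signal : array of shape (n_samples, n_axes), e.g. tri-axial accelerometer data.
    fs     : sampling rate in Hz (50 for UCI-HAR, 100 for IM-WSHA).
    Note: the cutoff must stay below the Nyquist frequency fs / 2, so for the
    20 Hz WISDM recordings a lower cutoff would be required.
    """
    b, a = butter(order, cutoff, btype="low", fs=fs)
    # filtfilt runs the filter forward and backward for zero phase shift
    # (an assumption; the text does not name the exact filtering routine).
    return filtfilt(b, a, signal, axis=0)
```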

Once the noise was removed, the filtered sensor data underwent a transformation to prepare it for further analysis. In this phase, Min-Max normalization was employed to rescale each channel’s values to the range [0, 1]. This normalization is advantageous for learning techniques, as it places all input factors on a comparable scale.

During the data segmentation phase, the normalized data from all sensors is divided into equal-sized portions using fixed-size sliding windows. In this study, we chose a sliding window of 2.56 seconds, which yields fixed-length sequences of sensor readings (128 samples at the 50 Hz rate of UCI-HAR). These segmented portions are then used for model training.
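A minimal sketch of the normalization and segmentation steps is given below; the 50% window overlap and the majority-vote window labeling are illustrative assumptions, since the text fixes only the 2.56 s window length.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale each sensor channel to the [0, 1] range."""
    x_min = x.min(axis=0, keepdims=True)
    x_max = x.max(axis=0, keepdims=True)
    return (x - x_min) / (x_max - x_min + 1e-8)  # epsilon guards constant channels

def sliding_windows(x: np.ndarray, y: np.ndarray, win: int = 128, step: int = 64):
    """Cut signals into fixed-size windows (2.56 s at 50 Hz -> 128 samples).

    x: (n_samples, n_channels) normalized signals; y: per-sample activity labels.
    step=64 gives 50% overlap -- an assumption, not stated in the text.
    """
    segments, labels = [], []
    for start in range(0, len(x) - win + 1, step):
        segments.append(x[start:start + win])
        # Label each window by the majority activity within it (an assumption).
        values, counts = np.unique(y[start:start + win], return_counts=True)
        labels.append(values[np.argmax(counts)])
    return np.stack(segments), np.array(labels)
```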

The proposed hybrid convolutional neural network

This research proposes an effective recognition model called ResNet-BiGRU-SE that operates on motion signal data captured from smartphone sensors. The proposed method automatically learns discriminative features from the sensor data inputs. ResNet-BiGRU-SE consists of a convolutional block and eight hybrid residual blocks, which extract spatial features hierarchically. The model also includes a global average-pooling (GAP) layer, a flatten layer, and a fully connected layer, as illustrated in Fig. 2.

Figure 2. Detailed and unrolled architecture of the proposed hybrid convolutional neural network model.

Convolutional block

CNNs typically employ a predefined set of building blocks and are commonly utilized for supervised learning. In the fully connected layers of such networks, each neuron is connected to every neuron in the subsequent layer. The activation function converts a neuron’s input value into its output value; its effectiveness is influenced by two important factors: sparsity and the ability to maintain gradient flow to the network’s lower layers40. In CNNs, pooling is often used for dimension reduction. Both maximum and average pooling functions, referred to as max-pooling and average-pooling, are commonly utilized.

In this study, we utilized a convolutional block (ConvB) to process the raw sensor data and extract low-level features. The ConvB, as depicted in Fig. 2, consists of four layers: 1D-convolutional (Conv1D), batch normalization (BN), exponential linear unit (ELU), and max-pooling (MP). Conv1D employs multiple trainable convolutional kernels to capture different features, generating a feature map for each kernel. The BN layer is employed to stabilize and accelerate the training process, while the ELU layer enhances the model’s expressive capability. Additionally, the MP layer is used to reduce the size of the feature map while retaining the most significant characteristics.
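A hedged Keras sketch of the ConvB block follows; the filter count, kernel size, and pool size are illustrative assumptions, as the text specifies only the layer ordering (Conv1D, BN, ELU, MP).

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x: tf.Tensor, filters: int = 64, kernel_size: int = 5) -> tf.Tensor:
    """ConvB: Conv1D -> BatchNorm -> ELU -> MaxPooling, as described above.

    Filter count, kernel size, and pool size are illustrative assumptions.
    """
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)  # low-level features
    x = layers.BatchNormalization()(x)   # stabilizes and accelerates training
    x = layers.ELU()(x)                  # nonlinearity with negative saturation
    x = layers.MaxPooling1D(pool_size=2)(x)  # keeps the most salient responses
    return x
```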

Structure of gated recurrent unit

The gated recurrent unit (GRU) was developed as an RNN-based approach to prevent the exploding/vanishing gradient problem; however, the memory cells in the LSTM design result in higher memory consumption41. The GRU is a streamlined variant of the LSTM in which the separate memory cell is omitted42. As shown in Fig. 3a, a GRU network has an update gate and a reset gate that control how much each hidden state is updated, i.e., they determine which information flows to the next step and which does not. The GRU computes the hidden state \(h_t\) at time t from the output of the update gate \(z_t\), the reset gate \(r_t\), the current input \(x_t\), and the prior hidden state \(h_{t-1}\), as follows:

Figure 3. Structure of the bidirectional gated recurrent unit: (a) GRU cell and (b) unrolled BiGRU.

$$\begin{aligned} z_t = \sigma (W_z x_t \oplus U_z h_{t-1}) \end{aligned}$$
(1)
$$\begin{aligned} r_t = \sigma (W_r x_t \oplus U_r h_{t-1}) \end{aligned}$$
(2)
$$\begin{aligned} g_t = \tanh (W_g x_t \oplus U_g (r_t \otimes h_{t-1})) \end{aligned}$$
(3)
$$\begin{aligned} h_t = (1 - z_t) \otimes h_{t-1} \oplus (z_t \otimes g_t) \end{aligned}$$
(4)

where \(\sigma\) is the sigmoid function, \(\oplus\) denotes element-wise addition, and \(\otimes\) denotes element-wise multiplication.

Schuster and Paliwal43 introduced the bidirectional RNN (BiRNN) in 1997 to address a drawback of the conventional (unidirectional) RNN: the output at a given time step should incorporate not only the present input but also past and future context. This is achieved by training the network concurrently in the forward and backward directions, splitting its neurons into a part responsible for the forward pass and a part responsible for the backward pass. The forward neurons are not connected to the backward neurons, and vice versa. This results in the general structure depicted in Fig. 3b. The relevant computations are shown in the following equations:

$$\begin{aligned} \overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}) \end{aligned}$$
(5)
$$\begin{aligned} \overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t+1}) \end{aligned}$$
(6)
$$\begin{aligned} h_t = [\overrightarrow{h}_{t}, \overleftarrow{h}_{t}] \end{aligned}$$
(7)
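Equations (5)–(7) correspond directly to a bidirectional GRU layer; a minimal Keras sketch is given below, with the unit count as an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bigru_layer(x: tf.Tensor, units: int = 64) -> tf.Tensor:
    """Bidirectional GRU: the forward and backward hidden states of
    Eqs. (5)-(6) are concatenated along the feature axis, realizing Eq. (7)."""
    return layers.Bidirectional(
        layers.GRU(units, return_sequences=True),  # keep per-timestep outputs
        merge_mode="concat",
    )(x)
```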

Hybrid residual block

Commonly, simple DL algorithms employ convolution layers followed by fully connected layers for classification tasks, without incorporating shortcut connections. These architectures are known as sequential networks, where each layer passes data to the next layer. However, as the size of the sequential network increases, a challenge arises in the form of vanishing or exploding gradients. This can pose difficulties for the effective training of such networks.

To overcome this problem, ResNet utilizes residual blocks, which allow for skip connections between blocks of convolutional layers. These skip connections enhance gradient propagation and facilitate the training of increasingly deeper CNNs, mitigating the issue of gradient vanishing. A residual layer can be represented as follows:

$$\begin{aligned} \text {ELU}(x) = {\left\{ \begin{array}{ll} x & \quad \text {if } x \ge 0\\ \alpha (e^x - 1) & \quad \text {if } x < 0 \end{array}\right. } \end{aligned}$$
(8)
$$\begin{aligned} R(x) = \text {ELU}(x + f(x)) \end{aligned}$$
(9)

where x denotes the input, f(x) the output of the residual branch, ELU(x) the exponential linear unit function, and R(x) the output of the residual block. The residual element f(x) is generated in this block by two consecutive repetitions of a trio of operations: convolution with a filter of size 3\(\times\)1, batch normalization, and ELU activation. The f(x) feature map is then added element-wise to the input x, and the ELU activation function is applied to the sum.
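Equations (8) and (9) translate into the following Keras sketch of a plain residual block; the 1×1 projection on the shortcut, used when channel counts differ, is an assumption, as the text leaves the shortcut unspecified.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x: tf.Tensor, filters: int = 64) -> tf.Tensor:
    """Residual block R(x) = ELU(x + f(x)) from Eqs. (8)-(9): f(x) is two
    repetitions of Conv1D(kernel 3) -> BatchNorm -> ELU."""
    shortcut = x
    fx = x
    for _ in range(2):
        fx = layers.Conv1D(filters, 3, padding="same")(fx)
        fx = layers.BatchNormalization()(fx)
        fx = layers.ELU()(fx)
    if shortcut.shape[-1] != filters:
        # 1x1 projection so the shapes match for the addition
        # (an assumption; the text leaves the shortcut unspecified).
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.ELU()(layers.Add()([shortcut, fx]))
```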

In order to extract hybrid features hierarchically by incorporating both spatio-temporal and channel-wise data, we introduced the SEResidual block based on previous work by Muqeet et al.44. As depicted in Fig. 4, this residual block consists of Conv1D layers, BN layers, ELU layers, SE modules, and shortcut connections with BiGRU. The inclusion of SE modules enhances the model’s representational capacity by incorporating channel attention.

Figure 4 illustrates the construction of an SE component. After a convolution operation, several feature maps are produced; however, some feature maps may contain redundant information. The SE module performs feature recalibration to enhance the discriminative information and suppress the less valuable aspects. This module has two primary phases: squeeze and excitation.

Initially, the squeeze phase aggregates the information of each channel. For a feature tensor U of size C\(\times\)H\(\times\)W, each channel corresponds to an H\(\times\)W feature map. Using a channel descriptor function such as global average pooling (GAP), the feature map of each channel is compressed into a 1\(\times\)1 feature map45, yielding a scalar value that reflects the channel globally. The procedure is given by Eq. (10), where \(U_c(i, j)\) is the feature map of channel c after the convolution layer has been applied to X, and \(F_{squeeze}\) is the channel descriptor function; GAP was employed in this investigation.

$$\begin{aligned} Z_c = F_{squeeze}(U_c) = \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W}U_c(i, j) \end{aligned}$$
(10)

The channel-wise dependencies are then modeled in the excitation stage using the per-channel descriptors acquired in the squeeze stage. This is accomplished with fully-connected (FC) layers and nonlinear functions. Equation (11) describes the excitation stage, where z is the result of the squeeze stage, \(W_i\) denotes the weights of the i-th FC layer, \(\sigma\) is the sigmoid function, and \(F_{excite}\) is the excitation mechanism. Owing to the sigmoid, the output of the excitation stage lies between 0 and 1 and can therefore be employed as a calibration weight. The current feature map U is multiplied by the newly derived weight s. The design of the squeeze and excitation stages in the SE block is shown in Fig. 4, along with the operation of the SE component implemented in this investigation.

$$\begin{aligned} s = F_{excite}(z, W) = \sigma (g(z, W)) = \sigma (W_2 \text {ReLU}(W_1z)) \end{aligned}$$
(11)

In the final step, the recalibrated output is passed on to the rest of the network: \(\widetilde{X} = [\widetilde{x}_1, \widetilde{x}_2, \ldots , \widetilde{x}_C]\), where \(\widetilde{x}_c = s_c U_c\) denotes the channel-wise multiplication of the scalar \(s_c\) and the feature map \(U_c\). This procedure supplies adjustable weights to the feature channels, which is the essence of the SE block46.
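Under these definitions, the SE module amounts to only a few Keras layers; the sketch below assumes 1D feature maps of shape (time, channels) and a reduction ratio of 8, which the text does not report.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_module(u: tf.Tensor, reduction: int = 8) -> tf.Tensor:
    """Squeeze-and-excitation over a 1D feature map of shape (time, channels).

    Squeeze: GAP compresses each channel to a scalar (Eq. 10).
    Excitation: two FC layers with ReLU and sigmoid yield per-channel
    weights s in (0, 1) (Eq. 11), which rescale the input channels.
    """
    channels = u.shape[-1]
    z = layers.GlobalAveragePooling1D()(u)                    # squeeze
    s = layers.Dense(channels // reduction, activation="relu")(z)
    s = layers.Dense(channels, activation="sigmoid")(s)       # excitation
    s = layers.Reshape((1, channels))(s)                      # broadcast over time
    return layers.Multiply()([u, s])                          # channel-wise rescaling
```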

Figure 4. The residual block with SE module.

Hyperparameters

The DL process relies on the configuration of hyperparameters, which govern the learning procedure. For the ResNet-BiGRU-SE model, the following hyperparameters were used: (1) learning rate (\(\alpha\)), (2) number of epochs, (3) batch size, (4) optimization method, and (5) loss function. The learning rate \(\alpha\) was initially set to 0.001. Training ran for up to 200 epochs with batches of size 128. If the validation loss did not improve for 30 consecutive epochs, a predefined callback stopped the training early; in addition, whenever the validation accuracy did not improve for six further epochs, the learning rate was reduced to 75% of its value. To minimize errors, the Adam optimization algorithm47 was employed with \(\beta _1\) = 0.9, \(\beta _2\) = 0.999, and \(\epsilon\) = 1 \(\times\) \(10^{-8}\). For the loss, the categorical cross-entropy function48 was utilized, as it has demonstrated superior performance compared to classification error and mean square error metrics.
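These settings map onto standard Keras training calls and callbacks, as sketched below. The model object and the pre-segmented arrays X_train, y_train, X_val, and y_val are assumed to be defined elsewhere, and the monitored quantities follow a plausible reading of the schedule described above.

```python
import tensorflow as tf

# `model`, `X_train`, `y_train`, `X_val`, `y_val` are assumed to exist
# (the model being the ResNet-BiGRU-SE network).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3,
                                       beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Stop early if the validation loss fails to improve for 30 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=30),
    # Cut the learning rate to 75% of its value after six further epochs
    # without improvement in validation accuracy.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                         factor=0.75, patience=6),
]

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=200, batch_size=128, callbacks=callbacks)
```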

Cross validation method

The k-fold cross-validation (k-CV) technique is a valuable method for estimating the performance of a classification model using multiple data subsets49. This approach involves randomly dividing a dataset, obtained from either a single individual or multiple participants, into k non-overlapping subsets of approximately equal size. Each subset is then used to evaluate the classification model trained on the remaining k - 1 subsets. The overall effectiveness of the model is determined by computing the mean value of performance measures such as accuracy, precision, recall, and F-measure, obtained from the k-CV50. It’s worth noting that this approach can be computationally demanding, particularly when dealing with large sample sizes or high values of k. In this study, we applied the k-CV technique with k set to 5, as depicted in Fig. 5, to assess the performance of the models.
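A hedged sketch of the 5-fold protocol with scikit-learn follows; build_model is a hypothetical constructor standing in for a compiled ResNet-BiGRU-SE network, and the shuffling seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# X: segmented windows of shape (n_windows, 128, n_channels);
# y: one-hot activity labels. `build_model` is a hypothetical constructor.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, test_idx in kfold.split(X):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=200, batch_size=128, verbose=0)
    y_pred = np.argmax(model.predict(X[test_idx]), axis=1)
    fold_scores.append(accuracy_score(np.argmax(y[test_idx], axis=1), y_pred))

print(f"Mean 5-fold accuracy: {np.mean(fold_scores):.4f}")
```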

Figure 5. The k-CV technique.

Performance measurement

In order to evaluate the effectiveness of the proposed DL model, we employed a 5-CV procedure. This technique enables us to comprehensively assess the model’s performance using four widely-used evaluation metrics: accuracy, precision, recall, and F-measure. The mathematical equations representing these four assessment indicators are provided below:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(12)
$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(13)
$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(14)
$$\begin{aligned} F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$
(15)
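These four metrics can be computed directly with scikit-learn, as sketched below; macro averaging over classes is an assumption, since the averaging scheme is not stated.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    """Compute the four metrics of Eqs. (12)-(15) from integer label vectors.

    Macro averaging over classes is an assumption; the text does not state
    the averaging scheme used.
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f_measure": f1_score(y_true, y_pred, average="macro"),
    }
```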

The four measures discussed here are commonly employed to evaluate the effectiveness of sensor-based HAR. For a given category, samples of that category that are correctly recognized are true positives (TP), while samples of all other categories that are correctly rejected are true negatives (TN). Sensor data from another category misclassified as the considered category results in a false positive (FP), whereas sensor data from the considered category misclassified as another category results in a false negative (FN). The pseudo-code for the HAR algorithm used in this study is described in Algorithm 1.

Algorithm 1. Pseudo-code of the sensor-based HAR procedure used in this study.

Experiments and results

In this section, we present the studies conducted to determine the most efficient CNN models for sensor-based HAR. Our research focused on three benchmark smartphone sensing datasets, namely UCI-HAR, WISDM, and IM-WSHA datasets, which are commonly used for HAR tasks. The performance of the DL models was evaluated using accuracy and F-measure, which are widely recognized metrics for assessing model effectiveness in HAR applications.

In the investigation, we compared the CNN backbone models VGG1651, ResNet1852, PyramidNet1853, Inception-V354, Xception55, and Inception-ResNet34. These models were originally proposed for image recognition; consequently, we adapted their architectures for HAR. Furthermore, the recognition capabilities of these CNN models and our proposed model are compared.

Experiment setting

This research utilized Google Colab Pro+ with a Tesla V100-SXM2-16GB GPU to accelerate the training of the DL models. The ResNet-BiGRU-SE model and the other baseline DL models were implemented in Python with TensorFlow and CUDA backends. The experiments relied on the following Python libraries:

  • NumPy and Pandas were used for retrieving, processing, and analyzing the sensor data.

  • Matplotlib and Seaborn were applied for charting and presenting the results of data exploration and model evaluation.

  • Scikit-learn (sklearn) was utilized for data splitting, sampling, and dataset preparation in the investigations.

  • TensorFlow, Keras, and TensorBoard were used to build, train, and monitor the DL models.

Experimental results

In this study, we assessed the proposed framework by comparing it to baseline DL algorithms using three publicly available datasets: UCI-HAR, WISDM, and IM-WSHA. The following subsections present the experimental findings of these DL methods trained on smartphone sensing data from these benchmark datasets.

Table 2 Recognition performance of DL models on the UCI-HAR dataset.

In the first experiment, we evaluated the performance of the proposed ResNet-BiGRU-SE model on the UCI-HAR dataset. The results are summarized in Table 2. The findings indicate that the proposed model outperforms the other CNN models, achieving an average accuracy of 98.92% and an F-measure of 98.99%. It is noteworthy that the proposed model has only 127,814 trainable parameters, demonstrating the efficiency of its compact design.

Table 3 Recognition performance of DL models on the WISDM dataset.

The results presented in Table 3 are obtained from the second experiment, conducted on the WISDM dataset. These findings demonstrate that the proposed ResNet-BiGRU-SE model outperforms the other CNN models, achieving an average accuracy of 98.80% and an F-measure of 98.62%. Despite its superior performance, the proposed model has only 126,854 trainable parameters, highlighting the parameter efficiency of its design.

Table 4 Recognition performance of DL models on the IM-WSHA dataset.

The findings from the third investigation, which utilized the IM-WSHA dataset, are summarized in Table 4. The results clearly demonstrate that the ResNet-BiGRU-SE model outperforms other CNN models, achieving a remarkable accuracy rate of 98.45% and an F-measure of 97.60%. These results highlight the superior performance of the ResNet-BiGRU-SE model in accurately classifying activities based on the IM-WSHA dataset.

Discussion

Comparison results with state-of-the-art models

We conducted a comprehensive comparison of our proposed model with state-of-the-art DL models in the field of sensor-based HAR. In Table 5, we compare our ResNet-BiGRU-SE network with several other DL techniques, namely 1D-CNN56, Bidir-LSTM57, CNN-LSTM58, SDAE59, and CNN-GRU60. Each of these models was implemented in accordance with its respective study description.

Notably, our suggested ResNet-BiGRU-SE model achieved an outstanding success rate of 98.92% on the UCI-HAR dataset, surpassing the performance of all the other models. The comparative results are presented in Table 5, providing a clear illustration of the superior performance of our proposed ResNet-BiGRU-SE model in comparison to the other models.

Table 5 Comparison results of the proposed model and previous works using UCI-HAR dataset.

In our evaluation using the WISDM dataset, we compared our proposed model with state-of-the-art DL algorithms for sensor-based HAR. Table 6 provides a comprehensive comparison of our ResNet-BiGRU-SE network with several other DL approaches, including LSTM61, CNN with statistical features14, U-Net62, and CNN-GRU60. Each of these models was implemented based on the descriptions provided in their respective publications.

Remarkably, our suggested ResNet-BiGRU-SE model achieved an impressive accuracy of 98.80% on the WISDM dataset, outperforming all the other models. This outstanding performance further highlights the superiority of our proposed ResNet-BiGRU-SE model in accurately classifying activities based on the WISDM dataset.

Table 6 Comparison results of the proposed model and previous works using WISDM dataset.

The findings from our study strongly support our hypothesis that our hybrid DL model, which combines local spatio-temporal characteristics with long-term contextual understanding, improves the comprehension of sensor data and ultimately enhances the performance of activity classification. Additionally, the results suggest that deep residual models exhibit favorable performance when applied to raw signals. However, the inclusion of BiGRU and SE modules further enhances the effectiveness of HAR for real-life human motion detection. These findings highlight the significance of incorporating both architectural enhancements and feature extraction techniques in order to achieve optimal results in HAR applications.

Table 7 provides a comprehensive evaluation of various advanced techniques on the IM-WSHA dataset. One approach utilized a reweighted genetic algorithm (GA) to combine statistical and frequency features extracted in a previous study63, resulting in an accuracy of 81.92%. Another approach employed a random forest model with stochastic gradient descent (SGD) optimization64, achieving a recognition accuracy of 90.18%. Remarkably, when applied to the IM-WSHA dataset, the ResNet-BiGRU-SE model achieved an identification accuracy of 98.45%. These results highlight the superiority of our proposed model over other advanced techniques in accurately identifying human activities in the IM-WSHA dataset.

Table 7 Comparison results of the proposed model and previous works using IM-WSHA dataset.

Influence of validation methods

Sensor-based HAR studies commonly employ three validation techniques: hold-out validation65, k-CV, and Leave-One-Subject-Out cross-validation (LOSO)50. Hold-out validation involves dividing the dataset into a training set (typically 70% of the data) and a test set (30% of the data). On the other hand, k-CV repeatedly partitions the dataset into k subsets for training and testing, evaluating the algorithm’s performance k times. LOSO validation creates a training set with n - p samples and a testing set with p samples, where p represents all data from a single subject. This approach ensures that there is no overlap between subjects in the training and testing sets.

To assess the impact of these validation techniques, we conducted supplementary investigations using three HAR datasets: UCI-HAR, WISDM, and IM-WSHA. We evaluated the effectiveness of the ResNet-BiGRU-SE model and presented the results in Fig. 6. These evaluations allowed us to determine how different validation techniques influenced the performance of our proposed model in HAR tasks.

Figure 6. Comparative results of the proposed ResNet-BiGRU-SE model under different validation methods.

The findings presented in Fig. 6 clearly demonstrate the significant impact of validation approaches on the effectiveness of HAR. Among the three benchmark datasets, the ResNet-BiGRU-SE model, which utilized k-CV, achieved the highest levels of accuracy. However, it is important to note that this result may be influenced by the fact that the k-CV method does not consider scenarios where all samples come from a single study participant. This issue arises due to the time series segmentation used in the pre-processing phase. In a generalized HAR implementation, independently dividing the dataset can lead to instances where a participant’s data appears in both the training and test sets simultaneously, causing data leakage that artificially inflates the classifier’s accuracy.

On the other hand, when adopting the LOSO approach, which takes individual-specific data (i.e., subject labels) into account, the accuracy of the classification model decreases. This stricter assessment approach resulted in a 12% reduction in accuracy, indicating that the subject-independent performance had previously been overestimated. It is important to consider these factors when selecting a validation approach to ensure accurate and reliable performance evaluation in HAR tasks.

Misclassification

To analyze the misclassification patterns of the suggested model, we conducted a comprehensive examination of the confusion matrices generated by the ResNet-BiGRU-SE model on three different HAR datasets: UCI-HAR, WISDM, and IM-WSHA. These datasets contain activity data collected from a variety of sensor categories, as summarized in Table 1. By studying the confusion matrices, we can gain insights into the specific activities that are frequently misclassified by the model and identify potential areas for improvement.

Regarding the UCI-HAR dataset, the categories of “sitting” and “standing” exhibited the highest frequency of misclassifications, which can be attributed to the similarity in linear acceleration patterns observed in these static actions66. However, the ResNet-BiGRU-SE model performed well in accurately categorizing the other four activities, as shown in Fig. 7a. Moving on to the WISDM dataset, Fig. 7b presents the confusion matrix of the model. The classification of “walking upstairs” and “walking downstairs” produced the highest number of errors, likely because these two activities yield similar accelerometer patterns; gyroscope sensor data has been shown to play a crucial role in distinguishing between such actions67. Lastly, the confusion matrix depicted in Fig. 7c reveals that the ResNet-BiGRU-SE model applied to the IM-WSHA dataset encountered misclassifications primarily in hand-oriented actions such as “cooking,” “drinking,” and “brushing hair.” However, the model demonstrated high accuracy in classifying the other, more distinct behaviors.

Figure 7. Confusion matrices of the proposed ResNet-BiGRU-SE model on the (a) UCI-HAR, (b) WISDM, and (c) IM-WSHA datasets.

Conclusions

This study focuses on examining the recognition performance of CNN-based classifiers with diverse topologies for sensor-based HAR. We conducted experiments using three widely-used HAR datasets: UCI-HAR, WISDM, and IM-WSHA, to thoroughly investigate the effectiveness of CNN-based models. The findings indicate that the ResNet architecture stands out as a suitable choice for HAR compared to other backbone architectures. In particular, we introduce a lightweight residual network, named ResNet-BiGRU-SE, specifically designed for sensor-based HAR. The proposed model’s efficiency was evaluated using the three datasets. By employing the 5-CV technique, our results demonstrate the superiority of our proposed model, achieving an impressive average recognition performance of 98.92% for UCI-HAR, 98.80% for WISDM, and 98.45% for IM-WSHA. Furthermore, our suggested model exhibits a reduced number of training parameters compared to previous HAR models. Moving forward, our future work aims to expand the scope of our research by including scenarios with larger populations, diverse locations, various sensor types, and activities of daily living (ADLs).