Intelligent system for human activity recognition in IoT environment

In recent years, the adoption of machine learning has grown steadily in different fields, affecting the day-to-day decisions of individuals. This paper presents an intelligent system for recognizing human daily activities in a complex IoT environment. An enhanced capsule neural network model called 1D-HARCapsNet is proposed. The proposed model consists of a convolution layer, a primary capsule layer, an activity capsule flat layer, and an output layer. It is validated using the WISDM dataset, which was collected via smart devices and balanced using the Random-SMOTE algorithm to handle the imbalanced behavior of the dataset. The experimental results indicate the potential and strengths of the proposed 1D-HARCapsNet, which achieved an accuracy of 98.67%, precision of 98.66%, recall of 98.67%, and F1-measure of 0.987. This is a major performance enhancement compared to the conventional CapsNet (accuracy 90.11%, precision 91.88%, recall 89.94%, and F1-measure 0.93).


Introduction
An Intelligent Decision Support System (IDSS) is an ideal approach for solving many of the challenges currently facing the world. With the popularity and wide spread of Machine Learning (ML) algorithms, and with easy access to big datasets covering all aspects of our life, creating an IDSS has become easier and faster, which helped in fighting the COVID-19 virus [1]. IDSS helps physicians detect the virus at an early stage, which increases the patients' probability of survival. Moreover, recognizing patients' hand gestures is a popular application of IDSS in the field of smart healthcare systems. It alerts the staff to the patients' requests in time, without delays, in remote monitoring environments [2]. The importance of IDSS in the medical field is especially appreciated in poor countries, where the healthcare service is very weak and, in some places, does not exist. IDSS can fill the gaps in these services by providing timely and cheap service without the need for expensive equipment and trained personnel.
Nowadays, there is tremendous growth in IoT-enabled devices for empowering decision-making processes in complex systems. The fast development and miniaturization of sensors and their reduced power requirements have led to a revolution in the field of Human Activity Recognition (HAR). Detection of early signs of critical diseases like diabetes [3] and heart disease [4], and even detection of early signs of COVID-19 using smart watches' sensor data [5], has become a reality.
One trend that has gained importance recently is the move from a one-size-fits-all approach in medicine to Personalized Health Care (PHC) and medicine [6,7]. This happened due to the growth of the aging population and the rising costs of chronic diseases. Therefore, a new solution is needed for this problem. This solution should include new ways to monitor and measure the vital signs of every patient in order to tailor the medication plan to specific needs. This can be achieved by using ML and the Internet of Things (IoT), with suitable sensors around the patient that continuously send data to doctors and hospitals so that they can make informed decisions. Such information is used to help the beneficiaries control their daily-life activities [8].
The idea behind this paper is to develop an IDSS for automatically collecting and classifying daily-life activities by integrating the power of IoT with ML algorithms. This provides the things in this system with the intelligence to sense, understand, and act according to the information collected through the sensors installed on personal smart phones. The rest of this paper is organized as follows. "Related work" provides some related works. "The proposed model" discusses the proposed work. "Evaluation and results" presents an evaluation of the proposed model and discusses the results. "Conclusion and future work" provides conclusions and suggested future work.

Related work
Dorgham et al. [9] proposed a modern hybrid evolutionary approach that incorporates Genetic Algorithm (GA) with efficient evolutionary techniques. A Decision Support System (DSS) was implemented to assist hospital personnel in the assignment operation. The authors demonstrated the efficacy of the proposed approach to solve many benchmark instances recorded in the literature relevant to the smart health-care system using a true deep experimental analysis. In addition, their hybrid algorithm outperforms powerful approaches from the literature that have the best-known outcomes.
Zhou et al. [10] proposed a HAR model based on the Long Short-Term Memory (LSTM) Deep Learning (DL) algorithm for empowering the Internet of Healthcare Things (IoHT). It used a deep Q-network for automatic labeling of data, with a distance-based reward, to handle the lack of labeled data. Then, the user's body sensor data and environmental data were fused and fed to the model. The results showed that this work outperformed other approaches like SVM, DNN, and Random Forest (RF), with a ROC curve value of up to 0.95.
Anguita et al. [11] proposed a system based on Support Vector Machines (SVM). The data was collected using a smartphone (Samsung Galaxy S2) while the participants performed different activities: laying, walking, sitting, standing, walking upstairs, and walking downstairs. The experiments compared two versions of SVM. The first version, Multi-Class SVM, achieved 89.3% accuracy in predicting the six classes, whereas the second version, Multi-Class Hybrid Fusion SVM, achieved 89% accuracy.
Murad et al. [12] suggested using a deep recurrent neural network (DRNN) model. This model helps capture long-range relations in the input data rather than being restricted to the size of the kernel window. The model uses three different architectures: unidirectional, bidirectional, and cascading. On the UCI-HAR dataset, DRNN reached the highest accuracy of 96.7%, compared to 96% for SVM and 95.2% for the convolutional neural network (CNN), and outperformed the other compared methods (SVM, K-nearest neighbor, and CNN).
Another study proposed an approach for HAR using Deep Belief Networks (DBNs), which are built by sequentially stacking multiple Restricted Boltzmann Machines (RBMs) [13]. The authors used a deep activity recognition model with three layers of one thousand neurons each. The results showed that their approach is better than the traditional methods. Their results also showed that a hybrid of DL and a Hidden Markov Model (HMM) achieved a recognition accuracy of 99.13%.
Chen and Xue [14] presented a CNN model for HAR that modifies the convolution kernel to adapt to the characteristics of tri-axial acceleration signals. The results showed that their model achieved an accuracy of 93.8% without any feature extraction, based on a dataset of 31,688 samples gathered from nine activities.
Qin et al. [15] proposed a unique architecture for HAR that utilized data from multiple sensors. This system converts time series data collected from sensors into images. These transformed images were used to keep required features and patterns for the task of HAR. For enabling the model to be trained and evaluated on the collected data from different sensors, the authors used a fusion residual network by merging two networks and training different data pixelwise correlations. This model provided state-of-the-art performance with an accuracy of 93.41% on HHAR dataset and 98.5% on MHEALTH dataset.
Xia et al. [16] proposed a deep learning model that fuses LSTM layers with convolution layers to extract the activity attributes without human interference in the feature selection process and to classify them correctly. This model collected smartphone sensor data and fed it to a two-layer LSTM followed by the convolution layers. The model was evaluated on three public datasets. It achieved an accuracy of 95.85%, 95.78%, and 92.63% on the WISDM, UCI-HAR, and OPPORTUNITY datasets, respectively.
Irvine et al. [17] proposed a data-driven HAR classifier built as an ensemble of neural networks (NNs) for improving the quality of the public datasets. They used an ensemble of four NNs, which were generated and integrated using support function fusion. They introduced different approaches for handling the disputes between the different models. The final ensemble model achieved the best performance, reaching an accuracy of 80.39%.
Mliki et al. [18] proposed an approach to HAR using noninvasive means depending on UAV-taken video sequence of human movement. This approach consists of two stages. The first is an offline stage that generates two CNN models (i.e., human/ non-human and the human activity model). The second is the inference stage that is concerned with indicating humans and their activities by adapting CNN. This system outperformed other methods on UCF-ARG dataset with an accuracy of 56% using instance classification and 68% on the entire sequence of frame classification.
Soleimani et al. [19] proposed a new method called Subject Adaptor Generative Adversarial Network (SA-GAN). This method helps in handling the lack of sufficiently large labeled data. The proposed model used the GAN framework to perform cross-subject transfer learning in the domain of HAR, based on data collected from wearable devices. In more than 66% of experiments, the model outperformed the other compared approaches, while in the remaining 25% of experiments, it came in second. In some cases, this work reached up to 90% of the accuracy obtained by supervised training over the same domain data.
Mazzia et al. [20] presented a modified version of capsule networks by substituting the dynamic routing with a novel non-iterative, highly parallelizable routing algorithm that can handle a smaller number of capsules with ease. Extensive testing with other capsule implementations has shown the efficacy of their approach and the potential of capsule networks to effectively embed more generalizable visual representations.
Jiang et al. [21] used an artificial neural network (ANN) to approximate the time-dependent distributions of non-Markovian models using solutions of much simpler time-inhomogeneous Markovian models; the approximation does not increase the model's dimensionality while also allowing the kinetic parameters to be inferred. This network is trained using a small number of noisy measurements derived from experimental data or stochastic simulations of the non-Markovian model. They showed that the Markovian models learned by the NN accurately reflect the stochastic dynamics across parameter space, using a range of models where the delays are caused by transcriptional processes and feedback control.
Attal et al. [22] applied and compared several ML approaches for HAR: k-Nearest Neighbor (kNN), SVM, Gaussian Mixture Models (GMM), RF, k-Means, and HMM. The dataset contains the main daily living human activities, such as walking, lying, and standing. It was collected using three wearable inertial accelerometers placed on the human body. Both raw data and extracted/selected features were used as input for the classifiers. The results showed that kNN has the highest performance among all compared approaches. They also showed that HMM has the best performance among the compared unsupervised classifiers.
Shoaib et al. [23] collected data from 13 human activities performed indoors. In these experiments, each participant had a mobile phone in his right pocket and another at his right wrist. Three motion sensors at the wrist and pocket positions were evaluated based on different scenarios. The authors extracted different features for these sensors over different window sizes without overlap. They used the Scikit-learn toolkit for analyzing the performance. Naive Bayes (NB), KNN, and decision tree classifiers were applied for simple and complex activity recognition, with ten-fold stratified cross-validation. The results showed that combining the data from the sensors at the pocket and wrist positions yields a relatively small improvement in recognition. They also showed that increasing the window size improves the recognition of various complex activities, whereas it has a limited effect on simple activities.
Garcia et al. [24] presented an ensemble called EkVN for HAR. It combines kNN, Decision Tree, and NB, and is based on heuristic hand-crafted feature extraction. The features were extracted from accelerometer, magnetometer, and gyroscope sensors. The results showed that the accuracy of EkVN is more sensitive to data from different users than to the window size and the overlapping factor. The same authors [25] also presented a multi-classification approach for HAR called EAE, which uses an ensemble of Auto-Encoders (AEs). In EAE, each AE is associated with one label/activity and is trained only with data of that class to reconstruct the sensor measurements. EAE can be updated with the user's data when the loss drops below a known value. The experimental results on the WISDM, MHealth, and PAMAP2 HAR datasets showed that EAE is efficient and competitive among all compared works. They also showed that the structure of this modular classifier permits more flexible models.
Dua et al. [26] developed a DNN-based model that uses a CNN together with a Gated Recurrent Unit as an end-to-end model that performs automatic feature extraction and activity classification. The raw data from wearable sensors is used without any pre-processing or customized feature extraction. This work achieved 96.20%, 97.21%, and 95.27% on the UCI-HAR, WISDM, and PAMAP2 datasets, respectively. Overall, the results showed that the suggested model outperformed the other compared works.
Rashid et al. [27] proposed a low-power, edge device-friendly Adaptive CNN for energy-efficient HAR called AHAR. During the inference phase, AHAR employs an adaptive design that chooses a component of the baseline design to use. Two datasets, Opportunity and w-HAR, were used to validate the work for categorizing locomotor activities. This work achieved a weighted F1 score of 91.79% and 91.57%, respectively, when compared to fog/cloud computing techniques for the first dataset. It also achieved an F1 score of 97.55% and 97.64%, respectively, on the w-HAR dataset. Compared to the other works on both datasets, this work is much more energy-efficient (422.38 × less) and memory-efficient (14.29 × less).
Mekruksavanich et al. [28] proposed a revolutionary hybrid model called CNN-LSTM to handle HAR problem. It is a deep learning multichannel architecture. Using DHA public dataset of smart-watch accelerometer, the results proved that this model exceeds other compared deep learning approaches in terms of different performance measures. It achieved 96.87% accuracy.
For the HAR challenge, Athavale et al. [29] presented a pre-trained VGG16 model. This CNN model is used to learn the deep features. The human activity signal, which is recorded by the accelerometer sensor of a smartphone, was classified using VGG16. The features were extracted from the fifth max-pooling layer of the VGG16 CNN model and fed to an SVM; that is, the fully connected layer of the model was replaced by the SVM classifier. This work achieved 79.55% accuracy and 71.63% F-score on the UniMiB dataset, which includes samples of everyday human activities.
Shang et al. [30] proposed a WiFi-based HAR system. This system can determine different activities via the Channel State Information (CSI) from WiFi devices. They presented a special deep learning framework, LSTM-CNN, which can automatically extract features from the temporal and spatial domains. The authors demonstrated the effectiveness of their work in classifying different activities. The experimental results also showed that this work is better than the compared models on HAR from CSI data; it achieves an average accuracy of 94.14% in multi-activity classification.
Poma et al. [31] presented a way to search for the best number of filters for each convolution layer of a CNN. In addition, to identify the parameters of the fuzzy system memberships, they applied the Fuzzy Gravitational Search Algorithm approach. The ORL dataset is used, which contains images of 40 different human faces with ten images for each face. The results showed that this work achieves a high percentage of recognition.

The proposed model
This paper proposes an intelligent decision support system for recognizing human daily activities that feeds the sensing data to the recognition model after handling its class imbalance. Figure 1 shows our overall proposed framework. It has three steps:
• Data collection: Tri-axial accelerometers integrated in the smartphone are used for gathering 3D time-series data that represent the linear acceleration based on vibration in three directions X, Y and Z. Our model uses the raw Wireless Sensor Data Mining (WISDM) dataset [32].
• Balancing dataset: This is done by applying the random oversampling technique to handle the issues of the biased dataset.
• Activity recognition: A modified version of the 1-D capsule neural network is used to recognize the activities that were performed and to notify the user of the activity class in accordance with the sensor's readings.

Using over-sampling for balancing the dataset
In the WISDM dataset [32], the samples that represent the walking and jogging activity classes outnumber the samples of the other classes by a large margin. Because this imbalance adversely affects the performance of the classifier, the Random-SMOTE algorithm [33] is used to increase the number of minority-class samples until the balanced ratio of 1:1 is reached. This is done by randomly selecting examples from the minority class and adding them to the training dataset. For a dataset that has N attributes, taking an attribute n as a sample, the new value is randomly generated using the Random-SMOTE algorithm [33].
Fig. 1 The overall proposed framework
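The balancing step can be illustrated with a minimal sketch. It is not the exact Random-SMOTE procedure of [33]; it only assumes the general SMOTE idea of creating synthetic minority samples by random interpolation between existing samples of the same class. The function name `oversample_minority` and the toy data are hypothetical.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Balance classes to a 1:1 ratio with interpolated synthetic samples.

    Illustrative SMOTE-style sketch, not the exact Random-SMOTE of [33].
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()  # every class is grown to the majority size
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        members = X[y == cls]
        # Pick two random members of the class for each synthetic sample.
        a = members[rng.integers(0, count, size=deficit)]
        b = members[rng.integers(0, count, size=deficit)]
        # One random interpolation coefficient in [0, 1) per new sample.
        gap = rng.random((deficit, 1))
        X_parts.append(a + gap * (b - a))
        y_parts.append(np.full(deficit, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Toy data (hypothetical): six samples of class 0, two of class 1.
X_toy = np.arange(24, dtype=float).reshape(8, 3)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = oversample_minority(X_toy, y_toy)
```

In this sketch every synthetic point lies on the segment between two real minority samples, so the minority class grows without leaving the region its samples already occupy.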

Proposed 1D capsule neural network for HAR
A capsule neural network (CapsNet) is a recently developed machine learning architecture that was introduced in [34] as a development of the CNN. The idea behind its architecture came from adding structures known as "capsules" to a CNN. Capsules are groups of neurons that are activated when a set of attributes is related to a class activity. Usually, an artificial neuron produces a single scalar value, which is related to the probability of the existence of the class in the feature vector. In CapsNet, the scalar output is replaced with vector-based capsules. The output of the higher capsule (parent) is computed by the scalar product of the coefficient representation of the probabilities of its related lower capsules (children). The closer the child is to the parent, the higher the coefficient between the parent and the child. In this paper, we propose the 1D-HARCapsNet model as a modified version of the 1D capsule neural network presented by Suri and Gupta [35]. The proposed model is applied for recognizing human activities based on the immediate observations of human actions. Instead of using a single convolutional layer, the 1D-HARCapsNet architecture implements three levels of 1-D convolutional layers. The rest of the architecture comprises the primary capsule layer, the activity capsule layer, and the output layer. Figure 2 shows the structure of the proposed 1D-HARCapsNet from the input to the output.
The input data has 80 3D vectors (80 × 3). The model feeds the data to three consecutive levels of convolution layers (3-Conv1D) with input sizes of 80 × 3, 51 × 256 and 42 × 512, respectively. Next, it uses the primary capsule convolution layer of size 40 × 1024, whose output is sent to the fully connected activity layer that produces a scalar vector. Finally, this vector is passed to the output layer, which generates the most likely target class. Table 1 illustrates the structure of the proposed 1D-HARCapsNet model.
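To make the 80 × 3 input shape concrete, the following sketch segments a tri-axial accelerometer stream into fixed-length windows. The window length of 80 comes from the text; the step of 40 samples (50% overlap), the function name `make_windows`, and the zero-filled placeholder signal are assumptions made only for illustration.

```python
import numpy as np

def make_windows(signal, window=80, step=40):
    """Cut a (num_samples, 3) signal into (num_windows, window, 3) segments.

    The step size is a hypothetical choice; the text only fixes window=80.
    """
    starts = range(0, len(signal) - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

# Placeholder stream: 400 readings of the X, Y and Z axes.
signal = np.zeros((400, 3))
windows = make_windows(signal)
print(windows.shape)  # (9, 80, 3)
```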

The 3-Conv1D layer
Input data samples of size (80 × 3), that is, 80 data points wide with a height of three data points, are fed into a sequence of three Conv1D levels with different activation functions to construct the feature maps. The first level of the 3-Conv1D implements 256 filters with a kernel size of (30 × 30) and uses the tanh activation function to calculate the hyperbolic tangent of the given input. Its output is 51 data points wide and 256 data points high, and it is sent to the next level. The second level implements 512 filters with a kernel size of (10 × 10) and uses the ReLU activation function, which passes the input through unchanged if it is not negative and outputs zero otherwise. The output of this level is 42 data points wide and 512 data points high, and it is sent to the last level of the 3-Conv1D layer. The third level implements 1024 filters with a kernel size of (3 × 3) and uses the tanh activation function. In total, the output of this layer is 40 data points wide and 1024 data points high, and it is sent to the next layer as an array of feature maps for further processing.
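The widths reported above (51, 42 and 40) are consistent with the usual output-length rule of a 1-D convolution. The short check below assumes a stride of 1, no padding, and kernel lengths of 30, 10 and 3 along the time axis; these settings are inferred from the reported sizes rather than stated in the text, and the helper name is hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution along the time axis."""
    return (input_length + 2 * padding - kernel_size) // stride + 1

length = 80  # width of one input sample
lengths = []
for kernel_size in (30, 10, 3):  # assumed kernel lengths of the three levels
    length = conv1d_output_length(length, kernel_size)
    lengths.append(length)

print(lengths)  # [51, 42, 40]
```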

The primary capsule layer
The primary capsule layer is a 1-D convolution (Conv1D) layer with a kernel size of (30 × 30). It applies a reshape function to convert the array of feature maps into the corresponding capsule vectors. Finally, each vector is passed to the squashing function, which scales the length of the vector to a value between 0 and 1.
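The reshape and squash steps can be sketched as follows. The squashing function used here is the standard one from the capsule network literature [34], which shrinks a vector s to length ||s||^2 / (1 + ||s||^2) while keeping its direction. The capsule dimension of 8 and the random feature maps are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Scale each vector to a length in [0, 1) without changing direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

CAPSULE_DIM = 8  # hypothetical capsule dimension

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(40, 1024))  # placeholder 40 x 1024 feature maps

# Reshape the feature maps into capsule vectors, then squash each vector.
capsules = squash(feature_maps.reshape(-1, CAPSULE_DIM))
capsule_lengths = np.linalg.norm(capsules, axis=-1)
```

Short vectors are shrunk towards zero length and long vectors approach length one, so the length of a capsule can be read as a probability-like score.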

The activity capsule layer
It replaces each capsule in the network with its actual class activity by implementing the dynamic routing algorithm. Routing by agreement is based on the ability of the lower capsule (i) in the primary capsule layer to predict the output of the higher capsule (j) in the activity capsule layer. For each capsule i and capsule j, the prediction of the output of capsule j is denoted by $\hat{u}_{j|i}$ and calculated by Eq. 1:
$$\hat{u}_{j|i} = W_{ij} u_i \qquad (1)$$
where $u_i$ represents the output of capsule i and $W_{ij}$ is the weight matrix. Next, the total input $s_j$ to capsule j in the activity capsule layer is calculated using a weighted sum over all the prediction vectors as given in Eq. 2:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \qquad (2)$$
where $c_{ij}$ are the coupling coefficients between capsule i and all the capsules in the higher layer. They are calculated using a routing softmax function as given in Eq. 3:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \qquad (3)$$
where $b_{ij}$ indicates the log prior probability that capsule i is coupled to capsule j, and k runs over the capsules in the higher layer. Finally, the output vector $v_j$ of capsule j is obtained by applying a non-linear squashing function to its total input according to Eq. 4:
$$v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \, \frac{s_j}{\lVert s_j \rVert} \qquad (4)$$
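The routing procedure described in this section can be sketched in NumPy. This is a minimal illustration of routing by agreement as introduced in [34], not the authors' implementation: the prediction vectors are random placeholders, and the shapes (32 lower capsules, 6 activity capsules, 16 dimensions) are hypothetical.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Non-linear squashing: output length lies in [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=5):
    """Route predictions u_hat of shape (num_lower, num_higher, dim)."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))  # log priors, start uniform
    for _ in range(num_iterations):
        # Routing softmax over the higher-layer capsules.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Total input of each higher capsule: weighted sum of predictions.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Agreement: raise b where a prediction aligns with the output.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 6, 16))  # hypothetical prediction vectors
v, c = dynamic_routing(u_hat, num_iterations=5)
```

The number of routing iterations is a hyper parameter; a lower capsule whose prediction agrees with a higher capsule's output gets a larger coupling coefficient on the next iteration.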

The output layer
The output layer is a fully connected layer that consists of 240 sigmoid units. It predicts the most likely target class activity y based on the scalar vector x, as illustrated in Eq. 5.

Evaluation and results
In the evaluation process, four widely used criteria are applied: accuracy, precision, recall, and F-measure. All four criteria depend on the confusion matrix [36].

Evaluation criteria
Multiple performance evaluation criteria are used to verify the improvement of the proposed model over other existing models. The confusion matrix [36] is one of the most widely used evaluation tools in the field of machine learning. If a sample is positive and is predicted as such, it is considered a True Positive (TP); if it is negative and is predicted as such, it is considered a True Negative (TN). If it is negative and classified as positive, it is considered a False Positive (FP). If it is positive and classified as negative, it is considered a False Negative (FN). The confusion matrix values are used for computing other important metrics such as geometric mean, accuracy, error rate, recall, and F1-measure. Accuracy, precision, and recall [37] are computed from these values (Eqs. 6 to 8). F1-measure [37], also called F-measure, is the harmonic mean of precision and recall, as given in Eq. 9:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (9)$$
Fig. 4 Evaluation results of the suggested test cases (9-16) for 1D-HARCapsNet
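As a worked illustration, the four criteria can be computed directly from the confusion-matrix counts using their standard definitions. The counts below are made-up numbers for a single binary case, not results from the paper, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one class treated as the positive class.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")
```

For a multi-class problem such as HAR, these quantities are usually computed per class and then averaged; which averaging scheme the paper uses is not stated.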

Wireless sensor data mining (WISDM) dataset
The WISDM time-series dataset is used for the HAR task. It was collected using the tri-axial accelerometer sensor available on most Android smartphones [32].

The hyper parameters of the proposed 1D-HARCapsNet
This paper introduces the 1D-HARCapsNet model with the following hyper parameters. The number of epochs is 25 or 50, the learning rate is 0.001 or 0.002, the number of routing iterations is 5 or 10, and the initial weights are 0.002, 0.003, 0.004 or 0.005, as illustrated in Table 3.
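Taking every combination of these values gives 2 × 2 × 2 × 4 = 32 configurations. The sketch below enumerates them; the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# Hyper parameter values as listed in the text; key names are hypothetical.
grid = {
    "epochs": [25, 50],
    "learning_rate": [0.001, 0.002],
    "routing_iterations": [5, 10],
    "initial_weight": [0.002, 0.003, 0.004, 0.005],
}

# One dictionary per test case, covering the full Cartesian product.
test_cases = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(test_cases))  # 32
```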

Recognition experiments of the proposed 1D-HARCapsNet
We conducted our experiments on the Kaggle cloud, where the dataset was split into 80% for training and 20% for testing. Table 4 shows the hardware specifications used.
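An 80/20 split of this kind can be sketched as follows. Whether the original split was shuffled or stratified is not stated, so the random permutation, the seed, and the function name are assumptions.

```python
import numpy as np

def train_test_split_indices(num_samples, test_fraction=0.2, seed=0):
    """Return shuffled index arrays for the training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    num_test = int(round(num_samples * test_fraction))
    return order[num_test:], order[:num_test]

# Example with a hypothetical dataset of 1000 windows.
train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```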
The performance of the proposed 1D-HARCapsNet model is compared with the conventional one-dimensional deep capsule network architecture [35] using the same hyper parameters indicated in Table 3. A series of experiments was conducted by constructing 32 different test cases from the suggested hyper parameters. Table 5 illustrates the variation of the conventional CapsNet recognition results. The best achieved results are 90.11% accuracy, 91.81% precision, 89.94% recall and 0.903 F-measure. Table 5 results on the WISDM dataset using the modified architecture without applying Random SMOTE. Table 7 illustrates the variation of the proposed system recognition results. In this model, the data is balanced using the Random-SMOTE algorithm and then fed to the proposed 1D-HARCapsNet structure with the above-mentioned hyper parameters.
Accuracy [37] is the rate of correctly predicted samples, that is, the ratio of correctly predicted samples to the total number of samples. Owing to its straightforward meaning, it is one of the most used metrics in machine learning evaluation, as given in Eq. 6:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$
Precision (positive predictive value) [37] is the ratio of correctly predicted positive samples to the total number of samples predicted as positive, as given in Eq. 7:
$$Precision = \frac{TP}{TP + FP} \qquad (7)$$
Recall, also known as hit rate, true positive rate (TPR), or sensitivity [37], is the ratio of correctly predicted positive samples to the total number of positive samples in the dataset, as given in Eq. 8:
$$Recall = \frac{TP}{TP + FN} \qquad (8)$$
Table 8 illustrates the obtained accuracy, precision, recall and F-measure of our proposed model compared with the state-of-the-art models [38][39][40][41][42][43][44][45] on the raw version of the WISDM dataset. The proposed model has the highest accuracy, 98.67%.
In second place, Spatio-Temporal Deep Learning [46] has an accuracy of 98.53%; in third place, Deep learning low power device [41] has an accuracy of 98.2%; and in fourth place, CNN + BLSTM [44] has an accuracy of 97.8%. Based on precision, the proposed model achieved the highest value of 98.66%. In second place, Random Forest Classifier [43] has a precision of 98.1%, while in third place, CNN + BLSTM [44] has a precision of 97.8%. Based on recall, the proposed model achieved the highest value of 98.67%. In second place, Random Forest Classifier [43] has a recall of 98.1%, while in third place, CNN + BLSTM [44] has a recall of 97.8%. Based on F-measure, the proposed model achieved the highest value of 0.987. In second place, Random Forest Classifier [43] has an F-measure of 0.981, while in third place, CNN + BLSTM [44] has an F-measure of 0.978. Overall, the proposed model performed the best across the four performance evaluation criteria.

Conclusion and future work
In this paper, a modified version of the 1-D capsule neural network called 1D-HARCapsNet was proposed to provide an efficient intelligent decision support approach for recognizing human activity. We implemented the Random-SMOTE algorithm to handle the imbalanced behavior of the WISDM dataset. The proposed model comprises four layers: the 3-Conv1D layer, the primary capsule layer, the activity capsule layer, and the output layer. The experimental results were evaluated on the raw version of the WISDM dataset. The performance was assessed based on four criteria: accuracy, precision, recall, and F-measure. Compared to the state-of-the-art algorithms, the proposed model proved its ability to recognize human activity and outperformed the others.
In future studies, we suggest using the Gray Wolf Optimizer (GWO) [50] for feature selection to further improve the performance beyond the state-of-the-art algorithms. GWO helps reduce the effects of noise and data redundancy on the overall performance of the system, especially accuracy. Also, in future work, the proposed model will be optimized for different embedded devices so that the classifier can be embedded within power-constrained microcontrollers, while ensuring the security of the user's data and preserving its privacy.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.