Human activity recognition from sensor data using spatial attention-aided CNN with genetic algorithm

Capturing the time and frequency relationships of time series signals poses an inherent challenge for automatic human activity recognition (HAR) from wearable sensor data. Current recurrent, convolutional, and hybrid activity recognition models struggle to extract spatiotemporal context from the feature space of a sensor reading sequence. The large feature maps that these models generate also degrade the overall classification accuracy. To this end, we put forth a hybrid architecture for wearable sensor data-based HAR. We first use the Continuous Wavelet Transform to encode the time series of sensor data as multi-channel images. Then, we utilize a Spatial Attention-aided Convolutional Neural Network (CNN) to extract higher-dimensional features. To find the most essential features for recognizing human activities, we develop a novel feature selection (FS) method. To assess the fitness of the features for FS, we first employ three filter-based methods: Mutual Information (MI), Relief-F, and minimum redundancy maximum relevance (mRMR). The best set of features is then chosen by removing the lower-ranked features using a modified version of the Genetic Algorithm (GA). The K-Nearest Neighbors (KNN) classifier is then used to categorize human activities. We conduct comprehensive experiments on five well-known, publicly accessible HAR datasets, namely UCI-HAR, WISDM, MHEALTH, PAMAP2, and HHAR. Our model significantly outperforms the state-of-the-art models in terms of classification performance. We also observe an improvement in overall recognition accuracy when the GA-based FS technique is used with a lower number of features. The source code of the paper is publicly available at https://github.com/apusarkar2195/HAR_WaveletTransform_SpatialAttention_FeatureSelection.


Introduction
Human activity recognition (HAR) is an emerging topic of research in the larger fields of ambient computing and context-aware computing. Recognizing daily life activities is becoming increasingly important in pervasive computing, with many applications such as intelligent surveillance systems [1], healthcare [2], abnormal behavior detection [3], human-computer interaction [4,5], and assistance for elderly people to improve their quality of life. HAR frameworks provide a way to sense, recognize, and classify specific movements or activities of a person using the data obtained from various sensors. A typical supervised HAR framework can be divided into basic blocks consisting of sensor data acquisition, segmentation of the raw data into fixed-size windows, feature extraction, and finally classification. Each activity is represented by one or more fixed-size feature vectors extracted in the feature extraction step. These feature vectors are used for training the classifier.
Based on the type of sensor used, HAR can be broadly categorized into vision-based HAR and wearable sensor-based HAR. The vision-based technique recognizes and classifies activities by analyzing video or images [6,7] captured using a camera. Though vision-based techniques have a mature theoretical basis, they suffer from various limitations such as sensitivity to ambient light, camera position, potential obstacles, and invasion of privacy, which make them difficult to use in real-life applications. Wearable sensors and the inertial sensors of smart devices have become more promising ways of collecting human activity data, as they are easy to use, small in size, and non-intrusive for subjects. Besides, these sensors have little or no installation cost and low energy consumption. Smartphones and smartwatches have become a convenient option for HAR as they come with various embedded sensors such as the accelerometer, gyroscope, magnetometer, and compass.
For activity prediction tasks, generalizing a model across different activities and sensors is very challenging. The activity signal pattern may vary significantly from person to person, as different people perform the same activity differently. Even for a single person, the same activity can produce different signal patterns when performed at different times. Conversely, different activities can have similar signal patterns, which makes the activity classification task more confusing and challenging.
In the recent past, researchers have introduced several handcrafted feature extraction methods to extract various spatiotemporal features from the raw sensor data. Traditional supervised machine learning techniques, such as Support Vector Machine (SVM) [8][9][10], K-Nearest Neighbors (KNN) [11][12][13], Decision Tree [14], and Ensemble approaches [15,16], are then used for classification. However, this approach has certain limitations, such as the requirement of domain expertise and rigorous data pre-processing. Also, failing to establish a proper spatial and temporal relationship among handcrafted features limits the flexibility of these approaches.
Recently, deep learning techniques have gained popularity among researchers. The ability to detect features automatically from the raw data and to learn features at deeper levels gives deep learning techniques an edge over the traditional machine learning techniques. Several deep learning models have been successfully applied in different areas like natural language processing [17], image segmentation [18], classification [19], etc. Specifically, convolutional neural networks (CNNs) are well known for producing outstanding results in image recognition [20,21]. Meanwhile, reformulating the features of time series data as visual cues has attracted much attention among computer scientists [22], and describing features as visual cues has been reported as the most successful approach [23]. Time series data can be encoded into corresponding activity images so that supervised and unsupervised learning techniques from computer vision, specifically CNNs, can be applied to perform image recognition.
Note that a feature extraction procedure may produce some irrelevant or redundant features, which enlarge the overall feature space. This is also true for the feature vectors produced by a deep learning model. Hence, these irrelevant features must be eliminated in order to ensure good classification accuracy and lower computational time. A feature selection (FS) algorithm tries to improve the performance of a learning algorithm and decrease its time and space requirements. FS algorithms can be divided into two categories: wrapper and filter. A wrapper method uses a classifier to calculate the fitness of each candidate solution (i.e., a subset of features) and thereby selects the subset of features that has the best fitness score. On the other hand, filter-based methods rank the features in order of their importance and eliminate the less important features. Since filter methods do not need a learning algorithm, they tend to be faster than wrapper methods in general. However, wrapper methods are known to generate better classification accuracy than filter methods [24]. FS is an NP-hard problem, as there are 2^n possible solutions for a feature space containing n features. Determining the best solution by evaluating all possible solutions is not feasible, as the computational time required would be prohibitively high. Hence, a feasible alternative is to perform a guided search over the feature space using a heuristic strategy. This not only decreases the computational time significantly but also produces a near-optimal solution.
In this paper, we propose an architecture that encodes sensor data into corresponding images, together with a spatial attention-aided CNN model that allows HAR to be carried out as an image recognition task. However, the feature set produced by this CNN model is quite large. To this end, we propose an FS approach that selects the optimal feature subset by eliminating the irrelevant feature attributes, which also saves computational time and memory. This implies that we use the said CNN model as a deep feature extractor only. For FS, a modified version of the Genetic Algorithm (GA) [25] is used. Rather than utilizing a time-consuming classifier in each iteration, we utilize three filter techniques, specifically Mutual Information (MI) (entropy based), ReliefF (distance based), and Minimum Redundancy Maximum Relevance, or mRMR (statistic based). These three methods rank the features obtained from the CNN model. We re-rank the features using the mean of the ranks given by the three filter methods. These ranks are used as the fitness of the candidate solutions (i.e., chromosomes). We also propose a guided mutation strategy, which aims to increase the fitness of the individual chromosomes. The reduced feature set is then fed to the KNN classifier for classification and for evaluating the accuracy of the overall HAR model.
The key contributions of the proposed work are as follows:
1. We have proposed a unique image encoding framework based on Continuous Wavelet Transform (CWT) to convert the sensor data into the corresponding spatiotemporal representation.
2. A spatial attention-aided CNN model is used to extract image features from the encoded images.
3. In order to reduce computational overhead, we have introduced a modified GA-based feature selection framework that uses three filter-based methods to determine the fitness of each candidate solution.
4. We have also proposed a guided mutation technique as an improvement over random mutation to increase the fitness score of each candidate solution.
The rest of the paper is structured as follows. Section 2 describes some relevant methods proposed by other researchers. Details of the proposed method are given in Sect. 3. In Sect. 4, we report the results of the proposed model when evaluated on five benchmark HAR datasets. In Sect. 5, we further discuss our findings. Finally, we conclude the paper in Sect. 6.

Related work
Deep learning-based models have achieved outstanding results in a variety of fields including HAR, as mentioned in recent surveys [26,27]. Many state-of-the-art models have been developed using various deep learning techniques like CNN, Recurrent Neural Network (RNN), etc. CNN models have shown considerable promise and achieved higher recognition accuracy than other state-of-the-art methods. Nair et al. [28] used the Temporal CNN architecture, a class of temporal models built on a hierarchy of temporal convolutions, which was able to take variable-length sequence data and learn long-term dependencies. Münzner et al. [29] proposed a CNN-based sensor fusion technique to solve the problems of normalization and fusion of multimodal sensors. In [30][31][32][33][34], the authors used various CNN architectures to improve the recognition accuracy of HAR. Ensembles of CNN models, which aim to achieve better performance than the individual models, are found in [35][36][37].
RNN, another deep learning technique, has also been extensively used by many researchers for HAR. RNNs are particularly suited to learning from sequential data. For example, long short-term memory (LSTM)-based networks can learn long-term dependencies from any sequence of data, which makes them well suited to wearable/inertial sensor-based HAR. Preeti Agarwal and Mansaf Alam [38] developed a lightweight model using a shallow RNN combined with LSTM for activity recognition. The authors in [39][40][41][42] used LSTM-based architectures to learn spatiotemporal features for the classification of human activities. Researchers have also proposed various hybrid models, such as combinations of CNN-RNN [43], CNN-LSTM [44][45][46][47][48], LSTM-CNN [49], and CNN-GRU (Gated Recurrent Unit) [50], and achieved significant improvement in recognition accuracy. Inspired by the recent success of deep learning techniques, especially CNNs, in computer vision, encoding time series data as images has gained acceptance among researchers. This method allows the machine to visually recognize and classify by learning visual patterns and structures. Zhiguang Wang and Tim Oates [22] introduced two frameworks for encoding time series data as images, known as Gramian Angular Field (GAF) and Markov Transition Field (MTF). They used Tiled CNNs to classify the single GAF and MTF images as well as the compound GAF-MTF images. The authors in [51] found that various time series features are not evident in the temporal domain but are present in the frequency domain. As an alternative graphical representation for time series classification, they investigated the use of recurrence plots, proposed a method capable of extracting texture features from that graphical representation, and used those features to classify time series data. In their work, Garcia-Ceja et al. [52] proposed a similar approach. They modeled the physical activity as a set of recurrence plots' distance matrices to capture temporal patterns in the signal.
Afterward, a CNN was used to classify the distance matrices and obtain the final prediction. In [53], the authors experimentally found that the image representation of time series data introduces feature types that are not available in 1D sensor data. Hence, they first encoded the sensor signal as a 2D texture image using a recurrence plot to visualize the recurrent nature of a trajectory through phase space. Then, they used a CNN model to learn different levels of features from the texture images. To address the variability in the distinctive region scale and sequence length, Zhang et al. [54] proposed a two-stage approach, in which they first encoded the sensor data using Multi-scale Signed Recurrence Plots (MS-RP), an improvement on the recurrence plot, and then applied a Fully Convolutional Network and ResNet to handle these images. Hur et al. [55] proposed a novel encoding technique for converting an inertial sensor signal into an image with minimal distortion, namely Iss2Image (Inertial sensor signal to Image). Iss2Image divided each real-valued sensor reading into three parts (the integer part, the first two decimal places, and the next two decimal places) and then encoded them as a three-channel image. Finally, a CNN model was used for image-based activity classification. Another similar encoding technique was proposed by Daniel et al. in [56]. The proposed INIM framework first encoded the sensor's signal into 3D RGB images and then used a residual network trained on the ImageNet dataset [57] for activity recognition. Qin et al. [58] introduced a novel method to encode time series data into two-channel GAF images by unifying global and local time series features. Then, they presented a fusion ResNet framework, which learned the pixel correspondences of the generated GAF images between acceleration and angular velocity features. Almost similar work was done by the authors in [59].
Contrary to the previous work, they used four different types of activity images and made each one multimodal by convolving it with two spatial domain filters: the Prewitt filter and the high-boost filter. ResNet-18 was used to extract the deep features from multi-modalities and fused by canonical correlation-based fusion. Finally, a multi-class SVM was used for activity recognition. In [60], the authors have implemented the idea of transforming the 1D signal into 2D using Fast Fourier Transform (FFT). This frequency-domain image was called the spectrogram, which represents the composition of a signal from several frequencies over time and acts as an input to a three-layered CNN model for features extraction and classification. Lawal et al. [61] in their work encoded sensors signal into spectrogram using Short-Time Fourier Transformation (STFT). A simplified two-stream VGG-Net [20] like CNN architecture was proposed for activity and location recognition.
A few researchers have also tried to choose the relevant features utilizing various FS-based techniques [62,63] for improving the overall accuracy in the field of activity recognition. Buenaventura et al. [64] proposed a HAR model based on sensor fusion in smartphones which used a filter-based method to rank the features. An enhanced HAR method was proposed by Fan et al. [65] where Bee Swarm Optimization (BSO) with a deep-Q-network was used. Dewi et al. [66] performed a comparative study on HAR datasets using four classifiers, namely Random Forest (RF), SVM, KNN, and Linear Discriminant Analysis (LDA), from which it was concluded that RF has the highest accuracy. Nguyen et al. [67] proposed a position-based FS method for body sensors for daily activity recognition. Filter-based methods were used to reduce the feature set, followed by a correlation-based optimization and a classifier to determine the overall accuracy of the proposed method.
GA is one of the oldest and most widely used metaheuristic algorithms and has been explored by numerous researchers in various domains such as image contrast enhancement, class imbalance, stock price prediction, image segmentation, medical diagnosis, image steganography, feature selection, etc. Saitoh [68] proposed an image contrast enhancement technique based on GA that assessed an individual's fitness by evaluating the intensity of spatial edges included in the image. GA was used to search for a solution in global space, and the original gray image was converted to a contrast-enhanced image by observing the relationship between the input and output gray levels. In [69], an efficient image contrast enhancement using GA and a fuzzy intensification operator was proposed, which improved the visibility information of an image by manipulating the image intensity information. A novel oversampling approach was introduced by Arun et al. [70] to address the class imbalance problem using GA. Synthetic samples of the minority class were generated based on the distribution measure, which ensured that the samples were efficient and diverse within each class. Experimental results indicated that the GA-based oversampling approach improved the fault prediction performance and reduced the false alarm rate. Ha et al. [71] proposed a novel undersampling method using GA for imbalanced data classification. The performance of the prototype classifier was maximized by minimizing the loss between distributions of original and undersampled majority objects.
A novel method for stock market forecasting with Artificial Neural Network (ANN) and GA was proposed by Sharma et al. [72]. The dataset was partitioned into training, testing, and validation sets, and the stock data of the COVID-19 period were used for model validation. Furthermore, in [73] a combination of GA and LSTM was proposed for stock prediction. In the initial step, GA was used to obtain ranked important factors, and finally, the optimal factors along with LSTM were used for prediction. Chun et al. [74] proposed a robust image segmentation using GA with a fuzzy measure. A fuzzy validity function was proposed which measured the degree of separation and compactness within the finely segmented regions. To maximize the quality of regions obtained by split and merge processing, a usable region segmentation was searched using GA. In [75], an image segmentation method with GA was proposed where GA was used for segmenting the images into four gray classes. A cardiovascular disease prediction method using GA and a neural network was proposed by Amma [76], where the weights of the neural network were determined using GA, which provided a good set of weights in a few iterations. Initially, the dataset was pre-processed, followed by training the system and storing the final weights, which were finally used for predicting the risk of cardiovascular disease. Uyar et al. [77] proposed a GA-based trained recurrent fuzzy neural network (RFNN) method for the diagnosis of heart diseases. Hossain et al. [78] introduced a secured image steganography method based on GA and ballot transform for the integrity of important files over the internet. In addition to achieving a good accuracy, various parameters such as precision, F-score, probability of misclassification error, mean square error, etc. were also calculated.
Owing to the success of GA in solving various complex optimization problems, many researchers have used GA for FS, which is a binary optimization problem. Some areas where GA is used as an FS method are: microstructural image classification [79], cancerous gene identification [80], handwritten Devanagari numeral recognition [81], handwritten Bangla word recognition [82], handwritten Bangla, Devanagari and Roman numeral classification [83], video and sensor-based HAR [62], etc. Rostami et al. [84] developed a novel community-based FS method to group similar features into feature clusters. This method predicted the number of feature clusters automatically, hence eliminating the need to determine it beforehand. GA was then applied to select the optimum subset of features by defining an objective function with an importance value attached to each feature subset. In [85], a novel cancer classification technique was proposed using deep learning and GA. It was applied to determine and classify the cancer types from the publicly available gene expression data. Tian et al. [86] proposed a deep learning model selection framework based on GA for visual data classification. The framework automated the process of identifying the most relevant and useful features generated by pre-trained models for different tasks. In [87], a deep learning method was developed to classify different brain activities along with GA to eliminate the redundant features. Various deep learning models, namely X_axis Classification Model (XCM), Y_axis Classification Model (YCM), and Z_axis Classification Model (ZCM), were used for this purpose. These models were used to classify among the vision, movement, and forward brain activities, followed by an effective combination method based on GA and the Genetic Weighted Summation (GWS) rule.
In 2019, Ghosh et al. [88] introduced a combination of GA and PSO for feature selection which utilized the exploitation ability of GA with the exploration capacity of PSO. Guha et al. [89] proposed a deluge-based GA to strengthen the exploitation ability, which performed well on the well-known UCI datasets. In 2021, Kilicarslan et al. [90] proposed a hybrid model based on GA and deep learning for nutritional anemia disease classification. GA was used to optimize the hyperparameters of Stacked Autoencoder (SAE) and CNN models. The proposed method achieved an accuracy of 98.50% when applied on a real anemia dataset. Ince [91] proposed a deep learning and GA-based intelligent and automatic content visualization system. The method segmented the input image into panoptic image instances and used these to generate new images using GA. The results showed that the said method was efficient at creating visually enhanced content for digital use.
Motivations: From the above discussion, it can be concluded that many researchers around the world have tried to classify human activities by analyzing activity images. It can be observed that recognizing human activities from sensor data has always been an interesting and challenging task. Some activities such as running and walking are easy to recognize. However, there are some complex activities which are relatively difficult to classify. Developing an efficient activity recognition model can lead to progress in many potential fields such as health, sports, and understanding the psychological state of a person. For this purpose, machine learning and deep learning-based methods have contributed significantly to the development of competent HAR models. However, many of these methods use heavy networks (mainly deep learning-based methods) and some even produce lower classification accuracy due to the use of some irrelevant features. On the contrary, FS-based techniques not only speed up the process (i.e., take less computational time) but also increase the classification accuracy significantly. However, wrapper-based FS techniques, which use a learning algorithm, are slower than filter-based methods. Keeping the above facts in mind and to further speed up the process, a modified version of the GA method is proposed here, which uses three filter-based methods to calculate the fitness of the chromosomes and thus replaces a classifier-based fitness function in GA. The proposed method has been evaluated on five publicly available datasets. It is observed that this method is much faster than the traditional GA, and the overall framework also outperforms many existing methods in terms of classification accuracy.

Proposed method
Here in this section, we first briefly discuss the proposed activity image encoding technique. Then, we explore the features extraction process from the encoded images. Finally, we present the proposed novel FS technique used for HAR. Figure 1 shows the working procedure of the proposed framework.

Continuous wavelet transform
The wavelet transform has been applied in time-frequency analysis and spatial domain signal analysis over the years, and it is one of the most effective mathematical tools used for signal processing. A wavelet transform is a convolution of the signal with a set of functions derived from translations and dilations of a primary function. The primary function is known as the mother wavelet, and the translated or dilated functions are referred to as wavelets.
A wavelet is a rapidly decaying, wave-like oscillation defined as a function $\psi(t) \in L^{2}(\mathbb{R})$ with zero mean that exists for a finite duration and is localized in both time and frequency. By scaling and translating this wavelet $\psi(t)$, we can produce a family of wavelets using Eq. (1):

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \qquad (1)$$

where $a, b \in \mathbb{R}$ and $a > 0$. Here, $a$ is known as the scaling parameter, and $b$ is the translation value. The wavelet transform of a continuous signal with respect to the wavelet function $\psi(t)$ is defined in Eq. (2):

$$W(a,b) = \int_{-\infty}^{\infty} x(t)\,\psi^{*}_{a,b}(t)\,dt \qquad (2)$$

where $x(t)$ is a time-domain signal and $\psi^{*}_{a,b}(t)$ is the complex conjugate of the scaled and translated mother wavelet. From Eqs. (1) and (2), we get Eq. (3), which defines the CWT:

$$\mathrm{CWT}(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right) dt \qquad (3)$$

The CWT is thus the inner product of the signal $x(t)$ with a continuous wavelet $\psi(t)$ scaled by the parameter $a$ and translated by the value $b$. The pseudocode for the CWT is shown in Algorithm-1.
The outputs of the CWT are CWT coefficients, which reflect the similarity between the analyzed signal and the wavelet. These coefficients can be represented as a 2D image equivalent to the power spectrum, where time and scale/frequency are the 2 dimensions. However, the CWT coefficients depend on the choice of the mother wavelet.
One of the main advantages of the wavelet transform is the wide variety of wavelets available, from which one can choose the wavelet that best matches the shape of the signal. In this work, we use the Gaussian Derivative Wavelets, specifically the fifth-order derivative of the function given in Eq. (4):

$$f(t) = C\,e^{-t^{2}} \qquad (4)$$

where C is the order-dependent normalization constant. The fifth-order Gaussian Derivative wavelet is a real-valued odd function, which is anti-symmetric around zero. The shape of the fifth-order Gaussian Derivative wavelet and various scaled wavelets is shown in Fig. 2.
Because the wavelet is a real-valued function, its imaginary part is zero.
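The stated properties of the fifth-order Gaussian Derivative wavelet can be checked numerically. The following is a minimal sketch, assuming the underlying Gaussian is exp(-t^2) and choosing the constant C numerically so that the wavelet has unit L2 norm; sign and normalization conventions differ between libraries, so the helper names `gaus5` and `scaled_wavelet` and the grid are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gaus5(t):
    """Fifth derivative of exp(-t^2), up to the normalization constant C.

    Uses d^n/dt^n exp(-t^2) = (-1)^n H_n(t) exp(-t^2) with the physicists'
    Hermite polynomial H_5(t) = 32 t^5 - 160 t^3 + 120 t.
    """
    t = np.asarray(t, dtype=float)
    return -(32 * t**5 - 160 * t**3 + 120 * t) * np.exp(-t**2)

def scaled_wavelet(t, a, b=0.0):
    """Scaled and translated wavelet psi((t - b) / a) / sqrt(a)."""
    return gaus5((t - b) / a) / np.sqrt(a)

# Symmetric grid around zero (assumed range and resolution).
t = np.linspace(-20.0, 20.0, 4001)
dt = t[1] - t[0]

# Assumption: pick C so that the wavelet has unit energy on this grid.
C = 1.0 / np.sqrt(np.sum(gaus5(t) ** 2) * dt)
psi = C * gaus5(t)
psi_a2 = C * scaled_wavelet(t, a=2.0)

is_odd = bool(np.allclose(psi, -psi[::-1]))   # psi(-t) == -psi(t)
mean = float(np.sum(psi) * dt)                # zero-mean condition
energy_a2 = float(np.sum(psi_a2 ** 2) * dt)   # 1/sqrt(a) keeps the energy fixed

print(is_odd, mean, energy_a2)
```

The 1/sqrt(a) factor is what keeps the energy of every scaled wavelet equal to that of the mother wavelet, so coefficients at different scales remain comparable.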

Inertial sensor to image encoding using CWT
In order to encode the raw sensor time series data into an image form, we use the 1D CWT, which takes 1D time series as input and generates a 2D frequency-time domain scalogram. This scalogram is nothing but the CWT coefficients. Figure 3 depicts the image encoding process.
Performing CWT on the entire time series dataset is practically infeasible. Hence, instead, we perform CWT on each sample of size t × c, where t is the number of timestamps and c is the total number of sensor channels. The pseudocode for the CWT-based image encoding technique is given in Algorithm-2. The values of t and c vary from dataset to dataset. Each of the c channels is a 1D time series and acts as the input to the CWT. We use t as the scale parameter. For each such sensor channel, we get a t × t scalogram as the output. Hence, for one sample, we get a c-dimensional t × t scalogram where each dimension corresponds to one sensor channel. In this way, we encode every activity of a dataset as a t × t × c-dimensional image.
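The encoding of one t × c window into a t × t × c image can be sketched as follows. This is a simplified illustration, not the authors' Algorithm-2: it discretizes the CWT integral directly with a unit sampling step, uses an unnormalized fifth-order Gaussian derivative wavelet, and assumes that "t as the scale parameter" means the integer scales 1 to t. The function names and the random test window are hypothetical.

```python
import numpy as np

def gaus5(u):
    # Fifth derivative of exp(-u^2), up to a constant factor (assumed convention).
    return -(32 * u**5 - 160 * u**3 + 120 * u) * np.exp(-u**2)

def cwt_1d(x, scales):
    """Direct discretization of the CWT: rows index scale a, columns index shift b."""
    n = len(x)
    times = np.arange(n)
    out = np.empty((len(scales), n))
    for i, a in enumerate(scales):
        # kernel[b, t] = psi((t - b) / a) / sqrt(a); the wavelet is real,
        # so the complex conjugate has no effect.
        kernel = gaus5((times[None, :] - times[:, None]) / a) / np.sqrt(a)
        out[i] = kernel @ x
    return out

def encode_window(window):
    """Map a (t, c) sensor window to a (t, t, c) multi-channel scalogram image."""
    t, c = window.shape
    scales = np.arange(1, t + 1)  # assumption: scales 1..t give t rows
    return np.stack([cwt_1d(window[:, k], scales) for k in range(c)], axis=-1)

rng = np.random.default_rng(0)
window = rng.standard_normal((32, 3))      # made-up window: 32 timestamps, 3 channels
image = encode_window(window)
flat_image = encode_window(np.ones((32, 1)))  # constant signal, single channel

print(image.shape)
```

Because the wavelet has zero mean, a constant signal produces coefficients close to zero away from the window edges, which is why the scalogram responds to changes in the signal rather than to its offset.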

Feature extraction using spatial attention-aided CNN
A CNN is a deep neural network whose design is loosely inspired by the way the visual cortex of the brain processes stimuli. A typical CNN model can be thought of as a combination of two components: the feature extractor part and the classification part. The hidden layers form the CNN's feature extractor, which consists of a series of convolution layers followed by pooling layers that try to detect complex features and patterns belonging to the images of a particular class by convolving with various filters followed by subsampling. The classification part then utilizes these features and computes the prediction probabilities as output. Even though CNNs perform very well in image classification tasks, the need for a large amount of data for accurate prediction sometimes limits their use as classifiers. As a result, in the current work, rather than using the CNN model as a classifier, we use it only as a feature extractor. Figure 4 shows the architecture of the proposed feature extractor. It mainly consists of a CNN having four convolution layers and spatial attention sub-networks. The spatial attention sub-networks, which are variants of widely used CNNs, use attention modules to refine the feature maps in each convolution layer, thereby enhancing the CNN's learning ability.
Following each convolution layer, we have used a max-pooling layer to lessen data variance and a dropout layer to avoid over-fitting. Before the max-pooling layer, the attention feature maps from the spatial attention sub-network are added to re-calibrate the original features. This layering scheme is repeated three times with a different number of 3 × 3 filters. All neurons of these convolution layers have ReLU (Rectified Linear Unit) as the activation function to learn the nonlinear representation. The details of the network architecture are given in Table 1.
Finally, the output features are flattened and then passed through a fully connected layer, which generates a 1024-dimensional feature vector from the input image.

Spatial attention module
Recently, attention mechanisms have attracted growing interest from researchers and have been widely used with CNN and RNN models in many domains such as computer vision and image processing. This mechanism enables the network to focus more on some discriminative regions in certain time periods, which improves the learning ability of the network. In this article, we design an attention module that focuses on where the informative parts of the encoded images are located.
The proposed spatial attention module generates a spatial attention feature map by utilizing the inter-spatial relationship of features. As shown in Fig. 5, a 1 × 1 convolution layer is first used to fuse the information along the channels, generating a 2D feature map $Y \in \mathbb{R}^{H \times W}$. Then, we apply two 2D convolution layers to generate the spatial attention feature map $Y_{sa} \in \mathbb{R}^{H \times W \times C}$. For these two 2D convolution layers, the number of convolution filters varies, as listed in Table 2.
We use ReLU as the activation function for the convolution layers and padding to avoid any change in spatial size. Finally, we use $Y_{sa}$ to re-calibrate $F^{l}$ using Eq. (5):

$$\left(F^{l}\right)' = F^{l} + Y_{sa} \qquad (5)$$

where $F^{l}$ is the feature map from the previous convolution layer and the addition is element-wise. This $\left(F^{l}\right)'$ acts as the input for the next CNN layer in the network.
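A forward pass through a module of this kind can be sketched in NumPy. The sketch below follows the description in the text (a 1 × 1 convolution that fuses channels, two further 2D convolutions with ReLU and size-preserving padding, and attention maps that are added to the incoming features), but several details are assumptions: the 3 × 3 kernel size of the two convolutions, the intermediate filter count, the absence of an activation on the 1 × 1 layer, and the random weights are all illustrative, since the actual values are given in Table 2 of the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_same(x, w):
    """Stride-1 2D convolution (cross-correlation) with zero padding.

    x: (H, W, C_in); w: (k, k, C_in, C_out) with odd k. Output: (H, W, C_out).
    """
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted view (H, W, C_in) times one kernel tap (C_in, C_out).
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out

def spatial_attention(F, w_fuse, w1, w2):
    """Re-calibrate the feature map F of shape (H, W, C) with a spatial attention map."""
    y = conv2d_same(F, w_fuse)        # 1x1 conv fuses channels: (H, W, 1)
    y = relu(conv2d_same(y, w1))      # first 2D conv:  (H, W, m)
    y_sa = relu(conv2d_same(y, w2))   # second 2D conv: (H, W, C)
    return F + y_sa                   # attention map added to the original features

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 4))             # made-up feature map, C = 4
w_fuse = rng.standard_normal((1, 1, 4, 1))
w1 = rng.standard_normal((3, 3, 1, 2)) * 0.1   # assumed 3x3 kernels, m = 2
w2 = rng.standard_normal((3, 3, 2, 4)) * 0.1
out = spatial_attention(F, w_fuse, w1, w2)
identity = spatial_attention(F, w_fuse, w1, np.zeros_like(w2))

print(out.shape)
```

Because the attention map passes through ReLU before it is added, the module can only raise activations in this sketch; with zero weights in the last convolution it reduces to the identity, so the attention branch acts as a learned residual correction.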

Feature selection
Feature extraction using the CNN produces a high-dimensional feature vector, which needs to be processed by the classifier. Often, only a small subset of these features is important. The remaining features are redundant or insignificant and only tend to increase the computational time and space. Moreover, the presence of these redundant features also decreases the classification accuracy. To address this issue, FS has been performed on the set of features obtained from the said CNN model. In the proposed method, we have used GA as the unsupervised FS algorithm, and three different filter methods are used to calculate the fitness of each chromosome in the population of GA.
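As stated in the Introduction, the features are re-ranked by the mean of the ranks assigned by the three filter methods, and these ranks drive the fitness of the chromosomes. A minimal sketch of that aggregation is given below. The filter scores are made-up numbers, and `chromosome_fitness` is a hypothetical choice (the inverse of the mean aggregated rank of the selected features), since this section does not specify how per-feature ranks are combined into a single fitness value.

```python
def ranks_from_scores(scores):
    """Rank features by score: the highest score receives rank 1."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, feature in enumerate(order, start=1):
        ranks[feature] = position
    return ranks

def mean_ranks(*score_lists):
    """Average the per-filter ranks of every feature."""
    rank_lists = [ranks_from_scores(s) for s in score_lists]
    n = len(rank_lists[0])
    return [sum(r[i] for r in rank_lists) / len(rank_lists) for i in range(n)]

def chromosome_fitness(chromosome, feature_ranks):
    """Hypothetical fitness: a lower mean rank of the selected features is better."""
    selected = [r for bit, r in zip(chromosome, feature_ranks) if bit]
    if not selected:
        return 0.0
    return 1.0 / (sum(selected) / len(selected))

# Made-up scores for four features from the three filters.
mi_scores = [0.9, 0.1, 0.5, 0.3]
relief_scores = [0.8, 0.2, 0.1, 0.6]
mrmr_scores = [0.7, 0.3, 0.6, 0.2]

agg = mean_ranks(mi_scores, relief_scores, mrmr_scores)
good = chromosome_fitness([1, 0, 1, 0], agg)   # selects features 0 and 2
bad = chromosome_fitness([0, 1, 0, 1], agg)    # selects features 1 and 3

print(agg, good, bad)
```

Evaluating a chromosome this way needs only a lookup over precomputed ranks, which is why a filter-driven fitness is much cheaper than training a classifier for every candidate subset.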

Filter methods
To calculate the fitness of the individual chromosomes, we rely on three filter-based methods, namely MI, Relief-F, and mRMR.
1. Mutual Information: MI [92] measures the nonlinear relation between two random variables. It quantifies the amount of information obtained about one random variable by observing the other, i.e., the reduction in uncertainty of one variable when the other is known. Hence, a high MI value suggests a large reduction in uncertainty, while a low value suggests a small reduction. It can be calculated using Eq. (6):
$$I(X;Y) = \int_{Y} \int_{X} P_{X,Y}(x, y) \log \left( \frac{P_{X,Y}(x, y)}{P_X(x)\, P_Y(y)} \right) dx\, dy \tag{6}$$
where $P_{X,Y}(x, y)$ denotes the joint probability density function of X and Y, and the marginal density functions are denoted by $P_X(x)$ and $P_Y(y)$. MI measures how similar the joint distribution $P_{X,Y}(x, y)$ is to the product of the factored marginal distributions. It equals zero if and only if the two random variables are independent, and higher values indicate greater dependency.
2. Relief-F: Relief was proposed by Kira and Rendell [93] as a filter method for FS in binary classification problems with discrete or numerical features, using the Euclidean distance measure. Relief assigns a relative weight (score) to each feature and acts as a filter by eliminating the low-ranked features. The feature score is updated according to the feature value differences between neighboring instance pairs. If a feature value differs between an instance and a neighboring instance of the same class (a 'hit'), the feature score falls. On the other hand, if a feature value differs between an instance and a neighboring instance of a different class (a 'miss'), the feature score climbs. However, Relief is limited to two-class problems. Relief-F extends Relief to multi-class problems by searching for the k closest misses in each class and averaging their contributions for updating the weight vector W, weighted by each class's prior probability. When updating the weight of each feature, it takes the average over the k nearest hits and misses. This k can be adjusted based on the dataset in question. Furthermore, Relief-F can handle missing data by employing a conditional probability of feature weights. It is defined by the formula given in Eq. (7),
where $x_{i,j}$, $x^{M_l}_{i,j}$, or $x^{H_l}_{i,j}$ denotes the j-th component of sample $x_i$, its l-th closest Miss $x^{M_l}_i$, or its l-th closest Hit $x^{H_l}_i$, respectively; n is the total number of samples, and K is the number of Misses or Hits considered for each sample.
3. Minimum redundancy maximum relevance: mRMR [94] is a filter ranking approach in FS that ranks features according to their correlation with the class and with each other. Preferably, features with a high correlation with the class (output) and a low correlation between themselves are chosen. For continuous features, correlation with the class (relevance) can be evaluated by the F-statistic values, and the correlation between features (redundancy) can be determined using Pearson Correlation Coefficient (PCC) values. A greedy search is applied to select the features one by one, with the goal of maximizing an objective function determined by relevance and redundancy. The MID (Mutual Information Difference) and MIQ (Mutual Information Quotient) criteria are the two commonly used objective functions, representing the difference and the quotient of relevance and redundancy, respectively. It is calculated using the formula given in Eq. (8),
where i denotes the i-th iteration, f is the feature being evaluated, F is the F-statistic, $f'(i-1)$ denotes the features selected up to iteration $i-1$, and corr is the Pearson correlation.
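To make the MI definition concrete, the following is a minimal sketch of a plug-in MI estimate for discrete variables. It is an illustration only: the function name `mutual_information` is hypothetical, and replacing the density integrals by sums over observed frequencies is an assumption (the CNN features in this work are continuous, and the exact estimator used is not specified here).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences.

    Assumption: probabilities are approximated by relative frequencies,
    which replaces the integrals of the continuous definition by sums.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log(c * n / (count_x[x] * count_y[y]))
    return mi
```

For two independent binary sequences the estimate is zero, and for two identical balanced binary sequences it equals log 2, in line with the properties stated above.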

Genetic Algorithm: an overview
GA is a popular meta-heuristic evolutionary algorithm used for solving complex optimization problems. It is a nature-inspired algorithm with biological operators such as selection, crossover, and mutation. GA comprises the following steps: initial population creation, parent selection, crossover, mutation, and generation of child chromosomes. Initially, a random population is generated with a finite number of chromosomes, each of fixed length and filled with random values. Parent chromosomes are selected from this population and are used to create the child chromosomes through crossover and mutation. A fitness function is defined to evaluate the fitness of each chromosome, which measures the quality of the solution it represents. If the fitness values of the child chromosomes surpass the fitness of some existing chromosomes in the current population, they replace the chromosomes having low fitness values. The resulting population forms the next generation, which goes through the same selection, crossover, and mutation process, and subsequent generations are produced in the same way. Individuals with the least fitness die as new generations form, making room for new offspring. This leads to a near-optimal solution after a fixed number of iterations. A binary version of GA is used in FS, with each chromosome represented as a vector of '0's and '1's. A '0' indicates that the corresponding feature is not selected, whereas a '1' indicates that the corresponding feature is selected.
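The binary encoding and the replacement step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `selected_indices` and `replace_weakest` are hypothetical, and replacing only the single weakest chromosome per child is one possible reading of the replacement rule.

```python
def selected_indices(chromosome):
    """Decode a binary chromosome into the indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]


def replace_weakest(population, fitnesses, child, child_fitness):
    """Replace the lowest-fitness chromosome if the child is fitter.

    Modifies `population` and `fitnesses` in place and returns True
    when a replacement took place.
    """
    worst = min(range(len(population)), key=lambda i: fitnesses[i])
    if child_fitness > fitnesses[worst]:
        population[worst] = child
        fitnesses[worst] = child_fitness
        return True
    return False
```

For example, the chromosome `[1, 0, 1, 1, 0]` selects features 0, 2, and 3 out of five.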

Proposed GA variant
GA is one of the oldest and most classical nature-inspired evolutionary algorithms. Over the years, various researchers have utilized this algorithm in the field of FS and optimization. It has proved to be one of the best-known algorithms for providing a near-optimal subset of features from the whole feature space. Exploration and exploitation are performed by the key operators, i.e., crossover and mutation. Numerous modifications have been suggested by various researchers to improve GA and reach the near-optimal solution. The mutation in GA is decided by a mutation probability, which is quite random in nature. Moreover, the fitness of each candidate solution is determined by a learning algorithm (i.e., a classifier), which is often very time-consuming. Keeping the above facts in mind, we propose a modified version of GA which estimates the fitness of the candidate solutions by aggregating three filter-based methods, thereby reducing the computational time significantly. Also, instead of random mutation, a different mutation method is proposed which improves the fitness of the individual candidate solution. A multi-point crossover is used, and parent selection is done using the Roulette wheel for better exploitation. The pseudocode of the mutation technique is described in Algorithm-3.
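Roulette-wheel parent selection and multi-point crossover, the two standard operators named above, can be sketched as below. This is a generic textbook version, not the authors' implementation: the function names, the explicit `rng` argument, and the choice of cut points are assumptions, and the proposed mutation of Algorithm-3 is deliberately not reproduced.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness.

    Assumes non-negative fitness values; falls back to a uniform choice
    when all fitness values are zero.
    """
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return chromosome
    return population[-1]  # guards against floating-point round-off


def multipoint_crossover(parent_a, parent_b, cut_points):
    """Exchange alternating segments of two parents at the given cut points."""
    child_a, child_b = [], []
    swap = False
    previous = 0
    for cut in sorted(cut_points) + [len(parent_a)]:
        seg_a, seg_b = parent_a[previous:cut], parent_b[previous:cut]
        if swap:
            seg_a, seg_b = seg_b, seg_a
        child_a.extend(seg_a)
        child_b.extend(seg_b)
        swap = not swap
        previous = cut
    return child_a, child_b
```

With cut points 2 and 4, the parents `[0, 0, 0, 0, 0, 0]` and `[1, 1, 1, 1, 1, 1]` exchange their middle segment, giving `[0, 0, 1, 1, 0, 0]` and `[1, 1, 0, 0, 1, 1]`.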

Fitness function
Wrapper-based FS methods generally use a learning algorithm (i.e., a classifier) to evaluate the fitness of the chromosomes. Since GA is commonly used as a wrapper-based method, it follows the same logic; however, this increases the computational time. To overcome this problem, the classifier is replaced by a score computed for each feature vector (i.e., a chromosome) with the help of filter methods, which aids in assessing the strength of each chromosome in an unsupervised way.
A chromosome is a binary vector, with '0' indicating that the feature is not taken and '1' indicating that the feature is taken. By using the three filter methods, we get a filter value (i.e., a score) corresponding to each feature. The filter value of each feature is the average of the values given by the three filter methods. The feature with the maximum filter value is considered the most important, while the feature with the minimum filter value is considered the least important. Hence, to calculate the score of each individual chromosome, we take the mean of the filter values of all the features which are currently '1.' We have described the pseudo-code of the fitness value calculation in Algorithm-4.
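The fitness calculation described above can be illustrated with a short sketch. It is written from the prose description only, not from Algorithm-4: the function names are hypothetical, the three score lists are assumed to be on comparable scales (any normalization step is not specified here), and returning 0.0 for an empty selection is an assumption.

```python
def feature_filter_values(mi_scores, relief_scores, mrmr_scores):
    """Average the three filter scores feature by feature."""
    return [
        (mi + relief + mrmr) / 3
        for mi, relief, mrmr in zip(mi_scores, relief_scores, mrmr_scores)
    ]


def chromosome_fitness(chromosome, filter_values):
    """Mean filter value over the features whose bit is 1."""
    chosen = [value for bit, value in zip(chromosome, filter_values) if bit == 1]
    if not chosen:
        return 0.0  # assumption: an empty feature subset gets the lowest score
    return sum(chosen) / len(chosen)
```

Because the per-feature filter values are computed once, evaluating a chromosome only requires averaging over its selected positions, which is what makes this fitness much cheaper than training a classifier per candidate.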
In FS, we intend to increase the classification accuracy of the problem under consideration and decrease the number of features selected simultaneously. In order to do so, we define a single objective function which estimates the overall fitness of each chromosome (feature subset). This objective function is defined in Eq. 9.
where F is the fitness of the chromosome, $\alpha \in [0, 1]$ represents the relative weightage between the fitness value and the number of features not selected, |F| is the number of features in the given dataset, and |f| is the number of features in the feature subset.
Since we aim to increase the fitness value and reduce the number of features in the feature subset, our objective is to increase the Fitness_overall value.
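As an illustration of how such a single objective can trade off filter fitness against subset size, the sketch below uses a weighted sum of the fitness value and the fraction of unselected features. This specific form is an assumption consistent with the symbol descriptions above and may differ from Eq. 9; the function name `overall_fitness` and the parameter name `alpha` (the weighting parameter) are hypothetical.

```python
def overall_fitness(filter_fitness, n_total, n_selected, alpha):
    """Weighted sum of filter fitness and the share of features left out.

    Assumed form: alpha * F + (1 - alpha) * (|F| - |f|) / |F|,
    where n_total plays the role of |F| and n_selected that of |f|.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    unselected_ratio = (n_total - n_selected) / n_total
    return alpha * filter_fitness + (1 - alpha) * unselected_ratio
```

Under this assumed form, two chromosomes with the same filter fitness are ranked by size: the one selecting fewer features receives the higher overall value.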

Experiments and results
We have performed experiments using five popular and publicly available HAR datasets: UCI-HAR, WISDM, MHEALTH, PAMAP2, and HHAR. This section contains information about the datasets used, the performance metrics, and the results obtained.

Model implementation
The proposed model is built using the Keras API with the TensorFlow backend. For the CWT part, we have used PyWavelets [95], an open-source Python wavelet transform library. The experiments were performed on a laptop with an AMD Ryzen 7 4800H (2.90 GHz) processor, 16 GB of RAM, and an NVIDIA GeForce GTX 1660 Ti GPU with 4 GB of VRAM, running a 64-bit Windows 10 operating system. The feature extractor model is trained under a supervised learning methodology. We have randomly initialized all the weights and biases used for the different layers. The Adam optimizer is used to minimize the sparse categorical cross-entropy loss. The CNN model is trained for 150 epochs with a batch size of 32. Table 3 summarizes the hyper-parameter details used to tune our model.
For FS techniques, we have experimented with different values of various hyper-parameters. Finally, for our proposed method with FS, we have used 10 as the population size, and the crossover probability has been set to 0.6. For the KNN classifier, we have set the k value equal to 5.

Database description
Table 4. Total nine features (body acceleration, total acceleration, and angular velocity signals along the X, Y, and Z axes) were captured using the embedded accelerometer and gyroscope at a constant sampling rate of 50 Hz. The raw signals were first pre-processed
The samples were captured using a smartphone-embedded accelerometer, and the data collection process was controlled using an application that was executed on an Android smartphone. The experiment was carried out on 36 people, and each performed six activities (Walking, Jogging, Sitting, Standing, Upstairs, and Downstairs) with an Android phone in their front leg pocket.
The entire set of data includes a class of 9 people with annotated human activities who had specific physical descriptions. Most of the participants were men, and their dominant hand was the right hand. In actuality, PAMAP2 has only one left-handed and one female subject, with ids 102 and 108, respectively. Each individual was required to adhere to a protocol that included 12 separate tasks. A detailed description of all the activities and the class distribution are shown in Table 7. There are almost 10 h of activity data in this collection. After removing the anomalous data, we have segmented the sensor data by a fixed-length sliding window with 50% overlapping. We have then randomly partitioned the dataset into two parts, where 70% are used for training and the remaining 30% for testing.

Activity details of MHEALTH dataset
Activity | Description | Class distribution (in %)
Front elevation of arms | The subject was raising the right hand up to 90 degrees | 8.58
Jogging | The subject was running outside at a speed of 6-7 km/h | 8.95
Jump front & back | First, the subject leaped forward, and then, without turning, leaped back to the starting position | 3.02
Knees bending | The subject slowly bent both knees and then raised the weight up | 8.55
Lying down | The subject didn't move while lying motionless on a bed | 8.95
Running | The subject was moving forward at a speed of 9-10 km/h | 8.95
Sitting & relaxing | In a relaxed position, the individual was seated in a chair | 8.95
Standing still | The subject did nothing and remained still | 8.95
Waist bends forward | The subject stood steady and reached out to touch the leg with his/her hands | 8.25
Walking | The subject went at a speed of 4-5 km/h in a straight line | 8.95

5. HHAR [101]: The Heterogeneity Dataset for Human Activity Recognition (HHAR) from Smartphones and Smartwatches is a dataset used for assessing the performance of various HAR algorithms (classification, automatic data segmentation, sensor fusion, feature extraction, etc.) that use a variety of sensor types. The collection includes readings from two motion sensors frequently found in smartphones, namely the accelerometer and the gyroscope, that were captured as users carried smartwatches and smartphones while doing some programmed tasks in any sequence. To reflect the sensor heterogeneity which can be anticipated in actual deployments, the dataset is compiled using a variety of device models and use scenarios. This dataset recorded 6 different activities of 9 individuals using 6 types of mobile devices (4 smartphones and 2 smartwatches). Table 8 shows a detailed description and the class distribution of the HHAR dataset. In our experiment, we have used only the smartphone's accelerometer data. We have divided the sensor data into segments using a fixed-length sliding window with 50% overlapping. The dataset is then randomly divided into two sections, with 30% being used for testing and the remaining 70% being used for training.
Table 9 presents the summarized information about the five datasets. The UCI-HAR, WISDM, and HHAR datasets contain 6 activities, but the number of sensors is different. Both the MHEALTH and PAMAP2 datasets contain 12 activities and use additional sensors. HHAR contains the largest number of training and testing samples, whereas PAMAP2 contains more sensors than the rest of the datasets.
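The segmentation and partitioning steps used for the datasets (fixed-length sliding window with 50% overlap, then a random 70/30 split) can be sketched as follows. This is a minimal illustration: the function names, the window length used in the example, and the fixed random seed are assumptions, since the actual window sizes are not stated in this section.

```python
import random


def sliding_windows(signal, window_size, overlap=0.5):
    """Cut a sequence into fixed-length windows with the given overlap ratio."""
    step = max(1, int(window_size * (1 - overlap)))
    last_start = len(signal) - window_size
    return [signal[s:s + window_size] for s in range(0, last_start + 1, step)]


def random_split(items, test_fraction=0.3, seed=0):
    """Shuffle the items and split them into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With a window of 4 samples and 50% overlap, consecutive windows start 2 samples apart, so a 10-sample signal yields windows starting at indices 0, 2, 4, and 6.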

Performance metrics
In this paper, we mainly use accuracy, precision, recall, F1-score, and the confusion matrix as the performance measures. We have used micro-averaging for calculating precision, recall, and F1-score. Accuracy is defined as the proportion of correctly predicted samples to the total number of samples. A True Positive (TP) outcome is one in which the model correctly predicts the positive class. A True Negative (TN), on the other hand, is an outcome in which the model correctly predicts the negative class.

Table 7 Activity details of PAMAP2 dataset
Activity | Description | Class distribution (in %)
Nordic_walking | The subject performed outside on asphaltic terrain, using asphalt pads on the walking poles | 9.68
Ascending_stairs | The subject covered a distance of five floors while going upstairs | 6.04
Cycling | The subject was riding a real bicycle at a slow to moderate pace | 8.47
Descending_stairs | The subject covered a distance of five floors while going downstairs | 5.40
Ironing | The subject was ironing 1-2 shirts or T-shirts | 12.28
Lying | The subject was lying quietly while doing nothing; small movements were allowed | 9.90
Rope_jumping | The subjects used the method that worked best for them, which was typically the basic leap or the alternate foot jump | 2.54
Running | The subject was jogging outside at a suitable speed | 5.06
Sitting | The subject was permitted to sit in a chair in whatever position felt comfortable and to switch positions while seated | 9.54
Standing | The subject was motionless and stood still | 9.78
Vacuum_cleaning | The subject was vacuum cleaning one or two office rooms | 9.02
Walking | The subject went at a speed of 4-5 km/h in a straight line | 12.29

Table 8 Activity details of HHAR dataset
Activity | Description | Class distribution (in %)
Bike | The subject was riding a motorcycle on a free road | 16.36
Sit | The subject was lounging comfortably in a chair | 17.66
Stairsdown | The subject went down a set of steps to a lower level | 14.32
Stairsup | The subject ascended a set of steps to move up a floor | 15.80
Stand | The subject showed no action and stood stationary | 16.42
Walk | The subject was moving straight ahead at a brisk to moderate speed | 19.44

1. Precision: Precision is defined as the percentage of samples identified as positive that are actually positive. With FP denoting the False Positives (negative samples predicted as positive), precision can be calculated using Eq. (11).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$
2. Recall: Recall is the proportion of positive samples that are accurately identified out of all positive samples. With FN denoting the False Negatives (positive samples predicted as negative), we can calculate the recall using Eq. (12).
$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$
3. F1-score: F1-score is a comprehensive approximation of the model's accuracy, and it is the harmonic mean of precision and recall. It can be calculated using Eq. (13).
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$
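The micro-averaged versions of these three metrics can be illustrated with a short sketch. It pools TP, FP, and FN over all classes before applying the standard definitions; the function name `micro_scores` and the toy labels are hypothetical and are not taken from the experiments.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            tp += 1  # a true positive for the predicted class
        else:
            fp += 1  # a false positive for the predicted class
            fn += 1  # a false negative for the true class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

In the single-label multi-class setting, every misclassified sample counts once as a false positive and once as a false negative, so the micro-averaged precision, recall, and F1-score all coincide with the overall accuracy.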

Results
To thoroughly measure the performance of the proposed models, we first evaluate the method without FS and compare it with the result found using the method with FS. Table 10 summarizes the performance of our proposed model without FS. Use of the FS technique helps us reduce the number of features, which also improves the overall accuracy of our model. Table 11 provides the detailed performance metrics obtained by our model using the FS method.
From Tables 10 and 11, it can be seen that the FS technique reduces the size of the feature set by almost 1/3 of the original feature set in the majority of the cases. This reduced feature set improves the recognition accuracy by 0.71% for UCI-HAR, 1.06% for WISDM, 0.18% for MHEALTH, 0.76% for PAMAP2, and 0.88% for HHAR datasets.
The accuracy and loss plots obtained using the feature extractor model on the UCI-HAR dataset are shown in Fig. 6, while the accuracy and loss plots for WISDM, MHEALTH, PAMAP2, and HHAR datasets are shown in Figs. 7, 8, 9, 10, respectively.  Figure 11 shows the confusion matrices of the proposed method without FS and with FS side by side.

Evaluation on UCI-HAR dataset
On the UCI-HAR dataset before applying FS, out of 2947 test samples, a total of 2910 samples are correctly classified by our model. After applying the FS technique, the total number of correctly classified samples increases to 2931, and the overall accuracy improves from 98.74 to 99.45%. If we compare Fig. 11a, b, we can see that the FS technique improves the discrimination between Standing and Sitting. It also improves the recognition accuracy of the Walking activity class. Even after applying FS, there is still some confusion between Sitting and Standing. The main reason could be that the two activities are comparable from the perspective of movement sensors. Data from accelerometers and gyroscopes alone are insufficient for mining deeper discriminative information.

Evaluation on WISDM dataset
When we test our trained model on the WISDM dataset, the FS technique improves the overall recognition accuracy from 98.34 to 99.38%. Figure 12 shows the confusion matrices of our proposed method without and with FS. Comparing the confusion matrices in Fig. 12, it is clear that the reduced feature set generated by the FS technique helps the classifier to recognize each activity more accurately, as the classifier makes fewer confusions. On the 1452 new WISDM test instances, the FS technique increases the number of correctly classified samples from 1428 to 1443.

Evaluation on MHEALTH dataset
In the case of MHEALTH dataset, we have tested our proposed methods with a total of 1052 new samples. Figure 13 depicts the confusion matrices of our proposed method without FS and with FS. The confusion matrices present in Fig. 13 show that though the model without FS performed well, the model gets a little confused while recognizing complex activities like knees bending and

Evaluation on PAMAP2 dataset
The confusion matrices of the proposed technique without FS and with FS are shown side by side in Fig. 14. Prior to using FS, the model obtains 97.55% classification accuracy with a total of 8884 correctly classified samples when tested on a total of 9107 newly created activity samples.
With a total of 8952 correctly identified samples, the model achieves 98.29% classification accuracy after applying FS. Even though the FS approach lessens misclassification, the model still confuses the activity class vacuum_cleaning with other activity classes, as shown in Fig. 14a, b. The complex nature of this activity class is mainly responsible for the confusion.

Evaluation on HHAR dataset
With a total of 52,872 additional samples used to test our proposed model on the HHAR dataset, the FS approach increases the overall recognition accuracy from 96.87 to 97.72%. The confusion matrices of our proposed technique without and with FS are shown in Fig. 15. By comparing Fig. 15a, b, we can observe that the use of the FS approach increases the number of correctly identified samples from 51,219 to 51,669. Similar to the PAMAP2 dataset, the proposed model still confuses different activity groups even though the FS approach helps to reduce misclassification. The primary reason may be that the limited accelerometer data from a smartphone is not sufficient to discern these intricate actions.

Impact of FS hyper-parameters on model performance
The classification model's performance is greatly influenced by the FS hyper-parameters. This section examines the effect of key FS hyper-parameters such as population size, crossover probability, and the number of iterations on the model's overall accuracy.

Effect of population size
The population size is an important parameter that has a direct impact on the ability to find the best solution in the search space. Having a large population increases the likelihood of obtaining an optimal solution. In this paper, we have experimented with different population sizes, beginning with 5 and increasing to 30 with a fixed interval of 5. The population size vs accuracy graphs for the five datasets are shown in Fig. 16. For UCI-HAR, WISDM and MHEALTH datasets, the accuracy increases linearly and reaches the global maximum when the population size is 10. As the population size increases, the accuracy follows a zigzag pattern. For WISDM and MHEALTH, accuracy reaches the minimum when the population size is 25. At the same time, the accuracy does not vary much for PAMAP2 and HHAR datasets. Hence, for our proposed method, we have used 10 as the default population size.

Effect of crossover probability
Crossover is used as a genetic operator for producing new candidate solutions from an existing population stochastically. The crossover probability is the likelihood that a crossover will occur in a specific mating. In this experiment, we have varied the crossover probability from 0.1 to 0.9 in steps of 0.1 and observed how the accuracy changes. Figure 17 depicts the relation between the crossover probability and the accuracy. As we increase the crossover probability, the accuracy changes differently for different datasets. For the UCI-HAR and MHEALTH datasets, the accuracy initially decreases and then starts to increase as the crossover probability increases. The accuracy reaches the minimum when the crossover probability is 0.3 for UCI-HAR and 0.2 for MHEALTH. For the WISDM dataset, the accuracy first follows a zigzag pattern followed by a sharp fall and reaches the minimum when the crossover probability is 0.6. A further increase in the crossover probability increases the accuracy. The accuracy for the PAMAP2 dataset reaches its lowest point at 0.4 before beginning to rise. The accuracy declines as the crossover probability rises further. On the HHAR dataset, however, the accuracy does not change significantly when the crossover probability rises.

Effect of number of iterations
Figure 18 depicts the change in accuracy as the number of iterations of GA increases. The effect of this hyper-parameter on the accuracy, like that of the other hyper-parameters, varies depending on the dataset. As we increase the number of iterations from 5 to 30 with a uniform interval of 5, the accuracy on the UCI-HAR dataset gradually increases and reaches a maximum when the number of iterations is 30, whereas for the WISDM and MHEALTH datasets, the accuracy initially increases and then begins to decrease as the number of iterations exceeds 15. When the number of iterations exceeds 25, the accuracy begins to increase again.
The accuracy reaches its peak when the number of iterations is set to 10 for the WISDM dataset and 15 for the MHEALTH dataset. With more iterations, the accuracy for the PAMAP2 dataset grows in a zigzag pattern. In contrast, the accuracy for the HHAR dataset first rises gradually from 5 to 10 iterations. The accuracy starts to drop once the number of iterations exceeds 10, and it reaches its lowest point at 30. In our experiment, we have used 30 as the default number of iterations.

Comparison with state-of-the-art methods
To assess the efficacy and generalizability of our proposed model, we have compared it to a number of state-of-the-art models.
The comparison results for the UCI-HAR, WISDM, MHEALTH, PAMAP2, and HHAR datasets are shown in Tables 12, 13, 14, 15 and 16, respectively. The comparison is done based on the classification accuracy. The results show that our proposed model without FS has achieved higher recognition accuracy compared to most of the other HAR models. The use of the FS technique has improved recognition accuracy even more. For all five datasets, our proposed method with FS outperforms the state-of-the-art algorithms considered here for comparison.
Fig. 18 No. of iterations versus accuracy graphs for all five HAR datasets

Discussion
The overall results shown in the previous section indicate the effectiveness of our proposed models for HAR. The proposed spatial attention module assists in extracting high-quality features by focusing on the specific spatiotemporal properties that the CWT-based encoding expresses well. In this study, we also analyze how well the FS process works, and we find that, compared with the initially extracted features, only a limited number of important features are needed for recognizing human activities. In addition to speeding up the computation, the reduced feature set improves recognition accuracy by 0.18% to 1.06% across the five datasets (see Tables 10, 11). These days, HAR systems are used in a variety of domains, such as sports analysis, health monitoring, and fall detection for elderly persons. In sports analysis, the team management needs to analyze players' physical ability and various motion patterns to improve the quality of games. Similarly, in the case of fall detection, an alarm needs to be generated automatically when a fall is recognized. Hence, the more accuracy we are able to achieve, the more dependable the system will become. Although our proposed model performs well in most cases, it is to be noted that in some cases it struggles to distinguish similar activity classes, in particular activity groups that produce nearly identical sensor data patterns. For example, the model confuses "Walking" with "Upstairs" and "Downstairs." Similarly, "Standing" and "Sitting" are the most confusing activity classes, as both are static activities and generate almost similar signal patterns. If we take dataset-specific activities into account, our model finds it difficult to discriminate "Walking" from "Nordic walking," and "Vacuum_cleaning" from "Ironing" and "Upstairs," in the PAMAP2 dataset.
Similarly, on the HHAR dataset, the proposed model frequently confuses the activity classes "stairsup" and "stairsdown," as well as the activity classes "stairsdown" and "walk." Figure 15 shows that, following FS, the model misclassifies more "walk" activities as "bike," "sit," and "stand," demonstrating that the FS method does not necessarily decrease the misclassification rate for the confusing cases.

Conclusion
Sensor-based HAR deals with the prediction of specific movements or activities of a person based on sensor data. It is an interesting research problem, as it can be used to obtain the identity of a person, their personality, and their psychological state. It can also be applied to identify complex sports activities and in medical domains such as health monitoring systems. Due to its vast scope of practical applications, HAR has gained popularity among the research community in recent times, and it is important to ensure that a model meets the demanding challenges of the task. In this paper, we have proposed a model for HAR based on sensor data. We have used a Spatial Attention-aided CNN as the feature extractor and a novel FS technique for selecting the most prominent features using a modified version of the popular evolutionary algorithm GA. Our proposed method has been evaluated on five public datasets: UCI-HAR, WISDM, MHEALTH, PAMAP2, and HHAR. It can be observed that the results obtained are better than those of the state-of-the-art methods considered. However, there is still considerable scope for improving the overall performance of the method. In our future endeavors, we intend to improve the classification accuracy with a smaller number of features by exploiting some other recent meta-heuristic algorithms. We also plan to work on other human activity datasets, such as video-based or still-image-based ones, and to use pre-trained CNN models to obtain a good set of initial features.