Abstract
Sensor-based human activity recognition systems are becoming increasingly popular in diverse fields such as healthcare and security. Yet, developing such systems poses inherent challenges due to the variation and complexity of human behavior during the performance of physical activities. Recurrent neural networks, particularly long short-term memory (LSTM), have achieved promising results on numerous sequential learning problems, including sensor-based human activity recognition. However, recurrent networks inhibit parallelization because of their sequential operation and computation, which leads to slow training, higher memory consumption, and difficult convergence. A one-dimensional convolutional neural network processes input temporal sequences in independent batches, so its operations can be executed effectively in parallel. Despite that, a one-dimensional convolutional neural network is not sensitive to the order of the time steps, which is crucial for accurate and robust sensor-based human activity recognition systems. To address this problem, we propose a network architecture based on dilated causal convolution and multi-head self-attention mechanisms that entirely dispenses with recurrent architectures, making computation efficient while maintaining the ordering of the time steps. The proposed method is evaluated for human activities using smart home binary sensor data and wearable sensor data. The results of extensive experiments conducted on eight public benchmark HAR datasets show that the proposed network outperforms state-of-the-art models based on recurrent settings and temporal models.
1 Introduction
Human activity recognition (HAR) is a significant research field in ubiquitous computing for monitoring people's behavior, which plays an important role in various applications such as healthcare monitoring [1], security surveillance systems [2], and resident situation assessment [3]. In healthcare monitoring, HAR, as one of the significant applications of intelligent environments and wearable sensor technologies, has been used to monitor activities of daily living (ADL) in order to support and assist senior people, the disabled, and the cognitively impaired [4]. HAR in smart home settings equipped with ubiquitous sensors has gained increased attention in the field of ambient assisted living for improving the quality of independent living of residents within the smart home environment [5]. Smart homes with unobtrusive sensing technology for HAR have been used as a suitable solution for enhancing independent living when privacy is a concern [4, 6]. Wearable sensors are mainly embedded into mobile devices, wristwatches, clothes, glasses, belts, or shoes. Worn on the human body, they capture a person's interaction with the physical surroundings, motion, posture, and location. Wearable sensors such as gyroscopes, accelerometers, GPS, and RFID readers (used together with RFID tags) have commonly been used to record information about users' movement, e.g., walking, running, and lying down [7], since they can collect information such as body movement and position [7, 8]. The aim of HAR is to identify and recognize simple and complex human daily activities using smart home and wearable sensor data. HAR based on sensor data is a challenging task, since the data can be noisy, which leads to ambiguity in the interpretation of human activities [9]. Noise in the data can be caused by errors in the sensor connection system, which then fails to provide correct sensor activations. Supported by current developments in machine learning, HAR systems based on sensor data have progressed notably and obtained promising results in elderly-care alert systems and emergency assistance [10]. Monitoring the long-term daily routine activities of a resident in a smart home setting provides utility for determining and assessing wellness. In particular, remote monitoring of daily routine activities such as eating, sleeping, or medication intake enables caregivers to track and assess the functional health status of elderly people living alone and to support their needs [9]. Moreover, smart home and wearable sensors can provide sufficient information to properly detect postural and ambulatory activities [11, 12].
Traditional machine learning approaches such as naive Bayes, support vector machines, and hidden Markov models have made tremendous progress on HAR and obtained satisfying results [8]. However, these approaches rely entirely on hand-crafted heuristic feature extraction, which is highly data dependent and usually limited by the availability of domain experts. Hand-crafted features are time consuming to engineer and not always generalizable across application domains; moreover, hand-crafting often does not generate a sufficient number of features from a given dataset [13]. Recently, deep learning methods have been increasingly used in various applications of computer vision [14], natural language processing, speech [15], and audio recognition [16]. Deep learning methods have also been used for HAR systems based on smart home sensors and wearable sensors, and most of these systems have shown encouraging results for different purposes and on different datasets of daily routine activities. Among deep learning methods, long short-term memory (LSTM), a sequential deep learning model and a variant of the recurrent neural network (RNN), has achieved state-of-the-art performance for temporal information processing in various applications [17]. In particular, LSTM for recognizing activities of daily living (ADLs) shows state-of-the-art performance on various activity recognition benchmark datasets using wearable sensors and smart home sensors [6, 13]. ADLs are the normal daily activities that we perform for self-care, such as eating, drinking, and bathing [18]. Even though LSTM models improve the performance of HAR systems, training them is computationally expensive because of the gating mechanism that enables long-term dependencies. Moreover, LSTM models occupy more memory and cannot process the timesteps of the input data in parallel, since each timestep needs the results of the previous timestep [19, 20]. The one-dimensional convolutional neural network (1D CNN) has been used instead of LSTM to capture sequential temporal information in the input data for HAR systems [21, 22]. Although 1D CNN models train much faster than recurrent methods such as LSTM owing to the absence of recurrent connections, the results achieved with 1D CNN fall short of those shown by LSTM in HAR systems. Moreover, 1D CNN is not sensitive to timestep order, which is key in HAR systems. To address these problems, we propose a dilated causal self-attention convolution that entirely forgoes recurrent settings to improve the performance of HAR. We adopted dilated causal convolution, which is used as part of WaveNet to generate raw audio waveforms [23]. Dilated causal convolution allows long-range temporal dependencies in WaveNet, which outperforms LSTM [24]. While dilated causal convolution captures long-range temporal sequential information [25], it is crucial to focus on particular information in the feature maps it generates, using the self-attention mechanism [26]. The self-attention mechanism leveraged by the transformer [19] enables temporal models to expose context from the feature maps within the sequence.
To summarise, the main contributions of this paper are:
i. Proposing a model to accelerate training time and improve activity recognition results compared to the state of the art, using dilated causal convolution and self-attention.
ii. Causal convolutions within the proposed method are used to prevent information leakage from the future to the past.
iii. Dilated convolutions within the proposed method are used to expand the receptive field by orders of magnitude and aggregate multi-scale contextual information without considerably increasing computational cost.
iv. Multi-head self-attention within the proposed method is used to effectively expose deep semantic correlations in action sequences involving human activities.
v. Conducting extensive experiments on eight benchmark datasets of human daily activities from smart homes and wearable sensors to validate the proposed approach, showing that our method improves accuracy by 5% up to 9% and reduces training time compared with recurrent neural network-based methods.
2 Related work
Human action recognition based on sensor data is a challenging research area that has attracted much attention in machine learning. Numerous methods have been proposed to model and recognize ADLs [8, 27]. Early research modeled activity recognition using support vector machines (SVM), decision trees, k-nearest neighbors (KNN), and naive Bayes [28]. HAR systems based on these traditional approaches have achieved reasonable recognition results. However, these approaches solely process heuristically and manually extracted features of human activities. Hand-crafted features are usually limited by the availability of domain experts, and their extraction is a time-consuming task. Hence, deep learning models have been proposed in various applications to address these problems [29].
Deep learning models have shown satisfying results and reported state-of-the-art accuracy on various HAR benchmark datasets [4, 6]. Moreover, deep learning models have been jointly used to handle imbalanced data and improve the generalization of HAR [18]. Recurrent architectures such as RNN and LSTM have been firmly established as state-of-the-art methods for sequence modeling problems, including activity recognition [19]. RNNs have been employed to recognize human daily activities from smart home sensor data [30], and the results show that they are useful in modeling and recognizing human activities. Yet RNNs cannot properly process very long sequences and suffer from both gradient vanishing and exploding problems [4]. LSTM, which solves vanishing and exploding gradients through its capability to handle long-term dependencies, is often used to process temporal sequential data [31]. For instance, satisfying results have been shown by employing LSTM to recognize human activities on diverse collected sensor data [4, 6, 32, 33]. Further, different LSTM architectures have been proposed to improve the performance of HAR systems, such as stacked LSTM [34], bidirectional LSTM [33], and ensemble LSTM [35]. Moreover, CNN combined with LSTM has also been employed to further improve the performance of HAR systems [6, 28].
Dilated CNNs can be used instead of standard CNNs to increase the convolution receptive field without losing resolution [27]. Since a dilated convolution only inserts gaps between the elements of the standard convolution kernel, the dilation process requires no extra computational cost. Dilated convolution has been proposed for human activity recognition from wearable sensor data [36, 37], and has also been used for voice activity detection and audio source separation [38, 39].
A weakly supervised method combining CNN and LSTM with self-attention layers, trained through reinforcement learning on wearable sensor data, has been proposed for human activity recognition [40]. However, this method requires large computing resources and has high complexity because it relies on reinforcement learning. Moreover, a convolutional LSTM with a self-attention mechanism has been proposed to capture the spatio-temporal context in human activities and to focus on significant timesteps in temporal wearable sensor data [13]. References [41, 42] employed a self-attention mechanism to improve the performance of HAR systems based on wearable sensors, with results improved over the state of the art. Betancourt et al. proposed an LSTM model based on the attention mechanism, but the model was only tested on two wearable sensor datasets. The limitation of these methods is, firstly, that their recurrent settings slow down the learning process; secondly, they are only applied to HAR systems based on wearable sensors, and the learning time is not considered. Reference [43] proposed a DeepConvLSTM model based on the attention mechanism for human activity recognition on smart home datasets, treating each time step in the sequential data as a word and a specified time window as the sentence. However, the method was evaluated only on three smart home datasets and only compared to bidirectional LSTM. Moreover, due to the sequential operation of its recurrent settings, parallelization is limited, which makes this model computationally expensive. Gao et al. proposed a CNN with dual attention and without a recurrent setting for HAR systems; however, the method was only tested on wearable sensor datasets [44].
To address these limitations, improve performance, and reduce the learning time of HAR systems based on smart home and wearable sensor data, we propose dilated causal convolution with multi-head self-attention.
3 Background
In this section, we describe the temporal models, i.e., LSTM, 1D CNN, the hybrid 1D CNN + LSTM model, and bidirectional LSTM.
3.1 Temporal modeling via LSTM
LSTM is a variant of the RNN designed to learn from temporal sequential data. LSTM can handle and learn long-term dependencies, which alleviates the vanishing and exploding gradient problems [31]. LSTM as a temporal model has been used to recognize ADLs from sensor data [4, 6]. LSTM processes temporal data using a forget gate, an input gate, and an output gate to add or remove information from the cell state throughout the processing of the sequence. The cell state is the main component of an LSTM, carrying relevant information from earlier timesteps to later timesteps. Figure 1 shows the connection of the gates with the cell state in a single LSTM cell. The gates learn to keep relevant information and forget irrelevant information during training in order to update the cell state. Hence, each LSTM cell works as a memory whose remove, read, and write operations are controlled by the forget, output, and input gates, respectively. The forget gate processes both the previous output \(h_{t-1}\) and the new time step \(x_{t}\) using a sigmoid activation function to indicate relevant or irrelevant information: the information is kept when the sigmoid output is 1 and deleted when it is 0. Equation (1) shows how the forget gate within a single LSTM cell is computed. The next step consists of two parts that determine the new information kept in the cell state. The first part is the input gate, which indicates what new information from the current input (\(x_{t}, h_{t-1}\)) is appended to the cell state. The second part is the tanh activation function, which renders \(\tilde{C}_{t}\), a vector of new candidate values that can be added to the cell state. Equations (2) and (3) show how the input gate and the new candidate values are computed, respectively. A new cell state \(C_{t}\) is generated as the sum of the product of these two parts and the product of the forget gate with the previous cell state \(C_{t-1}\). Equation (4) shows how the new cell state is computed. The multiplication of the previous cell state with the forget gate deletes the information that was decided to be forgotten earlier. The new candidate values are then scaled by how much the cell state is updated, i.e., \(i_t \times \tilde{C}_{t}\). Finally, the sigmoid activation function processes both the previous hidden state \(h_{t-1}\) and the current input timestep \(x_{t}\) to produce the output gate.
The output gate thus filters information using two different activation functions and also specifies the next hidden state. The tanh activation function processes the newly updated cell state, and its output is multiplied by the output of the sigmoid function to render the next hidden state. The updated cell state and the newly generated hidden state pass information to the next timestep. Equations (5) and (6) show the calculation of the output gate and the hidden state, respectively.
where \(x\) is the input data, \(\sigma\) is the sigmoid activation function, \(\tanh\) is the hyperbolic tangent activation function, and \(W\) denotes the weight matrices.
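The gate computations referenced above (Eqs. (1)–(6)) follow the standard LSTM formulation of [31]; written consistently with the notation in the text, and with bias vectors \(b\) included as is conventional, they are:

$$\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) &\text{(1)}\\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) &\text{(2)}\\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) &\text{(3)}\\
C_t &= f_t \times C_{t-1} + i_t \times \tilde{C}_t &\text{(4)}\\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) &\text{(5)}\\
h_t &= o_t \times \tanh(C_t) &\text{(6)}
\end{aligned}$$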
LSTM has been used in HAR applications and achieved promising results [4, 6, 9, 45]. Hence, in this paper, LSTM is used as a temporal baseline for comparison with the proposed method. Two LSTM layers are stacked with a flatten layer. The outputs of the flatten layer are then passed into a fully connected layer with a ReLU activation function, followed by a softmax layer. Figure 2 shows the architecture of the LSTM model.
A fast LSTM implementation backed by cuDNN (CuDNNLSTM) [46] is also used in this study, with the same architecture as the LSTM model. CuDNNLSTM is a version of LSTM that uses the cuDNN library and can only run on a GPU, accelerating training and inference time.
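As an illustration, a minimal Keras sketch of this baseline is given below; the unit counts (64) and dense width (128) are assumptions for illustration, as the text does not report them.

```python
from tensorflow.keras import layers, models

# Sketch of the stacked-LSTM baseline: two LSTM layers, a flatten layer,
# a dense layer with ReLU, and a softmax output. Unit counts are assumed.
def build_lstm(timesteps, n_features, n_classes):
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.LSTM(64, return_sequences=True),  # first LSTM layer
        layers.LSTM(64, return_sequences=True),  # second LSTM layer
        layers.Flatten(),                        # flatten layer
        layers.Dense(128, activation="relu"),    # fully connected + ReLU
        layers.Dense(n_classes, activation="softmax"),
    ])
```

The CuDNNLSTM variant keeps the same architecture, with the cuDNN-backed LSTM layer swapped in.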
3.2 Temporal modeling via 1D CNN
1D CNN has been widely used in HAR systems and has shown satisfying results [6, 21]. 1D CNN can properly extract features from raw data and capture local dependencies that are likely to be correlated. 1D CNN can also learn hierarchical representations of human activities, which improves HAR systems [45]. Compared to LSTM, 1D CNN has obtained competitive results in several applications, such as activity recognition, machine translation, and audio generation, with much faster training. However, 1D CNN is not sensitive to the order of time steps, which is significant in activity recognition [8]. Hence, 1D CNN alone is not an optimal replacement for LSTM. In this paper, 1D CNN is employed and its results are reported. The 1D CNN model is designed by stacking two convolutional layers, each with 64 filters. The kernel size is 3, which indicates the length of the 1D convolution window, with a stride of 1. A max-pooling layer with a window size of 2 is applied after the convolution layers to down-sample the feature maps. The feature maps are flattened and processed by a fully connected layer, i.e., a dense layer with a ReLU activation function, followed by a softmax layer. Figure 3 shows the architecture of the 1D CNN.
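A minimal sketch of this 1D CNN in Keras, following the layer sizes stated above (the dense width of 128 is an assumption):

```python
from tensorflow.keras import layers, models

# Sketch of the 1D CNN baseline: two Conv1D layers (64 filters, kernel 3,
# stride 1), max-pooling of size 2, flatten, dense + ReLU, softmax.
def build_cnn(timesteps, n_features, n_classes):
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.Conv1D(64, kernel_size=3, strides=1, activation="relu"),
        layers.Conv1D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # assumed width
        layers.Dense(n_classes, activation="softmax"),
    ])
```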
3.3 Temporal modeling via Hybrid: 1D CNN + LSTM
The hybrid model, stacking 1D CNN and LSTM sequentially, has been used to improve the performance of HAR systems [6, 18]. In this study, the hybrid model stacks one layer each of 1D CNN and LSTM to recognize human activities from smart home data. Figure 4 shows the architecture of the hybrid model. The input data are first fed into the 1D CNN layer to extract features before the LSTM layer, which supports sequence recognition. The input sensor sub-sequences are processed independently by the 1D CNN, so timestep order is not considered. The 1D CNN feature maps are down-sampled by a max-pooling layer with a window size of 2 before the LSTM layer. The feature maps are processed by the LSTM, flattened, and followed by fully connected layers, i.e., a dense layer with a ReLU activation function and a softmax layer. Furthermore, 1D CNN layers in the hybrid model are often applied when recurrent models cannot realistically handle long-term dependencies in the input sequence. In such cases, the 1D CNN can shorten the long-term dependencies through down-sampling while extracting higher-level features, which the recurrent model can then process more effectively [47]. However, order sensitivity is not considered in the features extracted by the 1D CNN. Hence, the hybrid of 1D CNN and LSTM is not the most satisfactory solution for improving activity recognition [18].
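A minimal sketch of the hybrid baseline under the same assumptions (filter and unit counts are illustrative):

```python
from tensorflow.keras import layers, models

# Sketch of the hybrid baseline: Conv1D feature extraction, max-pooling
# of size 2, an LSTM layer, then flatten, dense + ReLU, and softmax.
def build_hybrid(timesteps, n_features, n_classes):
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),        # shortens dependencies
        layers.LSTM(64, return_sequences=True),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```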
3.4 Temporal modeling via Bidirectional LSTM
Bidirectional LSTM trains on input data in forward and backward directions, using the previous and subsequent information of a specific time step in two separate recurrent layers [48]. Figure 5 shows a bidirectional LSTM, where the inputs of the backward states are not connected to the outputs of the forward states. Including future information in addition to past information in a bidirectional LSTM appears at first sight to violate causality [49]. Although bidirectional LSTM has been successfully applied to HAR with satisfying results, it is expensive to train, since each layer contains a double recurrent setting [33]. Bidirectional LSTM is used in this study by stacking two bidirectional layers of forward and backward LSTMs. The outputs of these layers are flattened and then fed to a fully connected layer, i.e., a dense layer with a ReLU activation function and a softmax layer.
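A corresponding sketch of the bidirectional baseline (unit counts again assumed):

```python
from tensorflow.keras import layers, models

# Sketch of the bidirectional baseline: two Bidirectional(LSTM) layers,
# flatten, dense + ReLU, softmax. Each layer runs forward and backward.
def build_bilstm(timesteps, n_features, n_classes):
    return models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```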
4 Proposed method
In this section, we describe the proposed method, based on dilated causal convolution and a self-attention mechanism, for HAR on smart home and wearable sensor data. We aim to design a more efficient convolutional network model that surpasses recurrent architectures in both recognition score and training time. The distinguishing characteristics of our proposed method are: (1) the model stops information leakage from future to past using causal convolution; (2) the model can handle temporal sequential data of any length and map it to an output series of the same length; (3) the model can simultaneously focus on different important time steps of the input sequence using the multi-head self-attention mechanism. The details of the proposed model are described in the following subsections.
4.1 Sequence modeling
Before describing the details of the proposed model, we formalize the sequence modeling task for human activities. Input human activity sequences \(x_0,\ldots, x_T\) are fed into a model to predict the corresponding activity outputs \(y_0,\ldots,y_T\) at each time. Predicting the activity output \(y_t\) for a particular time t should depend only on the observed time steps before time t: \(x_0,\ldots, x_t\) [20]. Hence, sequence modeling is a function \(f:x_{0},\ldots,x_{T} \rightarrow y_{0},\ldots,y_{T}\) (where x and y are the input and output, respectively) that renders the mapping shown in Eq. (7):

$$\hat{y}_0,\ldots,\hat{y}_T = f(x_0,\ldots,x_T) \qquad (7)$$
The model f is expected to minimize a loss \(L(y_0,\ldots,y_T,\, f(x_0,\ldots, x_T))\) between the actual labels and the predicted outputs, where the input sequences and outputs are drawn according to some distribution. This formalism cannot directly be used for domains such as sequence-to-sequence prediction or machine translation, since those domains require the entire input sequence (past and future states) [20]. However, the setting can be extended to such domains.
4.2 Dilated causal convolutions
Causal convolutions are used in the proposed method so that the output at time t is predicted based only on convolutions of the sequence inputs from time t and earlier in the previous layers [20]. Causal convolutions also preserve the ordering of sequential input patterns. However, causal convolutions require very large filters or many hidden layers to expand the receptive field [23]. To maximize the receptive field and aggregate multi-scale contextual information without considerably increasing computational cost, dilated convolutions are integrated into the proposed method. Dilated convolutions enable the model to increase the receptive field exponentially using a few layers while keeping computation efficient [25]. The dilated causal convolution DCC for a one-dimensional input sequence \(x \in R^{n}\) with a filter \(f:\{0,\ldots,k-1\} \rightarrow R\) on element s of the sequence is defined as:

$$DCC(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \qquad (8)$$

where d is the dilation factor, k is the filter size, and \(s - d \cdot i\) points in the past direction. The dilation factor d is increased exponentially with the depth of the model, i.e., \(d = 2^{l}\) at layer l. Formally, we increase the dilation factor d by a factor of 2 in each layer \(l=1,\ldots,L\), where L is the number of dilated causal convolution layers in the proposed model, as shown in Eq. (9):

$$d = 2^{l}, \qquad l = 1,\ldots,L \qquad (9)$$
In addition, dilated convolution reduces to the standard convolution when d = 1. Figure 6 shows the dilated causal convolutions in the proposed model for dilations 1, 2, and 4. Dilated convolutions with different dilation factors allow a filter to operate at different ranges: the filter convolves input values over an area larger than its own length by skipping input values with a certain step, namely the dilation factor. Dilated convolution thus effectively enables the model to aggregate multi-scale contextual information with fewer layers for the same receptive field compared to a standard convolution [25]. For example, with kernel size k = 3, three layers with dilations 1, 2, and 4 cover a receptive field of 15 time steps, whereas three standard convolution layers cover only 7. Therefore, stacking dilated causal convolutions reduces the number of learnable parameters, yielding more efficient training and a lighter-weight model.
4.3 Self-attention network
The self-attention mechanism is a robust technique for computing the correlation and weighted combination between all the time steps in an input sequence [19]. After applying dilated causal convolution to render aggregated multi-scale contextual information, multi-head self-attention is used to enable the model to focus on important and relevant time steps more than on insignificant ones in the sequential feature maps during recognition. Hence, the attention mechanism aims to learn the most important time steps in the sequence feature maps, which aids more accurate recognition. Moreover, self-attention determines relative weights for each time step in the sequence feature map by considering its similarity to all the other time steps within the sequence. The representation of each time step is then transformed according to these relative weights, incorporating relevant and important information from other time steps in proportion to their importance. The self-attention mechanism has three learned linear transformations: query Q, key K, and value V, where Q and K have the same vector dimension \(d_{k}\), and V and the outputs have the same dimension \(d_{v}\) [19]. To obtain attention scores, dot-product attention is applied between each query, regarded as the transformed representation of a specific time step, and the key matrix of every other time step. The softmax function is then applied to the scaled dot product of the queries and keys to generate the attention scores. Lastly, the attention scores are used to produce a weighted representation of the value matrix for each time step in the sequence. Equation (10) shows that multi-head self-attention is implemented entirely as matrix multiplication operations.
The model computes the attention numerous times in parallel (multi-head) to capture distinct correlation information from the input sequence. Hence, \(h_j\) in Eq. (10) is the output of attention head j, and sa refers to self-attention. Distinct parameters are used in Eq. (10) for computing the key, query, and value of each of the n attention heads. The outputs of the distinct attention heads are concatenated and transformed to the dimension of the input sequence using the learned parameter \(W_{o}\), as defined in Eq. (11). The outputs of the multi-head self-attention (\(M_{ha}\)) are fed into fully connected layers, i.e., a dense layer with a ReLU activation function and a softmax layer.
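A reconstruction of Eqs. (10) and (11) consistent with the scaled dot-product attention of [19] and the notation above:

$$h_j = \mathrm{sa}(Q_j, K_j, V_j) = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}}\right) V_j \qquad (10)$$

$$M_{ha} = \mathrm{Concat}(h_1, \ldots, h_n)\, W_o \qquad (11)$$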
The proposed method, based on dilated causal convolution, forgoes recurrent architectures to accelerate training and inference time. Causal convolution maintains the ordering of the data, which is crucial for HAR systems. Dilated convolution increases the receptive field and produces feature maps with multi-scale receptive fields using different dilation rates in the convolution layers. Dilated convolution preserves the resolution of the data, since the layers are dilated instead of pooled. The multi-head self-attention mechanism is employed to capture informative timesteps in the feature maps and improve recognition. Together, dilated causal convolution and self-attention make the proposed method computationally efficient and improve its result scores. Algorithm 1, together with Fig. 6, provides more information on how the layers of the proposed method are stacked.
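To make the stacking concrete, the following is a minimal Keras sketch of the proposed architecture under stated assumptions: the number of layers, filter counts, number of heads, and key dimension are illustrative choices, not the exact hyper-parameters of the paper.

```python
from tensorflow.keras import layers, models

# Sketch of the proposed model: a stack of dilated causal Conv1D layers
# (padding='causal' blocks future-to-past leakage; the dilation factor
# doubles per layer), layer normalization after each convolution,
# multi-head self-attention over the feature maps, then dense + ReLU
# and a softmax output.
def build_proposed(timesteps, n_features, n_classes, n_layers=3):
    inputs = layers.Input(shape=(timesteps, n_features))
    x = inputs
    for l in range(n_layers):
        x = layers.Conv1D(64, kernel_size=3, padding="causal",
                          dilation_rate=2 ** l,      # dilations 1, 2, 4
                          activation="relu")(x)
        x = layers.LayerNormalization()(x)
    # Multi-head self-attention: query = key = value = feature maps.
    x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```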

5 Experimental setup and evaluation
In this section, we present the details of the experimental setup and evaluation, including the datasets used, the evaluation method, and the results.
5.1 Datasets and preprocessing
5.1.1 Ordonez smart home datasets
Human activity datasets collected in five smart homes using embedded binary sensors are used in this study to evaluate the proposed method. Ordóñez homes A and B [50] are two real-world smart homes that record human daily physical activities using non-intrusive binary sensors. Different binary sensors detect different human activities: for example, passive infrared (PIR) sensors detect human movement in a limited area; pressure sensors on beds and couches detect the user's presence; reed switches on cupboards and doors measure open or closed status; and a float sensor in the bathroom measures whether the toilet is flushed. Table 1 shows details about the residents, the sensors, and the number of activities in Ordóñez smart homes A and B. In Ordóñez smart home A, twelve binary sensors recorded nine human activities over fourteen days, covering a period of 20,358 min. In Ordóñez smart home B, twelve binary sensors recorded ten human activities over twenty-two days, covering a period of 30,469 min. The common activities of Ordóñez homes A and B are Breakfast, Lunch, Sleeping, Grooming, Leaving, Idle, Snack, Showering, Spare Time/TV, and Toileting. In addition to these activities, Ordóñez home B has the activity Dinner.
5.1.2 Kasteren smart home datasets
The Kasteren home A, B, and C datasets were recorded in three other smart homes, also using non-intrusive embedded binary sensors [51]. Table 1 also shows the details of these three datasets regarding the residents and the numbers of sensors and activities. In Kasteren home A, fourteen binary sensors were used to record ten human activities over 25 days, covering a period of 40,005 min. In Kasteren home B, twenty-three binary sensors were used to record thirteen human activities over 14 days, covering a period of 38,900 min. In Kasteren home C, twenty-one binary sensors were used to record sixteen human activities over nineteen days, covering a period of 25,486 min.
5.1.3 Wearable smartphone (inertial sensors) dataset
The dataset for human activity recognition was built by recording the activities of daily living (ADL) of 30 study participants carrying a waist-mounted smartphone with embedded inertial sensors [52, 53]. The participants, within an age bracket of 19–48 years, performed six daily activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs, and walking upstairs). The participants wore a smartphone (Samsung Galaxy S II) on the waist while performing the activities. The embedded accelerometer and gyroscope captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The activities were video-recorded to manually annotate the dataset. The dataset is randomly split into a training set with 70% of the participants' data and a testing set with the remaining 30%. The activity labels are: (i) Walking; (ii) Walking_upstairs; (iii) Walking_downstairs; (iv) Sitting; (v) Standing; (vi) Laying. Table 4 shows the frequency distribution of activities in the training and testing sets. The accelerometer and gyroscope signals were preprocessed using noise filters and then sampled in fixed-width sliding windows of 2.56 s with 50% overlap.
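As a worked example of this windowing: at 50 Hz, a 2.56 s window spans 128 samples, and 50% overlap corresponds to a 64-sample step. A sketch, where `signal` is a hypothetical (n_samples, n_channels) array:

```python
import numpy as np

# Fixed-width sliding windows: 50 Hz * 2.56 s = 128 samples per window,
# and 50% overlap gives a step of 64 samples between window starts.
def sliding_windows(signal, rate_hz=50, window_s=2.56, overlap=0.5):
    win = int(rate_hz * window_s)        # 128 samples
    step = int(win * (1 - overlap))      # 64 samples
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])
```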
5.1.4 Wearable wireless identification and sensing data
Fourteen elderly volunteers, aged 78 ± 4.9 years, wore a Wearable Wireless Identification and Sensing Platform (W\(^2\)ISP) tag [54,55,56]. The W\(^2\)ISP was placed on top of their garments at the sternum level to capture trunk movements and recognize the activities: (i) sit on bed; (ii) sit on chair; (iii) lying; (iv) ambulating. The activities were performed in two clinical room configurations (Roomset1 and Roomset2) for the ambulatory monitoring of older patients. Table 5 shows the frequency distribution of activities in both datasets: Roomset1 and Roomset2.
5.1.5 Preprocessing smart home data
The timeline of daily human activities for all the smart home data is segmented into time slots using a window size of \(\Delta\)t = 1 min. The raw smart home sensor data provide the start and end times of sensor activations, as well as the type (such as pressure sensor), location (such as bed), and place (such as bedroom) of the sensors. To generate the input datasets from the raw sensor data, multiple and incremental fuzzy temporal windows (FTW) are used. FTW is an established technique for segmenting sensor data and preparing input datasets [4, 6, 9, 18, 57]. FTW can capture sensor signals from human activities of both long and short duration, such as sleeping or snacking, from raw sensor data [4, 57], which increases the recognition results of temporal models. Furthermore, temporal models, i.e., LSTM and 1D CNN, achieve better recognition results for activity recognition when the input datasets are generated by FTW rather than by other methods such as Equally Sized Temporal Windows (ESTW), Raw and Last Activation (RLA), and Raw and Last Next Activation (RLNA) [4, 6].
5.2 Models hyper-parameters
In this section, the hyper-parameters of all the models in this study are given. The following parameter ranges were explored in a series of trial-and-error experiments to find optimal values:

- Learning rates from 0.0001 to 0.01.
- Batch sizes of 32, 64, 128, and 256.
- Dropout rates of 20%, 30%, 40%, and 50%.
- Number of epochs from 1 to 100.
Based on this series of trial-and-error experiments, we observed that a learning rate of 0.001, a batch size of 64, a 20% dropout rate, and 50 epochs are the most appropriate hyper-parameters for the models to converge. To find a proper number of epochs, early stopping is used as a regularization technique to terminate training when the validation error starts increasing; hence, training was stopped at the minimum of the validation loss. To find a proper learning rate over the explored range, the other hyper-parameters were fixed; this process was repeated until all the hyper-parameters were set. A large batch size can make training faster but requires more memory [6]. Conversely, a smaller batch size requires less memory with slightly slower training, but can cause the model to converge prematurely; hence, it is mostly a trade-off [6]. The 20% dropout rate is used as a regularization technique to prevent the models from overfitting [58]. The dropout technique ignores randomly selected neurons during training: the ignored neurons are temporarily disconnected on the forward pass, so their weights are not updated in the backward pass. Layer normalization, which normalizes the input across the features, is used after each dilated causal convolution [59]; it can reduce training time, as empirically shown in [59].
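For concreteness, a sketch of this training configuration in Keras is shown below; `model`, the data arrays, and the early-stopping patience are assumptions, since the text does not specify them (the 20% dropout is applied inside the model itself).

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Selected hyper-parameters: learning rate 0.001, batch size 64,
# up to 50 epochs, with early stopping at the minimum validation loss.
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = EarlyStopping(monitor="val_loss",
                           patience=5,  # assumed patience value
                           restore_best_weights=True)
model.fit(x_train, y_train, batch_size=64, epochs=50,
          validation_data=(x_val, y_val), callbacks=[early_stop])
```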
5.3 Measure evaluation
The F1-score is used as the metric to compare the performance of the proposed approach with the other temporal methods. Accuracy is often used to evaluate the performance of classifiers; however, in the presence of imbalanced classes, accuracy is not an appropriate measure for classification, because under-represented classes have very little impact on it compared to the prevalent classes [6]. Hence, the F1-score is employed to measure and evaluate all the temporal models, since it is the weighted average of recall and precision and provides more insight into the functionality of the temporal models than the accuracy metric [4]. The F1-score is calculated by Eqs. (12) and (13):

$$\text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN} \qquad (12)$$

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (13)$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. Moreover, the F1-score is widely used in activity recognition [4, 6, 18, 35].
5.4 Results and discussion
In this section, the experimental results of the proposed dilated causal convolution with self-attention model for HAR are presented and discussed. The results achieved for each activity by multiple models are presented and compared with those of the proposed method. Besides, the training time of all the temporal models is reported for easy comparison with the training time of the proposed method. The results of the proposed method are compared with the temporal models: 1D CNN, LSTM, hybrid 1D CNN + LSTM, CuDNNLSTM, and bidirectional LSTM. The proposed method improved the results of HAR by 5% up to 7% compared with these models and reduced the training time. Figure 7 shows the results of the proposed method compared to the state-of-the-art techniques on the eight datasets. The results indicate that the proposed method outperformed the temporal and recurrent-based models for human activity recognition on all the datasets.
5.4.1 Results from Ordóñez datasets
Tables 6 and 7 show the F1-score and training time (seconds) of the proposed method against the temporal models on the Ordóñez smart home A and B datasets. The results show that the proposed method outperforms the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and bidirectional LSTM). The training time in seconds is shown in Tables 6 and 7 for all the employed methods. The training time of the proposed method is much lower than that of LSTM, hybrid 1D CNN + LSTM, and bidirectional LSTM, and only slightly higher than that of 1D CNN. This indicates that the proposed method significantly reduced the training time and improved HAR on the Ordóñez smart homes. Importantly, the proposed method accelerated training even compared to the CuDNNLSTM model, a fast LSTM version backed by the cuDNN library.
5.4.2 Results from Kasteren datasets
Tables 8, 9, and 10 show the results of the proposed method compared to the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and bidirectional LSTM) on Kasteren smart homes A, B, and C, respectively. The F1-scores show that the proposed method improved HAR on the Kasteren datasets, both for each individual activity and on average. The proposed method considerably reduced the training time compared with the recurrent neural network-based methods, with moderately higher training time than 1D CNN. The results indicate that dilated causal convolution with self-attention can effectively improve the performance of HAR systems and reduce training time.
5.4.3 Results from wearable sensors datasets
Figures 8, 9, and 10 show the results of the experiments based on wearable sensors for HAR. The results of the proposed method are compared to those of the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and bidirectional LSTM). Tables 11, 12, and 13 show the detailed results and the training time of all the models; Table 11 specifically shows the results obtained on the smartphone sensor data. The results on the wearable sensor data demonstrate the outstanding performance of our proposed method compared to the state-of-the-art techniques. The training time of the proposed method is shorter than that of all the temporal and recurrent models except 1D CNN. The proposed method improved the performance for each activity, as well as the average performance over all activities, compared to the recurrent and temporal models on all the wearable sensor data.
5.4.4 Proposed method compared to the DeepConvLSTM + attention
The results of our proposed method are compared with those achieved by DeepConvLSTM + Attention [13] on all the datasets, along with the training times. Since DeepConvLSTM + Attention combines a 2D CNN and LSTM with an attention mechanism, it requires more time to process the input data than our proposed method. Moreover, compared to DeepConvLSTM + Attention, our proposed method achieved better result scores with much faster training on all the datasets. For instance, our proposed method achieved F1-scores of 90.78 and 87.51 on the Ordóñez smart home A and B datasets, respectively, while DeepConvLSTM + Attention achieved F1-scores of 84.97 and 84.51 on the same datasets, with higher training times.
Our proposed method dispenses with the recurrent setting entirely to accelerate training and boost the performance of HAR systems. Dilated convolution aggregates multi-scale contextual information to render informative feature maps. Causal convolution ensures the model cannot violate the ordering of the sequential temporal input data. The proposed method can focus on the important timesteps using the attention mechanism to improve the recognition process. The proposed method improved the results for each activity, in addition to the average results over all activities, on all the datasets.
5.4.5 Ablation study of the proposed method
An ablation study is conducted to show the performance of the proposed method without dilated convolution, without causal convolution, and without the attention mechanism. Table 14 shows the results of these ablated models, the results of the proposed method without all three techniques, and the results of the full proposed method on all the datasets. The results show how the proposed method is affected by each of the dilated convolution, causal convolution, and attention mechanism. For example, the proposed method achieved an F1-score of 90.78, while without dilated convolution it achieved 84.93, without attention 83.24, and without causal convolution 85.41; without all three techniques, it achieved 80.54. The proposed method without the attention mechanism achieved the lowest scores on all the datasets compared to the versions without dilated or causal convolutions. Hence, the results indicate that the attention mechanism contributes more to the proposed method than the dilated and causal convolutions. Besides the ablation study, the proposed method is compared with the DeepConvLSTM + Attention method and with several temporal and recurrent models: LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and bidirectional LSTM.
6 Conclusion
This study proposes dilated causal convolution with multi-head self-attention to accelerate training time and improve the performance of HAR systems from smart home and wearable sensor data. Thorough experiments are conducted on eight real-world smart home and wearable datasets to evaluate the proposed method against the temporal and recurrent-based architecture methods. The results of the experiments show that the proposed method significantly improved the accuracy of HAR and reduced the training time compared to the state-of-the-art techniques. The proposed method improved the performance of HAR systems by up to 7% compared with LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM using wearable sensors and smart home sensors data.
The self-attention operation scales quadratically with the input sequence length, which can increase training time, since it adds more weight parameters to the model. To address this limitation, our future work will investigate a lightweight multi-head self-attention mechanism for human activity recognition, to further accelerate training and enhance HAR performance.
References
Ogbuabor G, La R (2018) Human activity recognition for healthcare using smartphones. In: Proceedings of the 2018 10th international conference on machine learning and computing, pp 41–46
Niu W, Long J, Han D, Wang Y-F (2004) Human activity detection and recognition for video surveillance. In: 2004 IEEE international conference on multimedia and expo (ICME) (IEEE Cat. No. 04TH8763), vol 1, pp 719–722. IEEE
Lee D, Helal S (2013) From activity recognition to situation recognition. In: International conference on smart homes and health telematics, pp 245–251. Springer
Medina-Quero J, Zhang S, Nugent C, Espinilla M (2018) Ensemble classifier of long short-term memory with fuzzy temporal windows on binary sensors for activity recognition. Expert Syst Appl 114:441–453
Hamad R, Jarpe E, Lundstrom J (2018) Stability analysis of the T-SNE algorithm for human activity pattern data. In: 2018 IEEE international conference on systems, man, and cybernetics (SMC), pp 1839–1845. IEEE
Hamad RA, Salguero AG, Bouguelia M, Espinilla M, Quero JM (2019) Efficient activity recognition in smart homes using delayed fuzzy temporal windows on binary sensors. IEEE J Biomed Health Inform
Wang W, Liu AX, Shahzad M, Ling K, Lu S (2015) Understanding and modeling of wifi signal based human activity recognition. In: Proceedings of the 21st annual international conference on mobile computing and networking, pp 65–76. ACM
Wang J, Chen Y, Hao S, Peng X, Hu L (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recogn Lett 119:3–11
Hamad RA, Kimura M, Lundström J (2020) Efficacy of imbalanced data handling methods on deep learning for smart homes environments. SN Comput Sci 1(4):1–10
Fatima I, Fahim M, Lee Y-K, Lee S (2013) Analysis and effects of smart home dataset characteristics for daily life activity recognition. J Supercomput 66(2):760–780
Cao L, Wang Y, Zhang B, Jin Q, Vasilakos AV (2018) GCHAR: an efficient group-based context-aware human activity recognition on smartphone. J Parallel Distrib Comput 118:67–80
Nweke HF, Teh YW, Al-Garadi MAA (2018) Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: state of the art and research challenges. Expert Syst Appl
Singh SP, Lay-Ekuakille A, Gangwar D, Sharma MK, Gupta S (2020) Deep CONVLSTM with self-attention for human activity decoding using wearables. arXiv preprint arXiv:2005.00698
Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th annual international conference on machine learning, pp 609–616
Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems, pp 1096–1104
Zhao R, Wang J, Yan R, Mao K (2016) Machine health monitoring with LSTM networks. In: 2016 10th international conference on sensing technology (ICST), pp 1–6. IEEE
Hamad RA, Yang L, Woo WL, Wei B (2020) Joint learning of temporal models to handle imbalanced data for human activity recognition. Appl Sci 10(15):5293
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271
Singh D, Merdivan E, Hanke S, Kropf J, Geist M, Holzinger A (2017) Convolutional and recurrent neural networks for activity recognition in smart environment. In: Towards integrative machine learning and knowledge extraction, pp 194–205. Springer
Lee S-M, Yoon SM, Cho H (2017) Human activity recognition from accelerometer data using convolutional neural network. In: 2017 IEEE international conference on big data and smart computing (bigcomp), pp 131–134. IEEE
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499
Pu J, Zhou W, Li H (2018) Dilated convolutional network with iterative optimization for continuous sign language recognition. In: IJCAI, vol 3, p 7
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Lin Z, Feng M, dos Santos CN, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130
Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2020) Deep learning for sensor-based human activity recognition: overview, challenges and opportunities. arXiv preprint arXiv:2001.07416
Xia K, Huang J, Wang H (2020) LSTM-CNN architecture for human activity recognition. IEEE Access 8:56855–56866
Bengio Y (2013) Deep learning of representations: Looking forward. In: International conference on statistical language and speech processing, pp 1–37. Springer
Fang H, Si H, Chen L (2013) Recurrent neural network for human activity recognition in smart home. In: Proceedings of 2013 Chinese intelligent automation conference, pp 341–348. Springer
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Inoue M, Inoue S, Nishida T (2018) Deep recurrent neural network for mobile human activity recognition with high throughput. Artif Life Robot 23(2):173–185
Hernández F, Suárez LF, Villamizar J, Altuve M (2019) Human activity recognition on smartphones using a bidirectional LSTM network. In: 2019 XXII symposium on image, signal processing and artificial vision (STSIVA), pp 1–5. IEEE
Ullah M, Ullah H, Khan SD, Cheikh FA (2019) Stacked LSTM network for human activity recognition using smartphone data. In: 2019 8th European workshop on visual information processing (EUVIP), pp 175–180. IEEE
Guan Y, Plötz T (2017) Ensembles of deep LSTM learners for activity recognition using wearables. Proc ACM Interact Mob Wearable Ubiquitous Technol 1(2):1–28
Zeng Y, Xiao Z, Hung K-W, Lui S (2021) Real-time video super resolution network using recurrent multi-branch dilated convolutions. Signal Process Image Commun 93:116167
Lin Y, Wu J (2020) A novel multichannel dilated convolution neural network for human activity recognition. Math Probl Eng
Chang S-Y, Li B, Simko G, Sainath TN, Tripathi A, van den Oord A, Vinyals O (2018) Temporal modeling using dilated convolution and gating for voice-activity-detection. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5549–5553. IEEE
Heo W-H, Kim H, Kwon O-W (2021) Integrating dilated convolution into DenseLSTM for audio source separation. Appl Sci 11(2):789
He J, Zhang Q, Wang L, Pei L (2018) Weakly supervised human activity recognition from wearable sensors by recurrent attention learning. IEEE Sens J 19(6):2287–2297
Mahmud S, Tonmoy M, Bhaumik KK, Rahman AKM, Amin MA, Shoyaib M, Asif Hossain KM, Ali AA (2020) Human activity recognition from wearable sensor data using self-attention. arXiv preprint arXiv:2003.09018
Betancourt C, Chen W-H, Kuan C-W (2020) Self-attention networks for human activity recognition using wearable devices. In: 2020 IEEE international conference on systems, man, and cybernetics (SMC), pp 1194–1199. IEEE
Murahari VS, Plötz T (2018) On attention models for human activity recognition. In: Proceedings of the 2018 ACM international symposium on wearable computers, pp 100–103
Gao W, Zhang L, Teng Q, Wu H, Min F, He J (2020) Danhar: dual attention network for multimodal human activity recognition using wearable sensors. arXiv preprint arXiv:2006.14435
Hammerla NY, Halloran S, Ploetz T (2016) Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880
Appleyard J, Kocisky T, Blunsom P (2016) Optimizing performance of recurrent neural networks on GPUS. arXiv preprint arXiv:1604.01946
Ordóñez FJ, Roggen D (2016) Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16(1):115
Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610
Ordóñez FJ, de Toledo P, Sanchis A (2013) Activity recognition using hybrid generative/discriminative models on home environments using binary sensors. Sensors 13(5):5460–5477
van Kasteren TLM, Englebienne G, Kröse BJA (2011) Human activity recognition from wireless sensor network data: benchmark and software. In: Activity recognition in pervasive intelligent environments, pp 165–186. Springer
Anguita D, Ghio A, Oneto L, Parra X, Reyes-Ortiz JL (2013) A public domain dataset for human activity recognition using smartphones. In: ESANN, vol 3, p 3
Reyes-Ortiz J-L, Oneto L, Samà A, Parra X, Anguita D (2016) Transition-aware human activity recognition using smartphones. Neurocomputing 171:754–767
Shinmoto Torres RL, Ranasinghe DC, Shi Q (2013) Evaluation of wearable sensor tag data segmentation approaches for real time activity classification in elderly. In: International conference on mobile and ubiquitous systems: computing, networking, and services, pp 384–395. Springer
Shinmoto Torres RL, Ranasinghe DC, Shi Q, Sample AP (2013) Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In: 2013 IEEE international conference on RFID (RFID), pp 191–198. IEEE
Wickramasinghe A, Ranasinghe DC (2016) Recognising activities in real time using body worn passive sensors with sparse data streams: To interpolate or not to interpolate? In: Proceedings of the 12th EAI international conference on mobile and ubiquitous systems: computing, networking and services on 12th EAI international conference on mobile and ubiquitous systems: computing, networking and services, pp 21–30
Quero JM, Orr C, Zang S, Nugent C, Salguero A, Espinilla M (2018) Real-time recognition of interleaved activities based on ensemble classifier of long short-term memory with fuzzy temporal windows. In: Multidisciplinary digital publishing institute proceedings, vol 2, p 1225
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Keywords
- Activity recognition
- Smart home
- Self-attention
- Dilated causal convolution