1 Introduction

Human activity recognition (HAR) is a significant research field in ubiquitous computing that monitors people's behavior and plays an important role in applications such as healthcare monitoring [1], security surveillance systems [2], and resident situation assessment [3]. In healthcare monitoring, HAR, as one of the significant applications of intelligent environments and wearable sensor technologies, has been used to monitor activities of daily living (ADLs) in order to support and assist senior people and people with disabilities or cognitive impairments [4]. HAR in smart homes equipped with ubiquitous sensors has gained increasing attention in the field of ambient assisted living as a means of improving the quality of independent living of residents [5]. Smart homes with unobtrusive sensing technology for HAR are a suitable solution for enhancing independent living when privacy is a concern [4, 6]. Wearable sensors are mainly embedded into mobile devices, wristwatches, clothes, glasses, belts, or shoes, and are worn on the body to capture the wearer's motion, posture, location, and interaction with the physical surroundings. Wearable sensors such as gyroscopes, accelerometers, GPS, and RFID readers (used together with RFID tags) have commonly been used for HAR to record information about users' body movement and position, e.g., walking, running, and lying [7, 8]. The aim of HAR is to identify and recognize simple and complex daily human activities using smart home and wearable sensor data. HAR based on sensor data is a challenging task because the data can be noisy, which leads to ambiguity in the interpretation of human activities [9]. Noise in the data can be caused by errors in the sensor connection system, which then fails to provide correct sensor activations.
HAR systems based on sensor data have progressed notably and, with the recent development of machine learning, have obtained promising results in elderly-care alert systems and emergency assistance [10]. Monitoring the long-term daily routine activities of a resident in a smart home setting makes it possible to determine and assess wellness. In particular, remote monitoring of daily routine activities such as eating, sleeping, or medication intake enables caregivers to track and assess functional health status and to support the needs of elderly people living alone [9]. Moreover, smart home and wearable sensors provide sufficient information to properly detect postural and ambulatory activities [11, 12].

Traditional machine learning approaches such as naive Bayes, support vector machines, and hidden Markov models have made tremendous progress on HAR and obtained satisfying results [8]. However, these approaches rely entirely on hand-crafted heuristic feature extraction, which is highly data dependent and usually limited by the availability of domain experts. Hand-crafted features are time consuming to design and are not always generalizable across application domains. Moreover, hand-crafting often does not generate a sufficient number of features from a given dataset [13]. Recently, deep learning methods have been increasingly used in computer vision [14], natural language processing, speech recognition [15], and audio recognition [16]. Deep learning methods have also been used for HAR systems based on smart home sensors and wearable sensors. Most of these HAR systems have shown encouraging results for different purposes and on different datasets of daily routine activities. Among the deep learning methods, long short-term memory (LSTM), a sequential deep learning model and a variant of the recurrent neural network (RNN), has achieved state-of-the-art performance for temporal information processing in various applications [17]. In particular, LSTM shows state-of-the-art performance in recognizing activities of daily living (ADLs) on various activity recognition benchmark datasets using wearable sensors and smart home sensors [6, 13]. ADLs are the normal daily activities that people perform for self-care, such as eating, drinking, and bathing [18]. Even though LSTM models improve the performance of HAR systems, training them is computationally expensive because of the gating mechanism that enables long-term dependencies to be learned. Moreover, LSTM models occupy more memory and cannot process the timesteps of the input data in parallel, since each timestep needs the results of the previous timestep [19, 20].
One-dimensional convolutional neural networks (1D CNNs) have been used instead of LSTM to capture the sequential temporal information in the input data for HAR systems [21, 22]. Although training 1D CNN models is much faster than training recurrent methods such as LSTM owing to the absence of recurrent connections, the results achieved by 1D CNNs fall short of those shown by LSTM in HAR systems. Moreover, 1D CNN is not sensitive to the order of timesteps, which is key in HAR systems. To address these problems, we propose dilated causal self-attention convolution, which entirely forgoes recurrent settings, to improve the performance of HAR. We adopt dilated causal convolution, which is used as part of WaveNet to generate raw audio waveforms [23]. Dilated causal convolution allows WaveNet to model long-range temporal dependencies and to outperform LSTM [24]. While dilated causal convolution captures long-range temporal sequential information [25], it is crucial to focus on particular information in the feature maps generated by the dilated causal convolution using the self-attention mechanism [26]. The self-attention mechanism, which is leveraged by the transformer [19], enables temporal models to expose context from the feature maps within the sequence.

To summarise, the main contributions of this paper are:

  1. Proposing a model that accelerates training and improves activity recognition results compared to state-of-the-art methods using dilated causal and self-attention convolution.

  2. Causal convolutions within the proposed method are used to prevent information leakage from future to past.

  3. Dilated convolutions within the proposed method are used to maximize the receptive field by orders of magnitude and aggregate multi-scale contextual information without considerably increasing computational cost.

  4. Multi-head self-attention within the proposed method is used to effectively expose deep semantic correlations from action sequences involving human activities.

  5. Conducting extensive experiments using eight benchmark datasets of human daily activities from smart homes and wearable sensors to validate the proposed approach, which show that our proposed method can improve the accuracy by 5% to 9% and reduce the training time compared with methods based on recurrent neural network architectures.

2 Related work

Human activity recognition based on sensor data is a challenging research area that has attracted much attention in the machine learning field. Numerous methods have been proposed to model and recognize ADLs [8, 27]. Early research modeled activity recognition using support vector machines (SVM), decision trees, k-nearest neighbors (KNN), and naïve Bayes [28]. HAR systems based on these traditional approaches have achieved reasonable recognition results. However, these approaches only process manually extracted heuristic features of human activities. Hand-crafting features is time consuming and is usually limited by the availability of domain experts. Hence, deep learning models have been proposed in various applications to address these problems [29].

Deep learning models have shown satisfying results and reported state-of-the-art accuracy on various HAR benchmark datasets [4, 6]. Moreover, deep learning models have been jointly used to handle imbalanced data and improve the generalization of HAR [18]. Recurrent architectures such as RNN and LSTM have been firmly established as state-of-the-art methods for sequence modeling problems, including activity recognition [19]. RNN has been employed to recognize daily human activities from smart home sensor data [30], and the results show that RNN is useful for modeling and recognizing human activities. Yet RNN cannot properly process very long sequences and suffers from both vanishing and exploding gradient problems [4]. LSTM, which mitigates vanishing and exploding gradients through its ability to handle long-term dependencies, is often used to process temporal sequential data [31]. For instance, satisfying results have been shown by employing LSTM to recognize human activities on diverse collected sensor data [4, 6, 32, 33]. Further, different LSTM architectures have been proposed to improve the performance of HAR systems, such as stacked LSTM [34], bidirectional LSTM [33], and ensemble LSTM [35]. Moreover, CNN combined with LSTM has also been employed to further improve the performance of HAR systems [6, 28].

A dilated CNN can be used instead of a standard CNN to increase the convolutional receptive field without losing resolution [27]. Since dilated convolution only inserts empty elements between the elements of the standard convolution kernel, it requires no extra computational cost. Dilated convolution has been proposed for human activity recognition from wearable sensor data [36, 37]. Dilated convolution has also been used for voice activity detection and audio source separation [38, 39].

A weakly supervised learning method based on a combined CNN and LSTM with self-attention layers, trained with reinforcement learning on wearable sensor data, is proposed for human activity recognition in [40]. However, this method requires large computing resources and has high complexity because it relies on reinforcement learning. Moreover, a convolutional LSTM with a self-attention mechanism is proposed to capture the spatio-temporal context in human activities and to focus on significant timesteps in temporal wearable sensor data [13]. References [41, 42] employed a self-attention mechanism to improve the performance of HAR systems based on wearable sensors, and the results improved on the state of the art. Betancourt et al. proposed an LSTM model based on the attention mechanism; the model is tested on only two wearable sensor datasets. These methods have two limitations. First, their recurrent setting slows down the learning process. Second, they are applied only to HAR systems based on wearable sensors, and the learning time is not considered. Reference [43] proposed a DeepConvLSTM model based on the attention mechanism for human activity recognition on smart home datasets. The method considers each timestep in the sequential data as a word and a specified time window as a sentence. However, the method is evaluated on only three smart home datasets and compared only to bidirectional LSTM. Moreover, owing to the sequential operation of the recurrent setting in this method, parallelization is limited, which makes the model computationally expensive. Gao et al. proposed a CNN with dual attention and without a recurrent setting for HAR systems; however, the method is tested only on wearable sensor datasets [44].

To address these limitations and to improve the performance as well as reduce the learning time of HAR systems based on smart home sensor and wearable sensor data, we propose dilated causal and multi-head self-attention convolution.

3 Background

In this section, we describe the temporal models used for comparison, i.e., LSTM, 1D CNN, the hybrid 1D CNN and LSTM model, and bidirectional LSTM.

3.1 Temporal modeling via LSTM

LSTM is a type of RNN used to learn from temporal sequential data. LSTM can learn long-term dependencies and alleviates the vanishing and exploding gradient problems [31]. LSTM as a temporal model has been used to recognize ADLs from sensor data [4, 6]. LSTM processes temporal data using a forget gate, an input gate, and an output gate to add information to or delete information from the cell state while the sequence is processed. The cell state is the main part of an LSTM; it carries relevant information from earlier timesteps to later timesteps. Figure 1 shows the connection of the gates with the cell state in a single LSTM cell. During training, the gates learn to keep relevant information and forget irrelevant information when updating the cell state. Hence, each LSTM cell works as a memory in which removing, reading, and writing information are controlled by the forget, output, and input gates, respectively. The forget gate processes the previous output \(h_{t-1}\) and the new timestep \(x_{t}\) using the sigmoid activation function to indicate which information is relevant. The forget gate keeps the information if the outcome of the sigmoid function is 1 and deletes it if the outcome is 0. Equation (1) shows how the forget gate within a single LSTM cell is computed. The next step consists of two parts that determine the new information to be stored in the cell state. The first part is the input gate, which indicates how much new information from the current input (\(x_{t}, h_{t-1}\)) is added to the cell state. The second part is the tanh activation function, which renders a vector of new candidate values \(\tilde{C}_{t}\) that can be added to the cell state. Equations (2) and (3) show how the input gate and the new candidate values are computed, respectively.
A new cell state \(C_{t}\) is generated by summing the product of these two parts and the product of the forget gate with the previous cell state \(C_{t-1}\). Equation (4) shows how the new cell state is computed. Multiplying the previous cell state by the forget gate deletes the part of the information that was earlier decided to be forgotten. The new candidate values are then scaled by how much the cell state is to be updated using \(i_{t} \times \tilde{C}_{t}\). Finally, the sigmoid activation function processes both the previous hidden state \(h_{t-1}\) and the current input timestep \(x_{t}\) to produce the output gate.

The output gate filters the information using two different activation functions and specifies the next hidden state. The tanh activation function processes the newly updated cell state, and its output is multiplied by the output of the sigmoid function to render the next hidden state. The updated cell state and the newly generated hidden state pass information to the next timestep. Equations (5) and (6) show how the output gate and the hidden state are calculated.

$$\begin{aligned}&f_{t} = \sigma (W_{f}\cdot [h_{t-1},x_{t}]+ b_{f}) \end{aligned} \tag{1}$$
$$\begin{aligned}&i_{t} =\sigma (W_{i}\cdot [h_{t-1},x_{t}]+b_{i})\end{aligned} \tag{2}$$
$$\begin{aligned}&\tilde{C}_{t} = \tanh (W_{C}\cdot [h_{t-1},x_{t}]+ b_{C}) \end{aligned} \tag{3}$$
$$\begin{aligned}&C_{t} = f_{t} \times C_{t-1} + i_{t} \times \tilde{C}_{t} \end{aligned} \tag{4}$$
$$\begin{aligned}&o_{t} = \sigma (W_{o}\cdot [h_{t-1},x_{t}]+ b_{o}) \end{aligned} \tag{5}$$
$$\begin{aligned}&h_{t}= o_{t}\times \tanh (C_{t}) \end{aligned} \tag{6}$$

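where \(x_{t}\) is the input at timestep t, \(\sigma\) is the sigmoid activation function, \(\tanh\) is the hyperbolic tangent activation function, W and b denote the weight matrix and bias vector of the respective gate, and \(\times\) denotes element-wise multiplication.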
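The gate computations in Eqs. (1) to (6) can be traced with a minimal sketch of a single LSTM timestep. The sketch below is illustrative only and is not the implementation used in this study: the function names (`lstm_step`, `affine`), the parameter dictionary layout, and the toy sizes (3 input features, 2 hidden units, all-zero weights) are assumptions chosen so that the result can be checked by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W.v + b for a list-of-rows matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM timestep following Eqs. (1)-(6); p holds the weights and biases."""
    z = h_prev + x_t                                                    # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]           # Eq. (1) forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]           # Eq. (2) input gate
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]     # Eq. (3) candidate values
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # Eq. (4) new cell state
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]           # Eq. (5) output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # Eq. (6) new hidden state
    return h_t, c_t

# Toy setup (assumed values): zero weights and biases make every gate output sigmoid(0) = 0.5
# and every candidate value tanh(0) = 0, so the cell state is simply halved.
params = {name: [[0.0] * 5 for _ in range(2)] for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0, 0.0] for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -2.0], params)
print(c_t)  # [0.5, -1.0]
```

Because each call needs `h_prev` and `c_prev` from the preceding call, the timesteps of one sequence have to be processed one after another, which is the source of the training cost discussed above.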

LSTM has been used for HAR applications and has achieved promising results [4, 6, 9, 45]. Hence, in this paper, LSTM is used as a temporal model for comparison with the proposed method. Two LSTM layers are stacked and followed by a flatten layer. The outputs of the flatten layer are then passed into a fully connected layer with the ReLU activation function, followed by a softmax layer. Figure 2 shows the architecture of the LSTM model.

Fig. 1 Single LSTM cell

Fig. 2 Architecture of the LSTM model

A fast LSTM implementation backed by cuDNN (CuDNNLSTM) [46] is also used in this study with the same architecture as the LSTM model. CuDNNLSTM is a version of LSTM that uses the cuDNN library; it can only be run on a GPU and accelerates training and inference.

3.2 Temporal modeling via 1D CNN

1D CNN has been widely used in HAR systems and has shown satisfying results [6, 21]. 1D CNN can properly extract features from raw data and takes into account local dependencies, i.e., nearby values that are likely to be correlated. 1D CNN can also learn hierarchical data representations of human activities, which improves HAR systems [45]. Compared to LSTM, 1D CNN has obtained competitive results in several applications such as activity recognition, machine translation, and audio generation with much faster learning time. However, 1D CNN is not sensitive to order, which is significant in activity recognition [8]. Hence, 1D CNN alone is not an optimal replacement for LSTM. In this paper, 1D CNN is employed and its results are reported. The 1D CNN model is designed by stacking two convolutional layers, each with 64 filters. The kernel size, i.e., the length of the 1D convolution window, is 3, and the stride is 1. A max-pooling layer with a window size of 2 is applied after the convolution layers to down-sample the feature maps. The feature maps are flattened and processed by the fully connected layers, i.e., a dense layer with the ReLU activation function followed by a softmax layer. Figure 3 shows the architecture of the 1D CNN.
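To make the layer arithmetic of this baseline concrete, the following sketch computes how the sequence length shrinks through two convolutions (kernel size 3, stride 1) and one max-pooling layer (window 2), and how many values reach the dense layer after flattening. Two details are assumptions because the text does not state them: the convolutions are unpadded ("valid"), and the pooling stride equals the pooling window. The input length of 128 timesteps is likewise only an example value.

```python
def window_out_len(n, window, stride=1):
    """Output length of an unpadded ('valid') 1D convolution or pooling window."""
    if n < window:
        raise ValueError("input is shorter than the window")
    return (n - window) // stride + 1

def cnn_feature_count(timesteps, filters=64, kernel=3, pool=2):
    """Return (sequence length after pooling, number of flattened features)."""
    n = window_out_len(timesteps, kernel)        # first convolution, stride 1
    n = window_out_len(n, kernel)                # second convolution, stride 1
    n = window_out_len(n, pool, stride=pool)     # max pooling (stride assumed equal to window)
    return n, n * filters

print(cnn_feature_count(128))  # (62, 3968)
```

With padded ("same") convolutions the lengths would instead be 128, 128, and 64, so the exact counts depend on a choice this section leaves open.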

Fig. 3 Architecture of the 1D CNN model

3.3 Temporal modeling via Hybrid: 1D CNN + LSTM

The hybrid model based on stacking 1D CNN and LSTM sequentially has been used to improve the performance of HAR systems [6, 18]. In this study, the hybrid model is built by stacking one 1D CNN layer and one LSTM layer to recognize human activities from smart home data. Figure 4 shows the architecture of the hybrid model. The input data are first fed into the 1D CNN layer, which extracts features before the LSTM layer to support sequence recognition. The input sub-sequences of sensor data are processed independently by the 1D CNN; hence, the order of timesteps is not considered. The feature maps of the 1D CNN are down-sampled by a max-pooling layer with a window size of 2 before the LSTM layer. The feature maps are processed by the LSTM and then flattened, followed by fully connected layers, i.e., a dense layer with the ReLU activation function and a softmax layer. Furthermore, 1D CNN layers in the hybrid model are often applied when recurrent models cannot realistically handle long-term dependencies in the input sequence data. In such cases, the 1D CNN in the hybrid model shortens the long-term dependencies by down-sampling and extracting higher-level features. The extracted features generated by the 1D CNN can then be better processed by the recurrent models [47]. However, order sensitivity is not considered in the features extracted by the 1D CNN. Hence, the hybrid of 1D CNN and LSTM is not the most suitable solution for improving the performance of activity recognition [18].

Fig. 4 Hybrid 1D CNN + LSTM model

3.4 Temporal modeling via Bidirectional LSTM

Bidirectional LSTM processes the input data in the forward and backward directions, using the previous and subsequent information of a specific time step in two separate recurrent layers [48]. Figure 5 shows a bidirectional LSTM in which the inputs of the backward states are not connected to the outputs of the forward states. Including future information in addition to past information in bidirectional LSTM appears at first sight to violate causality [49]. Although bidirectional LSTM has been successfully applied to HAR and achieved satisfying results, it is expensive to train since it has a double recurrent setting in each layer [33]. Bidirectional LSTM is used in this study by stacking two forward and backward LSTM layers. The outputs of these two layers are flattened and then fed to fully connected layers, i.e., a dense layer with the ReLU activation function and a softmax layer.

Fig. 5 Bidirectional LSTM model

4 Proposed method

In this section, we describe the proposed method, which combines dilated causal convolution with a self-attention mechanism, for HAR in smart home data. We aim to design a convolutional network model that outperforms recurrent architectures in terms of recognition score and training time. The distinguishing characteristics of our proposed method are: (1) the proposed model stops information leakage from future to past using causal convolution; (2) the proposed model can handle temporal sequential data of any length and map it to an output sequence of the same length; (3) the model can simultaneously focus on different important time steps of the input sequence using the multi-head self-attention mechanism. The details of the proposed model are described in the following subsections.

4.1 Sequence modeling

Before describing the details of the proposed model, we define the sequence modeling task for human activities. Input human activity sequences \(x_0,\ldots, x_T\) are fed into a model to predict the corresponding activity outputs \(y_0,\ldots,y_T\) at each time. The prediction of the activity output \(y_t\) for a particular time t should rely only on the inputs observed up to time t, \(x_0,\ldots, x_t\), and not on any future inputs [20]. Hence, sequence modeling is a function \(f:x_{0},\ldots,x_{T} \rightarrow y_{0},\ldots,y_{T}\) (where x and y are the input and output, respectively) that renders the mapping shown in Eq. (7).

$$\begin{aligned} \hat{y}_0,\ldots,\hat{y}_T = f(x_0,\ldots, x_T) \end{aligned} \tag{7}$$

The model f is expected to minimize a loss \(L(y_0,\ldots,y_T, f(x_0,\ldots, x_T))\) between the actual labels and the predicted outputs, where the input sequences and the outputs are drawn from some distribution. This formalism cannot be used directly for domains such as sequence-to-sequence prediction or machine translation, since these domains require the entire input sequence (past and future states) [20]. However, the setting can be extended to these domains.
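The causality constraint of this formulation can be probed empirically: if a future input \(x_s\) is changed, no output at a time earlier than s may change. The sketch below illustrates such a check; the helper `is_causal` and the two toy sequence functions are hypothetical names introduced here, and a single perturbation per position is only a quick test, not a proof of causality.

```python
def is_causal(f, xs, delta=1.0):
    """Return True if perturbing x_s never changes the outputs at times 0..s-1."""
    base = f(xs)
    for s in range(1, len(xs)):
        perturbed = list(xs)
        perturbed[s] += delta
        if f(perturbed)[:s] != base[:s]:
            return False
    return True

def running_mean(xs):
    """Causal toy model: output t is the mean of x_0..x_t."""
    out, total = [], 0.0
    for t, x in enumerate(xs):
        total += x
        out.append(total / (t + 1))
    return out

def centered_mean(xs):
    """Non-causal toy model: output t also looks at x_{t+1}."""
    out = []
    for t in range(len(xs)):
        window = xs[max(0, t - 1):t + 2]
        out.append(sum(window) / len(window))
    return out

sequence = [1.0, 2.0, 3.0, 4.0]
print(is_causal(running_mean, sequence))   # True
print(is_causal(centered_mean, sequence))  # False
```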

4.2 Dilated causal convolutions

Causal convolutions are used in the proposed method so that the model predicts the output at time t based only on the convolutions of the sequence inputs from time t and earlier in the previous layers [20]. Causal convolutions also preserve the ordering of sequential input patterns. However, causal convolutions require very large filters or many hidden layers to expand the receptive field [23]. To maximize the receptive field and aggregate multi-scale contextual information without considerably increasing computational cost, dilated convolutions are integrated into the proposed method. Dilated convolutions enable the model to increase the receptive field exponentially with only a few layers while maintaining computational efficiency [25]. The dilated causal convolution DCC for a one-dimensional input sequence \(x \in R^{n}\) with a filter \(f:\{0,\ldots,k-1\} \rightarrow R\) on element s of the sequence is defined as:

$$\begin{aligned} \mathrm{DCC}(s) = (x \star _{d} f)(s) = \sum \limits _{i=0}^{k-1} f(i)\cdot x_{s-d \cdot i} \end{aligned} \tag{8}$$

where d is the dilation factor, k is the filter size, and \(s - d \cdot i\) accounts for the direction of the past. The dilation factor d is increased exponentially with the depth of the model, i.e., \(d =2^{l-1}\) at layer l of the model. Formally, we increase the dilation factor d by a factor of 2 in each layer \(l=1,\ldots,L\), where L is the number of dilated causal convolution layers in the proposed model. Equation (9) shows the dilation factors used in this study.

$$\begin{aligned} d\in \{ 2^{0},2^{1},2^{2},\ldots,2^{L-1} \} \end{aligned} \tag{9}$$
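A direct, unoptimized reading of Eq. (8) with the dilation schedule of Eq. (9) can be sketched as follows. The sketch treats the input as zero before index 0 (implicit left zero-padding) so that the output has the same length as the input; this padding choice, the function names, and the toy filter \(f = (1, 1)\) are assumptions made for illustration.

```python
def dilated_causal_conv(x, f, d):
    """Eq. (8): y[s] = sum_i f[i] * x[s - d*i], with x taken as zero before index 0."""
    y = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(len(f)):
            j = s - d * i          # only current and past positions are read
            if j >= 0:
                acc += f[i] * x[j]
        y.append(acc)
    return y

def dilation_schedule(num_layers):
    """Eq. (9): dilation factors 2^0, 2^1, ..., 2^(L-1)."""
    return [2 ** l for l in range(num_layers)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
f = [1.0, 1.0]                     # toy filter of size k = 2 that simply adds its two taps
h = x
for d in dilation_schedule(3):     # dilations 1, 2, 4
    h = dilated_causal_conv(h, f, d)
print(h)  # [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0]
```

With this toy filter, three layers turn each output into the sum of the current input and the seven preceding inputs, i.e., the last output sees all eight timesteps, while no output ever depends on a later input.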

In addition, dilated convolution reduces to standard convolution when d = 1. Figure 6 shows the dilated causal convolutions in the proposed model for dilations 1, 2, and 4. Dilated convolutions with different dilation factors allow a filter to be applied at different ranges. With dilated convolution, a filter convolves input values over an area larger than its length by skipping input values with a certain step, which is the dilation factor. Hence, dilated convolution is equivalent to a standard convolution with a larger filter obtained by dilating the original filter with zeros, but it is considerably more efficient. Dilated convolution effectively enables the model to aggregate multi-scale contextual information with fewer layers than a standard convolution needs for the same receptive field [25]. Therefore, stacking dilated causal convolutions reduces the number of learnable parameters, which leads to more efficient training and a lightweight model.
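The saving in depth can be quantified with a small calculation. For a stack of causal convolutions with filter size k and dilation factors \(d_1,\ldots,d_L\), the receptive field is \(1 + (k-1)\sum_{l} d_l\). The sketch below compares how many layers a dilated stack (factors 1, 2, 4, ...) and an undilated stack (all factors 1) need to cover a target history; the filter size 3 and the target of 128 timesteps are example values, not settings reported for the proposed model.

```python
def receptive_field(kernel, dilations):
    """Number of input timesteps visible to one output of the stacked causal convolutions."""
    return 1 + (kernel - 1) * sum(dilations)

def layers_needed(kernel, target, dilated):
    """Smallest number of layers whose receptive field covers `target` timesteps."""
    dilations = []
    while receptive_field(kernel, dilations) < target:
        dilations.append(2 ** len(dilations) if dilated else 1)
    return len(dilations)

print(receptive_field(2, [1, 2, 4]))            # 8
print(layers_needed(3, 128, dilated=True))      # 7
print(layers_needed(3, 128, dilated=False))     # 64
```

Since every layer has the same number of filter weights regardless of its dilation factor, needing 7 instead of 64 layers translates directly into fewer learnable parameters.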

Fig. 6 Dilated causal convolution and self-attention model

4.3 Self-attention network

The self-attention mechanism is a robust technique for computing the correlation between all the time steps in the input sequence and a weighted combination of them [19]. After dilated causal convolution has been applied to aggregate multi-scale contextual information, multi-head self-attention is used to enable the model to focus more on important and relevant time steps than on insignificant ones in the sequential feature maps during recognition. Hence, the attention mechanism aims to learn the most important time steps in the sequence feature maps, which aids more accurate recognition. Self-attention identifies relative weights for each time step in the sequence feature map by considering its similarity to all the other time steps within the sequence. The representation of each time step is then transformed by combining relevant information from the other time steps, weighted according to their importance. The self-attention mechanism has three learned linear transformations: query Q, key K, and value V, where Q and K have the same vector dimension \(d_{k}\), and V and the outputs have the same dimension \(d_{v}\) [19]. To obtain the attention scores, dot-product attention is applied between each query, i.e., the transformed representation of a specific time step, and the keys of every other time step. The softmax function is then applied to the scaled dot product of the queries and keys to generate the attention scores. Lastly, the attention scores are used to produce a weighted representation of the value matrix for each of the time steps in the sequence. Equation (10) shows that the multi-head self-attention is implemented entirely as matrix multiplication operations.

$$\begin{aligned} f^{(hj)} _{sa} (Q, K, V) = {\rm softmax}\left(\frac{ Q \cdot K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned} \tag{10}$$

The model computes the attention multiple times in parallel (multi-head) to capture distinct correlation information from the input sequence. Hence, hj in Eq. (10) denotes the output from attention head j, and sa refers to self-attention. Distinct parameters are used in Eq. (10) for computing the key, query, and value of each of the n attention heads. The outputs of the attention heads are concatenated and transformed to the dimension of the input sequence using the learned parameter \(W_{o}\), as defined in Eq. (11). The outputs of the multi-head self-attention (\({M}_{ha}\)) are fed into fully connected layers, i.e., a dense layer with the ReLU activation function and a softmax layer.

$$\begin{aligned} {M}_{ha}= W_{o} \cdot {\rm concat} ({f}^{(h1)} _{sa},\ldots,{f}^{(hn-1)} _{sa},{f}^{(hn)} _{sa}) \end{aligned} \tag{11}$$
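Eqs. (10) and (11) can be followed step by step in a small sketch that uses plain nested lists as matrices. It is a simplified illustration rather than the implementation used in this study: the helper names, the toy weights, and the row-vector convention (each timestep is a row, so the projection is written as concat times \(W_o\)) are assumptions.

```python
import math

def matmul(A, B):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                                  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """Eq. (10) for one head: softmax(Q K^T / sqrt(d_k)) V. Returns (output, attention weights)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]    # one row of weights per timestep
    return matmul(weights, V), weights

def multi_head_self_attention(X, heads, W_o):
    """Eq. (11): concatenate the head outputs per timestep, then project with W_o."""
    outputs = [attention_head(X, *params)[0] for params in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

# Toy feature map (assumed values): 3 timesteps with 2 features each, 2 heads.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
heads = [(I2, I2, I2), (I2, I2, [[0.5, 0.0], [0.0, 0.5]])]
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # maps 4 concatenated values back to 2
M_ha = multi_head_self_attention(X, heads, W_o)
print(len(M_ha), len(M_ha[0]))  # 3 2
```

Every row of the attention weights sums to 1, so each output timestep is a convex combination of the value vectors of all timesteps, and all timesteps are computed by the same matrix products without any recurrence.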

The proposed method, which is based on dilated causal convolution, forgoes recurrent architectures to accelerate training and inference. Causal convolution maintains the ordering of the data, which is crucial for HAR systems. Dilated convolution increases the receptive field and produces feature maps with multi-scale receptive fields by using different dilation rates in the convolution layers. Dilated convolution preserves the resolution of the data since the layers are dilated instead of pooled. The multi-head self-attention mechanism is employed in the proposed method to capture informative timesteps in the feature maps and thereby improve recognition. Dilated causal convolution combined with a self-attention mechanism makes the proposed method computationally efficient and improves the recognition scores. Algorithm 1, together with Fig. 6, provides more information about how the layers of the proposed method are stacked.

figure a

5 Experimental setup and evaluation

In this section, we present the experimental setup and evaluation, including the details of the datasets used, the evaluation methods, and the results.

5.1 Datasets and preprocessing

5.1.1 Ordonez smart home datasets

Human activity datasets collected in five smart homes using embedded binary sensors are used in this study to evaluate the proposed method. Ordóñez homes A and B [50] are two real-world smart homes in which daily physical activities are recorded using non-intrusive binary sensors. Different binary sensors are used in these two smart homes to detect different human activities. For example, passive infrared (PIR) sensors detect human movement in a limited area, pressure sensors on beds and couches detect the user’s presence, reed switches on cupboards and doors register whether they are open or closed, and float sensors in the bathroom register whether the toilet has been flushed. Table 1 shows details about the residents, sensors, and number of activities of Ordóñez smart homes A and B. In Ordóñez smart home A, twelve binary sensors were used to record nine human activities in fourteen days over a period of 20,358 min. In Ordóñez smart home B, twelve binary sensors were used to record ten human activities in twenty-two days over a period of 30,469 min. The activities common to Ordóñez homes A and B are Breakfast, Lunch, Sleeping, Grooming, Leaving, Idle, Snack, Showering, Spare Time/TV, and Toileting. In addition to these activities, Ordóñez home B has the activity Dinner.

5.1.2 Kasteren smart home datasets

The Kasteren home A, B, and C datasets were likewise recorded in three other smart homes using non-intrusive embedded binary sensors [51]. Table 1 also shows the details of these three datasets regarding the residents and the number of sensors and activities. In Kasteren home A, fourteen sensors were used to record ten human activities in 25 days over a period of 40,005 min. In Kasteren home B, twenty-three binary sensors were used to record thirteen human activities in 14 days over a period of 38,900 min. In Kasteren home C, twenty-one binary sensors were used to record sixteen human activities in nineteen days over a period of 25,486 min.

5.1.3 Wearable smartphone (inertial sensors) dataset

This dataset for human activity recognition was built by recording the activities of daily living (ADLs) of 30 study participants carrying a waist-mounted smartphone with embedded inertial sensors [52, 53]. The participants, aged 19–48 years, performed six daily activities, of which three are static postures (standing, sitting, lying) and three are dynamic activities (walking, walking downstairs, and walking upstairs). The participants wore a smartphone (Samsung Galaxy S II) on the waist to record the activities. The embedded accelerometer and gyroscope were used to capture 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The activities were video-recorded so that the dataset could be annotated manually. The dataset is randomly split into a training set with the data of 70% of the participants and a testing set with the data of the remaining 30%. The six activity labels are: (i) Walking; (ii) Walking_upstairs; (iii) Walking_downstairs; (iv) Sitting; (v) Standing; (vi) Laying. Table 4 shows the frequency distribution of activities in the training and testing sets. The accelerometer and gyroscope signals were preprocessed using noise filters. Furthermore, the signals were sampled in fixed-width sliding windows of 2.56 s with 50% overlap.
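The windowing described above follows from simple arithmetic: at 50 Hz, a window of 2.56 s contains 128 samples, and a 50% overlap means that consecutive windows start 64 samples apart. The sketch below shows one way to segment a single signal channel accordingly; the function name is hypothetical, and dropping an incomplete trailing window is an assumption, since the text does not say how the end of a recording is handled.

```python
def sliding_windows(signal, rate_hz=50, window_s=2.56, overlap=0.5):
    """Split a sequence of samples into fixed-width, overlapping windows."""
    size = round(rate_hz * window_s)          # 128 samples per window
    step = round(size * (1.0 - overlap))      # 64 samples between window starts
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, step)]

ten_seconds = list(range(500))                # 10 s of samples at 50 Hz (sample indices as values)
windows = sliding_windows(ten_seconds)
print(len(windows), len(windows[0]), windows[1][0])  # 6 128 64
```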

5.1.4 Wearable wireless identification and sensing data

Fourteen elderly volunteers (78 ± 4.9 years old) wore a Wearable Wireless Identification and Sensing Platform (W\(^2\)ISP) tag [54, 55, 56]. The W\(^2\)ISP tag was placed on top of their garment at the sternum level to capture trunk movements and recognize four activities: (i) sit on bed; (ii) sit on chair; (iii) lying; (iv) ambulating. The activities were performed in two clinical room configurations (Roomset1 and Roomset2) for ambulatory monitoring of older patients. Table 5 shows the frequency distribution of activities in both datasets, Roomset1 and Roomset2.

5.1.5 Preprocessing smart home data

The timeline of the human daily activities for all the smart home data is segmented into time slots using a window size of \(\Delta t\) = 1 min. The raw sensor data from the smart homes provide the start time and end time of each sensor activation as well as the type (such as pressure sensor), location (such as bed), and place (such as bedroom) of the sensor. To generate the input datasets from the raw sensor data, multiple and incremental fuzzy temporal windows (FTW) are used. FTW is an established technique for segmenting the sensor data and preparing the input datasets [4, 6, 9, 18, 57]. FTW has been shown to capture sensor signals of both long-duration and short-duration human activities, such as sleep or snack, from raw sensor data [4, 57], which improves the recognition results of the temporal models. Furthermore, temporal models, i.e., LSTM and 1D CNN, achieved better activity recognition results when the input datasets were generated by FTW than with other methods such as Equally Sized Temporal Windows (ESTWs), Raw and Last Activation (RLA), and Raw and Last Next Activation (RLNA) [4, 6].
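
The time-slot segmentation with \(\Delta t\) = 1 min can be sketched as follows. This is only the plain binary slot representation, under the assumption that a sensor counts as active in a slot if its activation interval overlaps that slot; it does not implement the fuzzy temporal windows of [4, 57]. The sensor names, timestamps, and the function `to_time_slots` are hypothetical.

```python
from datetime import datetime, timedelta


def to_time_slots(activations, start, end, delta=timedelta(minutes=1)):
    """Return (sensor names, rows): one binary row per time slot of width delta."""
    sensors = sorted({name for name, _, _ in activations})
    n_slots = int((end - start) / delta)
    rows = [[0] * len(sensors) for _ in range(n_slots)]
    for name, on, off in activations:
        col = sensors.index(name)
        for k in range(n_slots):
            slot_start = start + k * delta
            slot_end = slot_start + delta
            # Mark the slot if the activation interval overlaps it.
            if on < slot_end and off > slot_start:
                rows[k][col] = 1
    return sensors, rows


# Hypothetical raw activations: (sensor, start time, end time).
day = datetime(2020, 1, 1, 8, 0, 0)
raw = [
    ("bed_pressure", day + timedelta(seconds=30), day + timedelta(seconds=130)),
    ("fridge_door", day + timedelta(minutes=3), day + timedelta(seconds=200)),
]
names, slots = to_time_slots(raw, start=day, end=day + timedelta(minutes=5))
for row in slots:
    print(dict(zip(names, row)))
```

Each row then serves as the feature vector for one minute of the timeline; an FTW-based pipeline would replace the binary values with fuzzy aggregations over several preceding windows.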

Table 1 Details of the datasets
Table 2 Frequency of activities in the Ordóñez datasets
Table 3 Frequency of activities in the Kasteren datasets
Table 4 Frequency distribution of activities in the Wearable smartphone (inertial sensors) dataset
Table 5 Frequency distribution of wearable wireless identification and sensing datasets

5.2 Models hyper-parameters

In this section, the hyper-parameters of all the models in this study are presented. The following parameter ranges were explored in a series of trial-and-error experiments to find the optimal values.

  • Learning rates from 0.0001 to 0.01.

  • Batch size values 32, 64, 128, and 256.

  • Dropout rate values 20%, 30%, 40%, and 50%.

  • Number of epochs from 1 to 100.

Based on the series of trial-and-error experiments, we observed that a learning rate of 0.001, a batch size of 64, a dropout rate of 20%, and 50 epochs are the most appropriate hyper-parameters for the models to converge. To find a proper number of epochs, early stopping is used as a regularization technique to terminate the training when the validation error starts increasing. Hence, the training was stopped at the minimum of the validation loss. To find a proper learning rate over the range in the experiments, the other hyper-parameters were fixed. This process was repeated for each hyper-parameter in turn until all of them were set. A large batch size can make training faster but requires more memory space [6]. In contrast, a smaller batch size requires less memory space and trains slightly more slowly, but can cause the model to converge quickly; hence, the choice is mostly a trade-off [6]. The 20% dropout rate is used as a regularization technique to prevent the models from overfitting [58]. The dropout technique ignores randomly selected neurons during the training process: it temporarily disconnects the ignored neurons on the forward pass, and hence their weights are not updated on the backward pass. Layer normalization, which normalizes the input data across the features, is used after each dilated causal convolution [59]. Layer normalization can reduce the training time, as empirically shown in [59].
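
Two of the mechanisms described above, early stopping and dropout, can be sketched in a few lines. This is an illustrative sketch and not the training code of this study: the `patience` parameter, the validation-loss values, and the function names are assumptions. The dropout shown is the common "inverted" variant, which rescales the kept activations during training.

```python
import random


def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), 1-indexed, for a list of validation losses."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch   # no improvement for `patience` epochs
    return best_epoch, len(val_losses)


def dropout(activations, rate=0.2, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best, stopped = early_stopping_epoch(losses, patience=3)
print(best, stopped)  # prints: 4 7

# With rate=0.2, roughly 80% of the units survive a forward pass.
masked = dropout([1.0] * 10, rate=0.2, rng=random.Random(0))
print(masked)
```

The weights from `best_epoch` would be the ones retained, which corresponds to stopping at the minimum of the validation loss.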

5.3 Measure evaluation

F1-score is used as the metric to compare the performance of the proposed approach with other temporal methods. Accuracy is often used to evaluate the performance of classifiers. However, accuracy is not an appropriate measure for classification in the presence of imbalanced classes, because under-represented classes have very little impact on accuracy compared to the prevalent classes [6]. Hence, F1-score is employed to evaluate all the temporal models, since F1-score is the harmonic mean of recall and precision and can provide more insight into the behavior of the temporal models than the accuracy metric [4]. F1-score is calculated as in Eqs. (12) and (13).

$$\begin{aligned}&\mathbf{F1}{\text{-score}} = \frac{2 \cdot {\rm precision}\cdot {\rm recall}}{{\rm precision}+ {\rm recall}} \end{aligned}$$
$$\begin{aligned}&\mathbf{recall} = \frac{{\rm TP}}{{\rm TP}+ {\rm FN}},\quad \mathbf{precision} = \frac{{\rm TP}}{{\rm TP}+{\rm FP}} \end{aligned}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. Moreover, F1-score is widely used in activity recognition [4, 6, 18, 35].
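
As a worked illustration of Eqs. (12) and (13), the sketch below computes the per-class F1-score from label lists and averages it over the classes. The unweighted (macro) averaging, the activity labels, and the predictions are assumptions made for the example. It also shows why accuracy can be misleading on imbalanced data: a classifier that always predicts the prevalent class scores 80% accuracy here, but fails completely on the rare class.

```python
def f1_per_class(y_true, y_pred, label):
    """F1-score of one class, following Eqs. (12) and (13)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (an assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)


# Imbalanced toy example: 8 "sleep" slots, 2 "snack" slots, all predicted "sleep".
y_true = ["sleep"] * 8 + ["snack"] * 2
y_pred = ["sleep"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                           # 0.8
print(round(f1_per_class(y_true, y_pred, "sleep"), 4))    # 0.8889
print(round(f1_per_class(y_true, y_pred, "snack"), 4))    # 0.0
print(round(macro_f1(y_true, y_pred), 4))                 # 0.4444
```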

5.4 Results and discussion

In this section, the experimental results of the proposed dilated causal convolution with self-attention model for HAR are presented and discussed. The results achieved for each activity by the baseline models are presented alongside those of the proposed method. In addition, the training time of all the temporal models is reported so that it can be easily compared with the training time of the proposed method. The results of the proposed method are compared with the temporal models 1D CNN, LSTM, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM. The proposed method improved the results of HAR by 5% to 7% compared with these models and reduced the training time. Figure 7 shows the results of the proposed method compared to the state-of-the-art techniques on the eight datasets. The results indicate that the proposed method outperformed the temporal and recurrent-based models for human activity recognition on all the datasets.

5.4.1 Results from Ordóñez datasets

Tables 6 and 7 show the F1-score and training time (in seconds) of the proposed method and of the temporal models on the Ordóñez smart home A and B datasets. The results show that the proposed method outperforms the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM). The training time of the proposed method is much lower than that of LSTM, hybrid 1D CNN + LSTM, and Bidirectional LSTM, and slightly higher than that of 1D CNN. This indicates that the proposed method reduced the training time and significantly improved HAR on the Ordóñez smart home datasets. Importantly, the proposed method also trained faster than the CuDNNLSTM model, a fast LSTM implementation backed by the CUDA library.

Fig. 7 Average F1-score of the proposed method compared with the state-of-the-art techniques on the eight datasets

Table 6 Results of F1-score and training time in seconds from Ordóñez Home A dataset
Table 7 Results of F1-score and training time in seconds from Ordóñez Home B dataset

5.4.2 Results from Kasteren datasets

Tables 8, 9 and 10 show the results of the proposed method compared to the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM) on the Kasteren smart home A, B, and C datasets, respectively. The F1-scores show that the proposed method improved HAR on the Kasteren datasets, both for each activity and on average. The proposed method considerably reduced the training time compared with the recurrent neural network-based architectures, with a somewhat higher training time than 1D CNN. The results indicate that dilated causal convolution with self-attention can effectively improve the performance of HAR systems and reduce the training time.

5.4.3 Results from wearable sensors datasets

Figures 8, 9, and 10 show the results of the experiments based on wearable sensors for HAR. The results of the proposed method are compared to the results of the temporal models (LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM). Tables 11, 12, and 13 show the detailed results and the training time of all the models. In particular, Table 11 shows the results obtained from the smartphone sensor data. The results on the wearable sensor data show that the proposed method outperforms the state-of-the-art techniques. The training time of the proposed method is shorter than that of all the temporal and recurrent models except 1D CNN. The proposed method improved the performance for each activity as well as the average performance over all activities compared to the recurrent and temporal models on all the wearable sensor datasets.

Fig. 8 F1-score from wearable dataset of RoomSet1

Fig. 9 F1-score from wearable dataset of RoomSet2

Fig. 10 F1-score from wearable smartphone dataset

5.4.4 Proposed method compared to the DeepConvLSTM + attention

The results of our proposed method are compared with those achieved by DeepConvLSTM + Attention [13] on all the datasets. The results and training time of DeepConvLSTM + Attention are reported for all the datasets in this research. Since DeepConvLSTM + Attention combines a 2D CNN and an LSTM with an attention mechanism, it requires more time to process the input data than our proposed method. Moreover, compared to DeepConvLSTM + Attention, our proposed method achieved better scores with much faster training on all the datasets. For instance, our proposed method achieved F1-scores of 90.78 and 87.51 on the Ordóñez smart home A and B datasets, respectively, while DeepConvLSTM + Attention achieved F1-scores of 84.97 and 84.51 on the same datasets with higher training times.

Our proposed method dispenses with recurrence entirely to accelerate training and boost the performance of HAR systems. Dilated convolution aggregates multi-scale contextual information to render informative feature maps. Causal convolution ensures that the model cannot violate the ordering of the sequential temporal input data: the output at a timestep depends only on the current and earlier timesteps. The attention mechanism allows the proposed method to focus on the important timesteps and thereby improve the recognition process. The proposed method improved the results for each activity as well as the average results over all the activities on all the datasets.
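
The two convolutional properties discussed above can be made concrete with a minimal single-channel sketch. It is an illustration of the general technique, not the architecture of this study: the kernel values, the dilation rates, and the function names are assumptions. The output at timestep t uses only x[t], x[t - d], x[t - 2d], and so on, where d is the dilation rate, with zeros assumed before the start of the sequence.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:           # implicit left zero-padding, never a future sample
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of past-and-present timesteps seen by a stack of dilated causal layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(dilated_causal_conv1d(x, kernel=[1.0, 1.0], dilation=2))
# prints: [1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

# Assumed dilation schedule 1, 2, 4, 8 with kernel size 2.
print(receptive_field(2, [1, 2, 4, 8]))  # prints: 16
```

Stacking layers with growing dilation rates widens the temporal context exponentially with depth while the number of weights grows only linearly, which is how multi-scale context is aggregated without recurrence.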

5.4.5 Ablation study of the proposed method

An ablation study is conducted to show the performance of the proposed method without dilated convolution, causal convolution, and the attention mechanism. Table 14 shows, for all the datasets, the results of the proposed method without each of these techniques, without all three of them, and of the full proposed method. The results show how the proposed method is affected by each of dilated convolution, causal convolution, and the attention mechanism. For example, the proposed method achieved an F1-score of 90.78, while it achieved 84.93 without dilated convolution, 83.24 without attention, and 85.41 without causal convolution. Moreover, the proposed method without all three techniques achieved an F1-score of 80.54. Among the single-component ablations, the proposed method without the attention mechanism achieved the lowest scores on all the datasets. Hence, the results indicate that the attention mechanism contributes more to the proposed method than dilated and causal convolutions. Besides the ablation study, the proposed method is compared with the DeepConvLSTM + Attention method and several temporal and recurrent models: LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM.

Table 8 F1-score results and training time in seconds of Kasteren smart home A dataset
Table 9 Results of F1-score and training time in seconds of Kasteren smart home B datasets
Table 10 F1-score results and training time in seconds of Kasteren home C datasets
Table 11 Results of F1-score and training time in seconds from smartphone dataset
Table 12 Results of F1-score and training time in seconds from wearable dataset of RoomSet1
Table 13 Results of F1-score and training time in seconds from wearable dataset of RoomSet2
Table 14 Results of F1-score of ablation studies of the proposed method

6 Conclusion

This study proposes dilated causal convolution with multi-head self-attention to accelerate training and improve the performance of HAR systems on smart home and wearable sensor data. Thorough experiments are conducted on eight real-world smart home and wearable datasets to evaluate the proposed method against temporal and recurrent-based architectures. The results of the experiments show that the proposed method significantly improved the F1-score of HAR and reduced the training time compared to the state-of-the-art techniques. The proposed method improved the performance of HAR systems by up to 7% compared with LSTM, 1D CNN, hybrid 1D CNN + LSTM, CuDNNLSTM, and Bidirectional LSTM using wearable sensor and smart home sensor data.

The self-attention operation scales quadratically with the input sequence length and adds weight parameters to the model, both of which can increase training time. To address this limitation, our future work will investigate a lightweight multi-head self-attention mechanism to further accelerate training and enhance the performance of HAR.
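
The quadratic scaling can be seen in a minimal sketch of single-head scaled dot-product self-attention. The sketch is a simplification under stated assumptions: it omits the learned query, key, and value projections and the multiple heads of the proposed method, and the input values are arbitrary. For a sequence of T timesteps, the attention weights form a T × T matrix, so doubling T quadruples the number of entries to compute.

```python
import math


def self_attention(x):
    """Single-head scaled dot-product self-attention without learned projections.

    x is a list of T feature vectors; returns (outputs, T x T attention weights).
    """
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = []
    for row in scores:
        m = max(row)                              # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])  # softmax over the timesteps
    outputs = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]
    return outputs, weights


for T in (4, 8):
    seq = [[math.sin(t), math.cos(t)] for t in range(T)]
    _, attn = self_attention(seq)
    print(T, sum(len(row) for row in attn))  # prints "4 16" then "8 64"
```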