ConvNet-based performers attention and supervised contrastive learning for activity recognition

Hamad, Rebeen Ali; Yang, Longzhi; Woo, Wai Lok; Wei, Bo

doi:10.1007/s10489-022-03937-y

Open access
Published: 03 August 2022

Volume 53, pages 8809–8825, (2023)
Applied Intelligence

Rebeen Ali Hamad ORCID: orcid.org/0000-0001-9489-8330¹,
Longzhi Yang¹,
Wai Lok Woo¹ &
…
Bo Wei²

Abstract

Human activity recognition based on generated sensor data plays a major role in a large number of applications such as healthcare monitoring and surveillance systems. Yet, accurately recognizing human activities remains a challenging and active research problem because people tend to perform daily activities in different ways and to multitask. Existing approaches based on the recurrent setting for human activity recognition have some issues, such as the inability to process data in parallel, higher memory requirements and high computational cost, although they achieve reasonable results. A Convolutional Neural Network processes data in parallel, but it breaks the ordering of the input data, which is significant for building an effective model for human activity recognition. To overcome these challenges, this study proposes causal convolution based on performers-attention and supervised contrastive learning to entirely forego recurrent architectures, efficiently maintain the ordering of human daily activities and focus more on important timesteps of the sensors’ data. Supervised contrastive learning is integrated to learn a discriminative representation of human activities and enhance predictive performance. The proposed network is extensively evaluated for human activities using multiple datasets including wearable sensor data and smart home environments data. The experiments on three wearable sensor datasets and five smart home public datasets of human activities reveal that our proposed network achieves better results and reduces the training time compared with the existing state-of-the-art methods and basic temporal models.

1 Introduction

Sensors from smart home environments and wearable objects generate a large amount of valuable data used for different applications including human activity recognition (HAR). Smart home environments equipped with sensors are designed for ambient assisted living to unobtrusively track human activities [12]. Further, wearable sensors have also been used to gather customized data about users’ habits. Wearable sensors can be embedded into different objects such as mobile phones, clothes, belts, wristwatches, or glasses, which can be worn to record users’ movements for the purpose of HAR [19]. Moreover, wearable and smart home sensors can record enough perceived information to detect ambulatory and postural activities [5, 29].

HAR is an active and challenging research field in ubiquitous computing that aims to understand human activities, and it plays a significant role in several applications in the fields of healthcare monitoring [30], security surveillance systems [27], and resident situation assessment [21]. HAR, as one of the important applications of healthcare monitoring from sensor data, is used to monitor and track vulnerable people [25]. However, human activities are highly diverse due to different sensor readings, and even the same subject tends to perform an activity in different ways. Also, the categories denoting daily human activities are inherently imbalanced, and hence building a robust machine learning model for HAR is challenging. Moreover, data generated by sensors can occasionally be noisy, which adds extra challenges and ambiguity to the interpretation of human activities [13].

Deep learning models are widely employed in different applications of computer vision, audio recognition, and natural language processing. Furthermore, deep learning approaches have improved HAR systems based on sensor-generated data and show promising results. Since most HAR problems are formulated as sequential learning problems [22], the Recurrent Neural Network (RNN), as a type of sequential learner, and its variations, particularly Long Short-Term Memory (LSTM), have demonstrated satisfying and state-of-the-art performance [25]. LSTM-integrated models are commonly used and increase the performance of HAR systems. However, LSTM requires a large amount of memory and high computational capacity for its memory cells and gating mechanism when learning to process temporal sequential contextual information [13]. Further, LSTM models process the timesteps of temporal sensor data sequentially because processing any timestep requires the outcomes of the previous timesteps [2, 42]. The Convolutional Neural Network (ConvNet) is employed to extract temporal contextual information for HAR systems from sensor data [13, 36]. Even though the training of one-dimensional (1D) ConvNet models is remarkably faster than that of LSTM due to the absence of recurrent settings, LSTM models show better performance than 1D ConvNet for HAR systems. Furthermore, because 1D ConvNet models process sequential temporal sensor data in parallel, they are not sensitive to the order of the data, which is crucial for HAR.

The self-attention technique is used to focus more on important timesteps of the feature maps by computing similarity scores for all timesteps [42]. However, the computational and memory requirements of the self-attention technique are quadratic in the length of the input sensor sequence, which leads to slow learning and higher memory usage.

To overcome the above challenges, we propose the causal ConvNet based on performers-attention and supervised contrastive learning. The proposed network improves the results of HAR systems on sensor-generated data. In addition, the proposed method also accelerates the learning process compared to the existing methods. Causal convolution [2, 31] is adopted to avoid violating the ordering of the timesteps of the input datasets, which is crucial in HAR systems. Performers-attention [6], which scales linearly with the input sequence length, is proposed to reduce the computation and memory cost compared to the self-attention mechanism for HAR systems. Moreover, supervised contrastive learning is adopted to learn a good representation from the input sensor data that supports classifiers in gaining useful information [3, 20]. Due to integrating supervised contrastive learning, the proposed network has two learning stages. The network learns a good representation of human activities in the first stage in order to learn a more accurate classifier in the second stage. In the first stage, the supervised contrastive loss function is applied to learn the representation of human activities, which is further propagated through a projection network. In the second stage, a linear classifier is trained on top of the frozen representations while the projection network is discarded. The two stages of learning prepare a discriminative representation that renders a more accurate classifier [20].

Moreover, the diversity of human activities leads to long-tailed datasets with skewed class distributions, so classifiers often tend to be biased towards majority classes and misclassify minority classes. To address this limitation, the focal loss function [23] based upon the effective number of samples [7] is proposed, which assigns higher weights to hard-classified examples so that minority classes are learned sufficiently. The focal loss function is used in the second stage to learn a linear classifier for HAR. The proposed network is evaluated on eight benchmark HAR datasets and compared with the existing state-of-the-art methods. The experimental results demonstrate that our proposed network can obtain better results compared with the existing state-of-the-art methods. An ablation study is carried out to demonstrate the contribution of each of the components (performers-attention, supervised contrastive learning with two-stage learning, causal convolution, and focal loss) of the proposed network.

To summarise, we propose a causal ConvNet-based performers-attention and supervised contrastive learning to increase the accuracy of HAR systems and accelerate the learning process. The main components of the proposed network are described below:

i. The performers-attention is adopted to effectively expose significant timesteps that involve human activities.
ii. Supervised contrastive learning within the network is proposed to render expressive representations that help the classifier to accurately and easily recognize human activities.
iii. Causal convolutions as part of the network are proposed to maintain the ordering of sensor data, which is important for HAR systems, by preventing information flow from future to past.
iv. The focal loss function based on the effective number of samples is proposed to down-weight well-classified examples and focus on hard-classified examples.

The remainder of this paper is structured as follows. Related work is reviewed in Section 2. The background for this study is provided in Section 3. The details of the proposed network are described in Section 4. Section 5 reports the experimental setup and evaluation. Finally, Section 6 concludes the paper.

2 Related works

Deep learning models have shown a significant breakthrough with appreciable performance on different HAR benchmark datasets [11]. Moreover, deep learning models are used not only in the form of single-model learning but also joint-model learning to address class imbalance problems and improve HAR systems [15]. Since HAR is a sequential classification problem, recurrent network-based architectures, i.e., RNN and LSTM, have shown satisfying results. RNN-based HAR has been used to recognize human activities from sensor data [8]. Although the results achieved with RNN are reasonable, RNN cannot prevent vanishing and exploding gradient problems when processing long input sequences [25]. LSTM [17] was developed to prevent vanishing and exploding gradients using multiple switch gates. LSTM can process long-term dependencies of temporal sequential data, including in HAR systems. Several studies have used LSTM to model human activities from sensor data [11, 25, 33]. Moreover, LSTM is not only used alone or in ensemble form [25] to model human activities but is also combined with ConvNet, stacked or in parallel, to process long-term dependencies and enhance HAR systems [11]. ConvNet is used for HAR systems to dispense with recurrent architectures, make the learning phase faster, and process sequential temporal human activities in parallel [36].

The self-attention mechanism [42] is used with recurrent-based networks and ConvNet to focus more on the most relevant time steps and increase the accuracy of HAR systems. A hybrid of ConvNet and LSTM with a self-attention mechanism is used for HAR using reinforcement learning from wearable sensors [16]. Because this hybrid method is trained with reinforcement learning, it requires large computing resources for the training phase. Furthermore, the self-attention mechanism is appended to the Convolutional LSTM model for HAR to pay more attention to informative timesteps from temporal sequential wearable sensor data [37]. The self-attention mechanism is employed in further studies of HAR based on wearable sensor data [4, 24]. The recurrent network architecture in these methods leads to a delay in the training process. Moreover, these methods are built only on wearable sensor data for HAR systems. The DeepConvLSTM method is suggested for HAR based on sensor data from smart homes [26]. Due to the recurrent setting in this method, parallelization in processing the input sequence is restricted, which makes this model computationally expensive and memory-intensive. Further, this model is only compared to bidirectional LSTM and evaluated on three smart home datasets. A ConvNet based on dual attention that entirely dispenses with recurrent settings is proposed for activity recognition; however, that model is only evaluated on wearable sensor datasets [9].

Despite the effectiveness of self-attention for HAR, the computation and memory costs of the self-attention technique scale quadratically with the length of the data, which delays the learning process. To remedy these limitations and enhance the performance of HAR systems from sensor data, we propose causal ConvNet-based performers-attention and supervised contrastive learning. Firstly, the performers-attention mechanism scales linearly with the length of the sensor input data, which makes the learning process faster. Secondly, supervised contrastive learning increases the performance of the proposed network by replacing one-stage learning with two stages of learning, where the first stage is representation learning and the second stage is classifier learning.

3 Background

3.1 Self-attention

Self-attention is a powerful mechanism that computes correlation scores for all pairs of samples in the input sensor data. The self-attention mechanism was introduced and exploited by the Transformer architecture to process sequential data in parallel [42]. To make the model pay extra attention to the essential time steps when modelling HAR from the temporal sensor representation, the self-attention technique is employed in the training phase. The attention technique has the following learned components: query Q, key K, and values V. The dimension of the query Q and key K is d_k, while the dimension of V is d_v [42]. The complexity of the self-attention mechanism scales quadratically with the length of the input temporal sequence, which increases model learning time and requires more memory. This limitation of the self-attention mechanism is addressed in Section 4.2. The attention matrix computation is shown in (1).

$$ \boldsymbol{Z}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = softmax \left( \frac{1}{\sqrt{d_{k}}} {\boldsymbol{Q}\cdot \boldsymbol {K}^{T}}\right) \boldsymbol{V}. $$

(1)
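As an illustration of (1), the following NumPy sketch computes scaled dot-product attention for a single sequence. The function names and the toy sizes (sequence length 6, d_k = 4, d_v = 3) are assumptions chosen for the example and are not taken from the paper. The intermediate score matrix has one entry per pair of timesteps, which is where the quadratic cost comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention as in Eq. (1).

    Q, K have shape (L, d_k) and V has shape (L, d_v),
    where L is the number of timesteps.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (L, L): one score per pair of timesteps
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example with illustrative (assumed) sizes.
rng = np.random.default_rng(0)
L, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))
Z, A = self_attention(Q, K, V)
```

Here `A` is the L × L attention matrix and `Z` is the attended output; doubling L quadruples the size of `A`.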

3.2 Contrastive learning

Contrastive learning [10] has been employed for supervised and unsupervised learning as an objective function [20, 28]. The purpose of contrastive learning is to learn a parametric function $f_{\theta }:\mathbb {R}^{D} \rightarrow \mathbb {R}^{d}$ that is able to map an input x to a feature vector ($f_{\theta } (x) \in \mathbb {R}^{d}$ with d< D), so that a high-dimensional input space with complex similarities is projected to a low-dimensional latent embedding space in which cosine distance serves as the distance measure. Generally, contrastive learning aims to learn representations by mapping input data to a feature space where similar examples are close together and dissimilar examples are far apart [10]. Hence, contrastive learning increases both intra-class compactness and inter-class separability, which leads to a better classifier. Moreover, learning representations of the input data helps classifiers to easily extract useful information to properly distinguish categories [3]. Supervised contrastive learning [20] maps the encoded normalized samples belonging to the same class close together in the embedding space while simultaneously pushing apart clusters of samples from different categories.

4 Proposed network

The proposed network is built using a causal 1D ConvNet with the performers-attention based on supervised contrastive learning. The proposed method takes the minority classes from the input datasets into consideration using the focal loss function with an effective-number-of-samples weighting technique, as described in Section 4.1. The causal convolution component in the proposed network is used to avoid information flow from future to past: the output at time t is computed solely from the time steps at time t and earlier in the previous layer. Therefore, predictions at time t cannot rely on any future time steps of the sequential sensor data. This helps the proposed network to maintain the ordering of the temporal data [31], which is significant for HAR systems [13]. Moreover, the details of the performers-attention are provided in Section 4.2. Figures 1 and 2 present the structure of the proposed network and the two stages of learning, in which the representation learning uses the supervised contrastive loss function and the classifier learning uses the focal loss function. More details about supervised contrastive learning and both learning stages are provided in Section 4.3.
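The causal constraint can be illustrated with a minimal single-channel NumPy sketch. It pads the input on the left only, so that the output at time t is a weighted sum of the inputs at t, t-1, ..., t-k+1. The function name, kernel values and signal are assumptions for illustration; the network in the paper uses learned multi-channel 1D convolution layers.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1D convolution: y[t] = sum_i w[i] * x[t - i].

    x: (T,) input signal, w: (k,) kernel. Inputs before t = 0 are
    treated as zeros, and no input after time t contributes to y[t].
    """
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    # The window for output t covers x[t-k+1 .. t]; the kernel is reversed
    # so that w[0] multiplies the current sample x[t].
    return np.array([x_padded[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative sensor signal
w = np.array([0.5, 0.3, 0.2])             # illustrative kernel
y = causal_conv1d(x, w)

# Changing a future sample leaves all earlier outputs untouched.
x_future_changed = x.copy()
x_future_changed[3] = 100.0
y_changed = causal_conv1d(x_future_changed, w)
```

In this example `y[:3]` and `y_changed[:3]` are identical, because the modified sample at index 3 lies in the future of those outputs.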

4.1 Focal loss

The focal loss [23] was introduced to address the class imbalance between background and foreground classes during training in the one-stage object detection scenario. The focal loss is designed to down-weight well-classified examples and focus on hard-classified examples. With the focal loss function, the loss value of hard-classified examples is much higher than the loss values of well-classified examples. Since the focal loss focuses more on a sparse set of hard-classified samples, it is used in our proposed network to improve the learning of minority classes in HAR systems. The focal loss function is shown in (2).

$$ FL (p_{t}) = - \alpha_{t} (1-p_{t})^{\gamma} \log(p_{t}) $$

(2)
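The following NumPy sketch shows one way (2) can be combined with class weights derived from the effective number of samples. It assumes the usual conventions of [23] and [7]: p_t is the predicted probability of the true class, γ is the focusing parameter, and α_t is the weight of the true class, taken here as inversely proportional to the effective number (1 − β^n)/(1 − β) for a class with n samples. The values β = 0.999 and γ = 2, the weight normalisation, and the class counts are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np

def class_balanced_alpha(samples_per_class, beta=0.999):
    """Per-class weights inversely proportional to the effective number of samples."""
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - beta ** n) / (1.0 - beta)
    alpha = 1.0 / effective_num
    # Normalise so the weights sum to the number of classes (an assumed convention).
    return alpha / alpha.sum() * len(n)

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Mean focal loss, Eq. (2).

    probs: (N, C) class probabilities, labels: (N,) integer class ids,
    alpha: (C,) per-class weights.
    """
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    alpha_t = alpha[labels]
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

# Illustrative imbalanced class counts (hypothetical, not from the datasets).
alpha = class_balanced_alpha([5000, 200])

easy = np.array([[0.95, 0.05]])   # well-classified example of class 0
hard = np.array([[0.60, 0.40]])   # hard example whose true class is 1
loss_easy = focal_loss(easy, np.array([0]), alpha)
loss_hard = focal_loss(hard, np.array([1]), alpha)
```

The factor (1 − p_t)^γ shrinks the contribution of the well-classified example, while the class weight raises the contribution of the minority class, so `loss_hard` is much larger than `loss_easy`.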

4.2 Generalized kernelizable attention

The complexity of the self-attention mechanism scales quadratically with the length of the input temporal sequence, which increases model learning time and requires more memory. To address this limitation, we adopt performers-attention [6] as an efficient attention mechanism whose complexity scales linearly with the length L of the input sequence. Performers use the Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm and substitute the Transformer self-attention with generalized kernelizable attention. The FAVOR+ algorithm estimates the regular softmax attention by random feature map decompositions. Hence, the core idea of Performers is to decompose the attention matrix into a matrix product. This algorithm leverages positive orthogonal random features to approximate softmax attention kernels with provable accuracy and linear, O(L), computational and space complexity [6]. Previous efficient attention mechanisms, such as those based on sparsity and low-rankness, relied on structural assumptions about the attention matrix without approximating the original softmax function. Generalized kernelizable attention allows the model to process longer input sequences and train faster compared to previous attention mechanisms. The aim of using generalized kernelizable attention and FAVOR+ is to approximate the softmax and to choose the order in which the matrices of (1) are multiplied.
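The reordering idea can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the FAVOR+ implementation of [6]: it uses plain i.i.d. Gaussian random projections and omits the orthogonalisation step, and the function names and sizes are chosen for the example. The feature map φ(x) = exp(Wx − ‖x‖²/2)/√m has strictly positive entries and satisfies E[φ(q)·φ(k)] = exp(q·k), so φ(Q)φ(K)ᵀ estimates the unnormalised softmax scores of (1).

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), with W of shape (m, d)."""
    m = W.shape[0]
    half_sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Approximate softmax attention without forming the (L, L) matrix."""
    d_k = Q.shape[-1]
    # Scaling both inputs by d_k^(-1/4) reproduces the 1/sqrt(d_k) factor of Eq. (1).
    q = positive_random_features(Q / d_k ** 0.25, W)   # (L, m)
    k = positive_random_features(K / d_k ** 0.25, W)   # (L, m)
    kv = k.T @ V                       # (m, d_v): computed once, linear in L
    normaliser = q @ k.sum(axis=0)     # (L,): row sums of the implicit score matrix
    return (q @ kv) / normaliser[:, None]

# Toy example with illustrative (assumed) sizes.
rng = np.random.default_rng(0)
L, d_k, d_v, m = 8, 4, 3, 64
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))
W = rng.normal(size=(m, d_k))          # i.i.d. Gaussian projections (not orthogonalised)
out = linear_attention(Q, K, V, W)
```

Computing `q @ (k.T @ V)` instead of `(q @ k.T) @ V` gives the same result by associativity, but the cost grows with L·m·d_v rather than L², which is the source of the linear scaling in the sequence length.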

4.3 Supervised contrastive learning

In this study, supervised contrastive learning (SCL) is used to build a model for HAR that outperforms the state-of-the-art HAR methods. The proposed method based on SCL consists of two stages of learning. In the first stage, two components are trained which are encoder and projection networks. The first stage learns representations used in the second learning stage to build a robust and accurate classifier for HAR systems. The details of the first stage are as follows:

1. Encoder network E(⋅) maps the temporal sequential input data x to a representation vector $r= E(x) \in R^{D_{E}}$ where D_E = 512. The encoder network consists of two 1D ConvNet layers followed by a fully connected layer. The performers-attention is then applied to effectively extract deep semantic correlations from action sequences involving human activities. After each layer, normalization and dropout regularization are applied to make the learning process faster and prevent the encoder from overfitting. 1D ConvNet-based networks have been proposed as fast and accurate models for HAR systems [11]. This is due to the ability of 1D ConvNet to extract highly correlated features by considering local dependencies in temporal sequential input data.
2. Projection network Proj(⋅) maps the representation vector r to a projected vector $z= Proj(r)\in R^{D_{E}}$ where D_E = 512. The projection network is a single fully connected layer appended to the encoder. The encoder and projection networks are trained using the contrastive loss function so that embeddings of the same class are close together and embeddings of different classes are far apart. The projection network is discarded at the end of the contrastive training. Equation 3 shows the supervised contrastive loss function which is used in the first stage to learn the encoder.

(3)

where

N is the number of random samples in a mini-batch;
N_y is the total number of samples in the mini-batch with the same label y;
z_i = Proj(E(x_i)) and z_j = Proj(E(x_j)) are the projected vectors of the samples belonging to the same class;
while z_k = Proj(E(x_k)) is the projected vector of a different class;
$\mathcal {T}$ is a positive scalar temperature parameter;
avoids inner product of the same vector;
ensures that the z_i and z_j are the projected vectors of the same class;
is used to ensure that the z_k does not belong to the class of z_i and z_j.
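A loss matching the symbols listed above can be sketched in NumPy as follows. The exact form is an assumption made for illustration, written in the style of the supervised contrastive loss of [20]: for each anchor z_i, every same-class sample z_j (with j ≠ i) is contrasted against all samples z_k of a different class, each anchor's term is averaged over its N_y − 1 positives, and the anchor terms are summed over the mini-batch. The function name, the temperature value and the toy embeddings are hypothetical.

```python
import numpy as np

def supervised_contrastive_loss(z, y, temperature=0.1):
    """Assumed form of a supervised contrastive loss over one mini-batch.

    z: (N, d) projected vectors, y: (N,) integer class labels.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit-length projections
    sim = np.exp(z @ z.T / temperature)                 # exp(z_i . z_k / T) for all pairs
    same = y[:, None] == y[None, :]                     # True where labels match
    not_self = ~np.eye(len(y), dtype=bool)              # excludes the pair (i, i)
    pos = same & not_self                               # positives: same class, j != i
    neg_sum = (sim * ~same).sum(axis=1, keepdims=True)  # sum over k of a different class
    log_ratio = np.log(sim / (sim + neg_sum))           # one term per (anchor, positive)
    n_pos = pos.sum(axis=1)                             # N_y - 1 for each anchor
    # Anchors without any positive in the batch contribute zero.
    per_anchor = -(log_ratio * pos).sum(axis=1) / np.maximum(n_pos, 1)
    return per_anchor.sum()

# Toy projections forming two tight clusters (hypothetical values).
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = supervised_contrastive_loss(z, np.array([0, 0, 1, 1]))
loss_mismatched = supervised_contrastive_loss(z, np.array([0, 1, 0, 1]))
```

When the labels agree with the clusters the loss is close to zero, while labels that cut across the clusters give a much larger value, which is the pressure that pulls same-class projections together and pushes different classes apart.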

In the second stage, a classifier consisting of a fully connected layer followed by a softmax layer is trained on top of the encoder network. The encoder network from the first stage is frozen and the projection network is discarded, so the classifier is learned from the representation produced by the encoder alone. In the second stage, the network uses the focal loss function to predict human activities. The proposed network, a causal ConvNet based on supervised contrastive learning and performers-attention, forgoes recurrent settings to further accelerate the learning phase and improve the recognition score for HAR systems. Causal convolution ensures that the model does not violate the ordering of the time steps of the temporal sensor data. The performers-attention helps the proposed network to pay extra attention to the discriminative features in order to accurately recognize human activities. Supervised contrastive learning is used to build the proposed network in two stages of learning, where the first stage learns a good data representation for learning the classifier in the second stage. Compared to normal one-stage learning, the two stages yield a better representation with more discriminative features that help the classifier to better distinguish human activities. The focal loss function according to the effective number of examples is used to prevent skewed learning toward majority activities and improve the recognition scores of the minority activities.

5 Experiments and evaluation

In this section, experiments and evaluations based on eight datasets of human activities are presented and discussed. Moreover, the results of the proposed network are compared with those of the existing state-of-the-art models.

5.1 Datasets and preprocessing

5.1.1 Ordonez smart environment datasets

Daily human activities collected in five intelligent environments equipped with sensors are used in this research to evaluate the proposed network. Ordóñez homes A and B [32] are two smart environments that are equipped with binary sensors to read and collect human activities. Different binary sensors, such as pressure sensors and passive infrared (PIR) sensors, are utilized within these two smart homes to capture various human movements. The details of these two smart environments are shown in Table 1. In Ordóñez smart environment A, 12 binary sensors including PIR, pressure, flush, and magnetic sensors were employed to read and collect nine daily activities in 14 days over 20,358 minutes. In Ordóñez smart home B, ten human activities are recorded using 12 binary sensors in 22 days over 30,469 minutes. There are nine activities common to these two smart environments, which are Showering, Sleeping, Breakfast, Snack, Lunch, Spare Time/TV, Grooming, Toileting, and Leaving. In addition, Ordóñez smart home B has one more recorded activity, which is Dinner.

Table 1 Information about experimental datasets
