1 Introduction

Anomaly detection in video surveillance is a popular research area in computer vision because of its diverse applications, such as detecting traffic accidents, criminal behaviour, and other illegal activities. Yet detecting an abnormal activity among vast numbers of normal situations is challenging. The first challenge is to collect and label all types of abnormal events, since normal events dominate and abnormal ones are rare. The second challenge is the context-dependent nature of abnormal events: an activity regarded as anomalous in one context can be normal in another. For instance, a pedestrian crossing the street at a crosswalk is a normal activity, but the same activity is considered abnormal where there is no crosswalk. Moreover, given that it is time-consuming and inefficient for humans to watch and analyse massive amounts of surveillance video, an automatic system is essential for analysing and detecting abnormal events in surveillance videos.

The goal of frame-level anomaly detection in videos is to identify frames whose spatial and motion content deviates from normality. A model trained using only normal samples (or frames) to learn a generic distribution of normal events cannot represent unseen events or activities, which are therefore considered anomalies. Consequently, abnormal frames can be distinguished at test time by the reconstruction/prediction error between the ground truth frame and the reconstructed/predicted output.

In video anomaly detection, motion information is one of the most important cues for deciding whether an event is normal or abnormal. Some existing approaches use a two-stream network [7, 14, 15] for anomaly detection, consisting of a spatial stream and a temporal stream. The former learns the spatial structure of input frames while the latter leverages the optical flow between neighboring frames; however, extracting optical flow incurs extra computational cost. Another approach uses a recurrent neural network, such as a variational LSTM [16, 22, 38], to model temporal motion information, although the model becomes overly complex as the number of stacked layers increases [33, 35].

In machine learning, attention is a technique that imitates human cognitive attention, enhancing part of the input, such as an object, while neglecting the remaining parts. In the anomaly detection area, [39] showed that selectively attending to the foreground area, where dynamic objects move, while neglecting the static background improved performance.

To handle these issues, we introduce an Attention-based Spatio-Temporal NETwork (ASTNet)Footnote 1, which adopts an autoencoder architecture for efficient anomaly detection. The proposed network exploits both spatial and temporal features in a unified manner: the features extracted by a Deep Convolutional Neural Network (DCNN) are fed into two parallel branches that capture spatial structures and motion features, and the resulting spatio-temporal features are then fed into a decoder to predict the future frame. In contrast to the figure-ground separated application of attention in [39], we propose a cascaded attention design in which a channel attention module is inserted at each layer of the decoder to better exploit the channel relationships of the features. The main contributions of our work can be summarized as follows:

  • We propose an attention-based residual autoencoder for video anomaly detection, which encodes both spatial and temporal information in a unified way.

  • The temporal shift is applied to model temporal information, since it provides high performance with a low computational cost.

  • The channel attention is applied in a cascaded manner within the decoder to exploit channel dependencies and predict the future frame more effectively.

  • Our model outperforms state-of-the-art methods on three standard benchmark datasets, even without using any optical flow estimator.

The rest of this paper is organized as follows. An overview of related work is discussed in Section 2. Section 3 describes our proposed method. Detailed experiment results and discussions are given in Section 4. Finally, Section 5 concludes this paper.

2 Related work

Recently, anomaly detection has attracted a lot of attention from researchers. There are roughly two representative approaches in video anomaly detection: reconstruction-based methods and prediction-based methods.

Reconstruction-based method. With this approach, a model is trained to reconstruct the input frame. The most popular model is the autoencoder, consisting of an encoder and a decoder: the former compresses the input into a lower-dimensional feature representation, and the latter reconstructs an output from the compressed representation that is as close to the input frame as possible. The reconstruction error is then used to distinguish abnormal events from normal ones, since normal events yield smaller errors whereas abnormal events yield larger ones.

To extract the appearance feature as well as the motion feature from the video input, some approaches [28, 30] learnt normal events using an autoencoder that combined stacked convolutional layers to learn the spatial structure with a stacked convolutional LSTM to learn the temporal representation. In one case [28], a human observer was used for validation as a form of continuous learning. Recently, continual learning has been applied to video anomaly detection to deal with the forgetting problem that arises while training deep neural networks. Doshi and Yilmaz [6] used a deep learning model to extract feature embeddings for input video frames; a set of nominal feature vectors was stored in a memory module using k-nearest neighbors, and this process was trained over multiple sessions for continual learning.

A two-stream model [14] was often used to capture both the appearance and motion information. Such a model typically had an architecture that included an autoencoder and a discriminator. The anomaly scores of the two streams were combined for more accurate decision. Similarly, Li et al. [15] introduced a two-stream network to encode the appearance and motion of normal events in videos. Each stream of the network included two spatio-temporal autoencoders using 3D video cuboids as input. The 3D video cuboids were stacked from multiple patches which were partitioned at the same location in continuous frames. To overcome the high computational cost of optical flow, Chang et al. [2] used two autoencoders to separately exploit spatial and temporal information of videos. The spatial autoencoder encoded the scenes and objects while the temporal one captured the movement information of the objects. Fang et al. [8] proposed a multi-encoder single-decoder model to encode both motion and content cues. The network had a motion encoder and two content encoders. The outputs of these encoders were concatenated and reconstructed by a decoder.

A 3D convolutional neural network has the capability of learning both spatial and temporal information in videos, corresponding to appearance and movement, respectively. Deepak et al. [4] showed that an encoder with a convolutional LSTM layer processed spatial information whereas a decoder captured temporal information. Recently, deep autoencoders have been used to reconstruct the input. For instance, [1] introduced a probabilistic model using an autoregressive process to estimate the density of the latent vector extracted by an encoder. In addition, [10] reconstructed the input using an autoencoder with a memory module: the memory contents were learnt from normal samples during training and used to reconstruct the testing input, so that an abnormal event produced a large reconstruction error. On the other hand, [13] proposed a three-stage method that required less computational cost; the authors substituted the autoencoder with a single-hidden-layer feedforward neural network that reconstructed the input frames by minimizing the reconstruction error with less computation time.

The sparse coding-based anomaly detection approaches [25, 38] detect anomalies using a learnt event dictionary: normal events can be reconstructed from the learnt dictionary with a small reconstruction error, while abnormal events lead to a large reconstruction error. Within this context, [25] proposed a sparse coding based deep neural network using stacked recurrent neural networks to optimize the sparse coefficients, while [38] introduced an optimization network based on a novel LSTM network. A fast sparse coding network [32] adopted a lightweight two-stream neural network to extract spatio-temporal features for learning a normal event dictionary.

Prediction-based method.

This approach utilises a few previous frames in predicting whether the future frame would be normal or abnormal. The basic assumption is that the normal event is predictable whereas the abnormal one is unpredictable [20]. The frame prediction approaches usually exploit both appearance and movement information of the given video since the input contains several consecutive frames, which include motion features.

The Generative Adversarial Network (GAN), consisting of a generator and a discriminator, is one of the most popular recent architectures, and it can be used to generate the next frame for the video anomaly detection task. For instance, [20] used a U-Net as the generator to predict the next frame and adopted a patch discriminator to distinguish the generated frames. Zhou et al. [39] used a similar architecture, in which a U-Net generator and a patch discriminator were employed to predict the future frame; moreover, an attention-driven loss was used to deal with the imbalance between foreground objects and the static background that typically appears in anomaly detection videos. Similarly, [36] integrated a segmentation map into the PSNR (Peak Signal to Noise Ratio) to assign different weights to the background and the foreground, and also proposed a patch-level loss in their prediction model to improve the quality of the foreground objects. In addition, [16] used a generative model to predict the future frame; in this case, the original U-Net generator was replaced by a spatio-temporal U-Net in which three ConvLSTM layers were added in the middle to model temporal information. Lu et al. [22] combined a variational autoencoder and ConvLSTM to predict the future frame, where the ConvLSTM represented the recurrent relationship among frames in the given video. Doshi and Yilmaz [5] predicted whether the future frame would be normal or abnormal using a GAN; an object detection system extracted the locations and appearance features of objects, and the reconstruction errors together with the extracted object information were processed by a statistical module to detect anomalies.

Hybrid method.

Tang et al. [29] combined a future frame prediction approach with a reconstruction approach to exploit the advantages of both. Two U-Net blocks were connected in series: the first block predicted whether the future frame was normal or abnormal and the second reconstructed the frame. On the other hand, [27] used dynamic skeleton features for video anomaly detection: the skeletal movements were decomposed into global body movement and local body posture, and then fed into two recurrent encoder-decoder branches that reconstructed their own input and predicted the future frame. Chang et al. [3] adopted a two-stream network that exploited spatial and temporal information; in the first stream an autoencoder encoded spatial information, while in the second a motion autoencoder predicted the RGB difference between the first and the last frame to obtain motion information instead of computing expensive optical flow. On the other hand, object-based multi-task learning [9] jointly trained three self-supervised tasks and one knowledge distillation task for anomaly detection in video: object detection was carried out in each frame with a pre-trained detector, and a sequence of detected objects from consecutive frames was fed into a 3D CNN and four 2D prediction heads to detect anomalous events.

Although two-stream networks and 3D CNNs have proved capable of modeling motion information without computing optical flow, this capability comes at a high computational cost. In this study, we propose a simple autoencoder architecture, which includes an encoder to extract features from the input video frames and a decoder to generate the future frame, trained in an unsupervised fashion since the training videos contain only normal events. The temporal information is exploited by an effective temporal shift method, which is inserted into the network at zero extra computation and zero extra parameters [19]. In [39], an attention map is learned to force the model to focus on the foreground rather than the background; however, this attention is effective mainly on single-scene datasets such as UCSD Ped2 and CUHK Avenue. To handle multi-scene datasets such as ShanghaiTech, we propose a channel attention-based decoder that automatically focuses on important objects while predicting the future frame.

3 Method

In this section, we present our framework for video anomaly detection in detail. As mentioned before, abnormal events are very rare in real-world scenarios. Therefore, it is difficult to collect and label training data that cover all types of anomalies. To deal with this problem, we propose an unsupervised learning method for detecting abnormal events in video.

2D CNNs [1, 10] have been used for diverse video anomaly detection tasks, yet they cannot represent temporal features very well. To handle this problem, some approaches [28, 30] combine a 2D CNN with a temporally recurrent network such as a convolutional LSTM; such a combination aims to propagate temporal information across frames, but the more layers the model has, the more complex it becomes. Another way to capture both spatial and temporal information from videos is a 3D CNN [4], with which both spatial and temporal features can be learnt, although training such a network requires considerable effort. A few recent state-of-the-art methods [2, 14, 15] adopt a two-stream neural network consisting of a spatial stream and a temporal stream: the spatial stream exploits appearance features while the flow stream captures motion information, yet the computation of optical flow is rather expensive.

Problem statement.

We propose a network for video anomaly detection using the future frame prediction approach. The input of the network is a sequence of frames in a video and the network tries to predict the future frame [20]. Given several consecutive frames \(I = \{I_{1}, I_{2}, \ldots, I_{t}\}\), the predicted frame is \(\hat{I}_{t+1}\) and its ground truth frame is \(I_{t+1}\). The anomaly score can then be calculated from the difference between the predicted frame \(\hat{I}_{t+1}\) and the ground truth frame \(I_{t+1}\).

3.1 Network architecture

The overall structure of the proposed model is shown in Fig. 1. It has an autoencoder architecture consisting of an encoder and a decoder: the former captures both appearance and motion information of the input video frames, and the latter predicts the future frame using the spatio-temporal features extracted by the encoder.

Fig. 1

The overall architecture of our network for video anomaly detection. Initially, a sequence of input video frames is fed into a DCNN to extract features. Then, the extracted visual features are passed through two branches to further exploit spatial and temporal information, respectively. The spatial and temporal features are combined and passed through three deconvolutional layers to generate a future video frame. Note that a Channel Attention (CA) module is applied at each deconvolutional layer, in a cascaded manner, to exploit the channel dependency of the features and enhance the network performance

Encoder.

From a given sequence of t frames, high-level features are extracted using a deep and wide convolutional neural network, i.e. WiderResNet [34]. In order to exploit both spatial and temporal information of the video frames, the last feature map obtained from this network is passed through two branches, as illustrated in Fig. 1. In the temporal branch, a temporal shift is applied to model temporal features over several input frames (Section 3.2), while in the spatial branch the extracted features of the input frames are concatenated to maintain the spatial information (Section 3.3). Then, the outputs of the two branches are combined by an element-wise sum and fed into the decoder to predict the corresponding future frame.

Decoder.

The output of the encoder is then used as the input of the decoder. The combined features are passed through the decoder to restore the details and the spatial resolution of the predicted frame. Each layer of the decoder is a sequence of blocks consisting of deconvolution, batch normalization, and a Rectified Linear Unit (ReLU) activation function. To exploit the channel relationship of the features, channel attention is applied after each deconvolution block, as described in Section 3.4. In addition, the output features of the channel attention are concatenated with the corresponding low-level features extracted by the deep convolutional neural network at the same spatial resolution, and the combined features are then deconvolved in the next layer to upsample them back to the input frame resolution.
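To make the data flow of one decoder stage concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the deconvolution kernel size and stride (4, 2) for 2× upsampling are assumptions, and the channel attention module (sketched in Section 3.4) is passed in as an argument.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder stage: deconvolution + batch norm + ReLU, channel attention,
    then concatenation with the matching low-level encoder feature (skip)."""
    def __init__(self, in_channels, out_channels, attention=None):
        super().__init__()
        # kernel size 4, stride 2 upsamples the feature map by a factor of 2 (assumed)
        self.deconv = nn.ConvTranspose2d(in_channels, out_channels,
                                         kernel_size=4, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # channel attention module (see Section 3.4); identity if none is given
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x, skip):
        x = self.relu(self.bn(self.deconv(x)))   # upsample, normalize, activate
        x = self.attention(x)                    # exploit channel dependency
        return torch.cat([x, skip], dim=1)       # fuse with the low-level feature
```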

3.2 Temporal branch

The temporal shift process [19] has been used in the video understanding area. In the present work, we utilize the temporal shifting technique to exploit temporal information in the video anomaly detection task. The shift operation is performed along the temporal dimension: part of the channels is shifted to the next frame while the remaining part is kept, as illustrated in Fig. 2. Then, the feature of the current frame is combined with the feature of the previous one. For the given input feature maps \(\mathbf {F_{tem}} \in \mathbb {R}^{N \times T \times C \times H \times W}\), the output features are computed as:

$$ \mathbf{F_{tem}^{\prime}} = Shift(\mathbf{F_{tem}}), $$
(1)

where Shift refers to the shift operation. In Fig. 2, the input features consist of four frames T = {t1,t2,t3,t4}. Part of the channels of the current frame is shifted to the next frame; note that part of the channels of frame t2 is replaced by part of the channels of frame t1.
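As a concrete reference, below is a minimal PyTorch sketch of a TSM-style shift along the temporal axis; the fraction of shifted channels (`shift_div`) is an assumed hyperparameter, not a value stated in the paper.

```python
import torch

def temporal_shift(f_tem, shift_div=8):
    # f_tem: feature maps of shape (N, T, C, H, W)
    n, t, c, h, w = f_tem.size()
    fold = c // shift_div                      # number of channels to shift (assumed fraction)
    out = torch.zeros_like(f_tem)
    out[:, 1:, :fold] = f_tem[:, :-1, :fold]   # shift part of the channels to the next frame
    out[:, :, fold:] = f_tem[:, :, fold:]      # keep the remaining channels in place
    return out
```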

Fig. 2

The temporal shift. Given the feature map F, the output feature \(\mathbf {F^{\prime }}\) is obtained by applying a temporal shift to exploit the temporal information. As illustrated, the features of different frames are depicted in different colors in each column. Part of the channels of frame t1 (blue) is shifted to the next frame t2 (green)

3.3 Spatial branch

In the spatial branch, the extracted features obtained from the deep convolutional neural network are aggregated across frames. Since the aggregated features contain a large number of channels, a 1 × 1 convolution is applied to the combined features to reduce the number of channels and thus the computational complexity.

The features of the temporal and spatial branches are combined as follows:

$$ \mathbf{F} = \mathbf{F_{tem}} + \mathbf{F_{spa}} $$
(2)

where Ftem and Fspa denote the output features of the temporal and spatial branches respectively.
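A minimal sketch of the spatial branch and the fusion in (2), assuming the 1 × 1 convolution reduces the concatenated T·C channels to match the channel count of the temporal branch; the exact channel sizes are not specified in the paper.

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    def __init__(self, in_channels, num_frames, out_channels):
        super().__init__()
        # 1x1 convolution reduces the T*C concatenated channels to out_channels
        self.reduce = nn.Conv2d(in_channels * num_frames, out_channels, kernel_size=1)

    def forward(self, feats):                    # feats: (N, T, C, H, W)
        n, t, c, h, w = feats.shape
        stacked = feats.reshape(n, t * c, h, w)  # aggregate frame features along channels
        return self.reduce(stacked)              # F_spa: (N, out_channels, H, W)

# Fusion of the two branches, Eq. (2): F = F_tem + F_spa (element-wise sum),
# assuming both branches produce tensors of the same shape.
```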

3.4 Channel attention

In order to exploit the channel dependency of features, channel attention [12, 31, 37] has been used in many fields. For instance, 'Squeeze-and-Excitation' [12] adopts global average pooling while CBAM [31] takes both average-pooling and max-pooling to obtain channel-wise statistics. In our channel attention module, two convolutional layers are chosen as in [37] instead of two fully-connected layers [12, 31].

After each deconvolutional layer, we apply channel attention for the feature map \(\mathbf {F} \in \mathbb {R}^{C \times H \times W}\). The output feature \(\mathbf {F^{\prime }}\) is computed as follows:

$$ \mathbf{F^{\prime}} = \mathbf{F} \otimes s(\mathbf{F}), $$
(3)

where s(F) refers to the channel attention, and ⊗ denotes element-wise product.

Channel Attention.

The output of each deconvolutional layer is given as the input feature map \(\mathbf {F} \in \mathbb {R}^{C \times H \times W}\) of the channel attention module. In order to exploit channel dependency, global average pooling is applied to the feature F [12, 37], producing a vector v with C values. Then, a 1 × 1 convolution is applied to reduce the dimension by a reduction ratio r, followed by a rectified linear unit (ReLU) activation function δ and a second 1 × 1 convolution that recovers the channel dimension.

$$ s(\mathbf{F})=\sigma(\mathbf{W_{2}}\delta(\mathbf{W_{1}} \mathbf{v})), $$
(4)

where \(\mathbf {W_{1}} \in \mathbb {R}^{C/r \times C}\) and \(\mathbf {W_{2}} \in \mathbb {R}^{C \times C/r}\), and σ denotes the sigmoid function.
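The following PyTorch sketch mirrors (3)-(4): global average pooling, a 1 × 1 convolution reducing the channels by ratio r, a ReLU, a second 1 × 1 convolution restoring the channels, and a sigmoid gate. The default reduction ratio of 16 is an assumption, not a value given in the paper.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                         # squeeze: (C, H, W) -> (C, 1, 1)
        self.conv1 = nn.Conv2d(channels, channels // reduction, 1)  # W1: reduce to C/r
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels // reduction, channels, 1)  # W2: restore to C
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                                           # f: (N, C, H, W)
        s = self.sigmoid(self.conv2(self.relu(self.conv1(self.pool(f)))))
        return f * s                                                # Eq. (3): element-wise product
```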

Residual channel attention block.

It is found that the residual channel attention block provides better results than the plain channel attention, especially when training on large datasets such as the Avenue or ShanghaiTech dataset. In the residual channel attention block, the channel attention is located after two 3 × 3 convolution layers, just before the residual connection, with a ReLU activation between the two convolutional layers as shown in Fig. 3. Given the input feature map \(\mathbf {F} \in \mathbb {R}^{C \times H \times W}\), the residual channel attention block is computed as:

$$ \mathbf{F^{\prime}} = \mathbf{F} \oplus (\mathbf{X} \otimes s(\mathbf{X})), $$
(5)

where F and \(\mathbf {F^{\prime }}\) are the input and output feature map, respectively, and s(X) refers to the channel attention. X is obtained by:

$$ \mathbf{X} = \mathbf{W}_{2} \delta (\mathbf{W}_{1}\mathbf{F}), $$
(6)

where δ denotes the ReLU activation function. W1 and W2 are the weight sets of the two convolutional layers.
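A sketch of (5)-(6) reusing the ChannelAttention module sketched above; `padding=1` keeps the spatial size fixed so that the residual sum is valid, which is an assumption consistent with Fig. 3.

```python
import torch.nn as nn

class ResidualChannelAttentionBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # W1, 3x3 convolution
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # W2, 3x3 convolution
        self.ca = ChannelAttention(channels, reduction)           # module sketched above

    def forward(self, f):
        x = self.conv2(self.relu(self.conv1(f)))                  # Eq. (6)
        return f + self.ca(x)                                      # Eq. (5): F' = F + X ⊗ s(X)
```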

Fig. 3

Residual channel attention block. The input features are passed through two 3 × 3 convolution layers with a ReLU activation in between before the channel attention is applied

3.5 Objective function

The goal of our network is to predict the future frame \(\hat {I}_{t+1}\) from a sequence of input frames \(\{I_{1}, I_{2}, \ldots, I_{t}\}\). Since each frame consists of many pixels and each pixel has an intensity, constraints on the intensity and its gradient are important factors in minimizing the prediction error. The similarity of all pixels in RGB space is ensured by an intensity constraint that compares every pixel value between the predicted frame and the ground-truth frame as follows:

$$ L_{int}(I,\hat{I})=\left\|I-\hat{I}\right\|_{2}^{2} $$
(7)

To reduce the blur that can occur when only the l2 distance is adopted, a gradient constraint is added to obtain a sharper video frame. This loss computes the difference between the absolute gradients along the two spatial dimensions as follows:

$$ \begin{array}{@{}rcl@{}} L_{gra}(I,\hat{I})&=&{\sum}_{i,j} \left\|{|\hat{I}_{i,j}-\hat{I}_{i-1,j}|-|I_{i,j}-I_{i-1,j}|}\right\|_{1}\\ &&+ \left\|{|\hat{I}_{i,j}-\hat{I}_{i,j-1}|-|I_{i,j}-I_{i,j-1}|}\right\|_{1} \end{array} $$
(8)

To measure structural similarity, the Multi-Scale Structural Similarity (MS-SSIM) index is used [22, 23]; note that MS-SSIM was originally proposed for image quality assessment at different resolutions. The combination of loss functions, including the intensity, gradient, and multi-scale structural similarity constraints, is given as follows:

$$ L_{con}(I,\hat{I})=\alpha L_{int}(I,\hat{I}) + \upbeta L_{gra}(I,\hat{I}) + \gamma L_{mss}(I,\hat{I}), $$
(9)

where α, β, and γ are three coefficients that balance the weights between the losses.
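A sketch of the combined objective (7)-(9); the coefficient values, the mean (rather than sum) reduction, and the use of the third-party pytorch_msssim package for the MS-SSIM term are assumptions, not choices confirmed by the paper.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim   # assumed third-party MS-SSIM implementation

def prediction_loss(pred, gt, alpha=1.0, beta=1.0, gamma=1.0):
    # Eq. (7): intensity constraint (squared L2 distance, averaged over pixels)
    l_int = F.mse_loss(pred, gt)

    # Eq. (8): gradient constraint along the two spatial dimensions
    def abs_grads(x):
        return (torch.abs(x[:, :, 1:, :] - x[:, :, :-1, :]),
                torch.abs(x[:, :, :, 1:] - x[:, :, :, :-1]))
    pgx, pgy = abs_grads(pred)
    ggx, ggy = abs_grads(gt)
    l_gra = F.l1_loss(pgx, ggx) + F.l1_loss(pgy, ggy)

    # MS-SSIM constraint; inputs in [-1, 1] are shifted to [0, 1] before scoring
    l_mss = 1.0 - ms_ssim((pred + 1) / 2, (gt + 1) / 2, data_range=1.0)

    # Eq. (9): weighted combination of the three constraints
    return alpha * l_int + beta * l_gra + gamma * l_mss
```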

3.6 Anomaly detection

To detect anomalies, we use an anomaly score S(t), which measures the difference between the ground truth frame I and the predicted frame \(\hat {I}\). Since the Peak Signal to Noise Ratio (PSNR) is widely used in assessing image quality, the quality of a predicted frame is calculated as follows:

$$ PSNR(I, \hat{I})=10\log_{10} \frac{[\max_{\hat{I}}]^{2}}{\frac{1}{N}{\sum}_{i=1}^{N}(I_{i}-\hat{I}_{i})^{2}} $$
(10)

where N denotes the number of pixels in a frame and \([\max_{\hat{I}}]\) is the maximum value of \(\hat {I}\). A higher PSNR indicates that the frame has higher quality; in other words, the difference between the ground truth frame and the predicted frame is small.

Following [20], the PSNR of all frames in each test video is normalized to the range [0,1], and we compute the anomaly score S(t) for each frame by using the following formula:

$$ S(t)=\frac{PSNR_{t}-min(PSNR)}{max(PSNR)-min(PSNR)} $$
(11)

where min(PSNR) and max(PSNR) denote the minimum and the maximum PSNR values in the given video sequence, respectively. The anomaly score of a predicted frame indicates whether the frame is normal or abnormal with a given threshold.
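A minimal NumPy sketch of (10)-(11); the value of `max_val` and the array layout are assumptions, while the per-video min-max normalization follows the text.

```python
import numpy as np

def psnr(gt, pred, max_val=1.0):
    # Eq. (10): PSNR between the ground truth and the predicted frame
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def anomaly_scores(psnr_values):
    # Eq. (11): min-max normalization of the per-video PSNR values to [0, 1];
    # a low score corresponds to a poorly predicted (likely abnormal) frame
    p = np.asarray(psnr_values, dtype=np.float64)
    return (p - p.min()) / (p.max() - p.min())
```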

4 Experimental evaluation

4.1 Datasets

Performance evaluation was carried out using three benchmark datasets: the UCSD Pedestrian dataset [26], the CUHK Avenue dataset [21] and the ShanghaiTech dataset [24]. Figure 4 shows sample cases from each. In each dataset, the training set contains only normal videos, whereas the test set contains both normal and abnormal frames. In each test video, the ground truth annotation includes a binary flag per frame indicating whether the frame contains an anomalous event, where label 0 denotes a normal frame and label 1 an abnormal frame.

Fig. 4

Examples of normal (top row) and abnormal (bottom row) frames in the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, respectively. The abnormal objects are denoted by red boxes, such as a man riding a bicycle (d), throwing a bag, and riding a motorbike

UCSD Dataset.

The UCSD dataset had two subsets, namely Ped1 and Ped2, which were recorded at two different outdoor locations; the former had a resolution of 158 × 238 and the latter 240 × 360. Pedestrians walking across the camera view constituted the normal events used for training, whereas the abnormal events were defined by the appearance of a car, a biker, a skater or a wheelchair. Following [5, 10], Ped1 was excluded from our experiments because of its lower resolution. Ped2 contained 16 training videos and 12 test videos, corresponding to 2550 frames for training and 2010 frames for testing, respectively.

CUHK Avenue dataset.

This dataset consisted of 16 training and 21 test videos, corresponding to 15,328 frames and 15,324 frames, respectively. The resolution of each video frame was 360 × 640 pixels. There were 47 abnormal events, such as throwing objects, loitering, and running across the gate.

ShanghaiTech Campus dataset.

The ShanghaiTech Campus dataset was one of the most challenging datasets for video anomaly detection, containing 130 abnormal events. The dataset had 330 training and 107 test videos from 13 different scenes with various lighting conditions and camera angles. It had 317,398 frames and each frame had a resolution of 480 × 856 pixels. The dataset was split into 274,515 training frames and 42,883 test frames.

4.2 Parameter and implementation

Each video frame was resized to 224 × 288 pixels for Ped2, 192 × 320 for CUHK Avenue, and 192 × 288 for ShanghaiTech, respectively. The intensity of each frame was normalized to the range [− 1,1] before being fed into the model. The learning rate was set to 2e-4 initially and decreased to 1e-4 at epoch 60 for Ped2, 50 for Avenue, and 30 for ShanghaiTech, respectively. The Adam optimizer was adopted for training our network. To reduce computational complexity, we utilized the penultimate feature map of the deep convolutional neural network in the encoder, instead of the last one, when training on the Avenue and ShanghaiTech datasets. After choosing a sequence of five video frames randomly from the training set, the first four frames were used as input and the fifth frame was used as the ground truth frame. The ground truth frame was then compared with the frame predicted by the model in calculating the anomaly score.
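A minimal sketch of the input pipeline described above, assuming `clip` is a uint8 tensor of five consecutive frames with shape (5, 3, H, W); the frame size shown is the Ped2 setting from the text, and the optimizer line uses the stated initial learning rate with PyTorch-default betas (an assumption).

```python
import torch
import torch.nn.functional as F

def prepare_clip(clip, size=(224, 288)):          # 224 x 288 for Ped2, as in the text
    clip = clip.float() / 255.0                   # scale intensities to [0, 1]
    clip = clip * 2.0 - 1.0                       # normalize to [-1, 1]
    clip = F.interpolate(clip, size=size, mode="bilinear", align_corners=False)
    inputs, target = clip[:4], clip[4]            # four input frames, one ground truth frame
    return inputs, target

# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial learning rate from the text
```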

Evaluation metric.

Following the prior work [20, 39], the frame-level area under the curve (AUC) was used in evaluating the performance of our proposed network. The AUC was obtained by computing the area under the receiver operating characteristic (ROC) curve with varying threshold values for the anomaly scores. A higher AUC value indicated better anomaly detection performance.
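A sketch of the frame-level AUC computation with scikit-learn, assuming `labels` holds the per-frame ground truth (0 = normal, 1 = abnormal) and `scores` the anomaly scores S(t) of (11), both concatenated over the test videos; since a low S(t) indicates an anomaly under (11), 1 − S(t) is used here as the abnormality score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(labels, scores):
    # higher values of (1 - S(t)) indicate more abnormal frames
    scores = np.asarray(scores, dtype=np.float64)
    return roc_auc_score(labels, 1.0 - scores)
```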

4.3 Ablation study

4.3.1 Performance evaluation of processing units in the network

Given that our network contains three major processing units (spatial processing, temporal processing, and attention), an ablation study was carried out to evaluate their effectiveness. Table 1 shows the results for combinations of the three components. First, when the temporal processing and the attention module are excluded, the network has only the spatial processing unit. Secondly, the effectiveness of both spatial and temporal features, without the channel attention module in the decoder, is shown in the second row of Table 1. Thirdly, only spatial features are used as input of the decoder to estimate the capability of the channel attention, which aims to exploit the attention across channels. Finally, the performance of the whole network is shown in the bottom row.

Table 1 Comparison between different processing units of the proposed network in terms of AUC (%) on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets. When the spatial and temporal processing units are combined with the channel attention unit, the system performs best

The AUC performance (%) of the proposed network using WiderResNet38 [34] as backbone with different combinations of components on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets is shown in Table 1. The performance of the baseline, which contains only spatial features, was improved by combining it with the other components, i.e. temporal processing and attention. For instance, the channel attention component improved the performance of the network significantly. Notice that the network using both spatio-temporal features and channel attention achieved the highest performance, reaching 97.4% for UCSD Ped2, 86.7% for CUHK Avenue and 73.6% for ShanghaiTech, respectively, confirming that the combination of the spatial and temporal branches provides more information for encoding the input frames and that the channel attention module plays a vital role in restoring the future frame well.

In particular, the ROC curves for the UCSD Ped2 dataset are shown in Fig. 5, wherein the red and orange curves denote the methods using spatial and spatio-temporal features, respectively. The green curve denotes the proposed method using spatial features in the encoder and channel attention modules in the decoder. The black curve denotes the full proposed approach, which includes the spatial and temporal branches in the encoder and efficient channel attention in the decoder, reaching 97.4% on the UCSD Ped2 dataset.

Fig. 5

Frame-level ROC curves for three benchmark datasets

4.3.2 Evaluation of deep convolutional neural networks as backbone

To show the effectiveness of our network architecture, this section reports the performance of the proposed network with different deep convolutional neural networks as backbone in Table 2. The network architecture is kept unchanged and only the backbone is replaced. The proposed network with every backbone except ResNet-50 outperforms the baseline method [20] on the UCSD Ped2 and ShanghaiTech datasets, suggesting that the proposed method can achieve high performance using features extracted by different deep convolutional neural network backbones.

Table 2 Comparison of the proposed network with different deep convolutional neural networks as backbone in terms of AUC (%). The proposed network using WiderResNet38 [34] as backbone achieves the best performance

For instance, our network using WiderResNet38 [34] as backbone gives the best performance for the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, achieving AUCs of 97.4%, 86.7% and 73.6%, respectively. It also achieves 96.7% using SE-ResNeXt-101 [12] as backbone for UCSD Ped2, and 73.5% using SE-ResNeXt-50 [12] as backbone for ShanghaiTech.

Figure 5b and c show the frame-level ROC curves using different deep neural networks as backbone for the CUHK Avenue and ShanghaiTech datasets, respectively. The blue line denotes the ROC curve for ResNet-50 whereas the orange line is for ResNet-101; the ROC curves for SE-ResNeXt-50 and SE-ResNeXt-101 are drawn in green and red, respectively, and the black line is the ROC curve for WiderResNet38 as backbone.

4.4 Comparison with state-of-the-art

Table 3 compares our approach with recent state-of-the-art methods on the three standard anomaly datasets. These methods are categorized into three groups: reconstruction-based methods, prediction-based methods, and hybrid methods. Among them, our method achieves the best performance on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, reaching AUCs of 97.4%, 86.7% and 73.6%, respectively. Note that the frame-level AUCs of our method are higher than those of the frame prediction-based anomaly detection baseline [20] by about 2% for UCSD Ped2 and approximately 1% for the CUHK Avenue and ShanghaiTech Campus datasets, suggesting that our network outperforms most recent anomaly detection methods in terms of AUC. The ShanghaiTech dataset is challenging because it is a large-scale dataset, including over 270K training frames and 42K test frames; since it contains a large amount of data with diverse types of normal and abnormal events, the performance on it is relatively lower than on the other datasets.

Table 3 Comparison with recent state-of-the-art methods for video anomaly detection in terms of AUC (%) on three benchmark datasets. The proposed network uses WiderResnet38 [34] as a backbone

The overall shape of our proposed network is an autoencoder, consisting of an encoder and a decoder. In the encoder, the temporal branch models the temporal information by applying the effective temporal shift method, which adds no extra parameters and performs the shift operation at zero computation [19]. As mentioned in Section 3.3, a 1 × 1 convolution is the main operation in the spatial branch, reducing the dimension of the aggregated features. On the other hand, channel attention adds only a small number of extra parameters and little computation [31]: as shown in Fig. 3, a channel attention module includes a global average pooling, two 2D convolutions, a rectified linear unit activation and a sigmoid function. The proposed architecture thus appears to be an effective approach for anomaly detection in videos.

4.5 Visualization

4.5.1 Anomaly score

Figures 6, 7 and 8 show how the anomaly score can be visualized along the video frames for the three anomaly datasets. Note that the anomaly score, drawn as a blue line in each figure, changes rapidly between normal and abnormal events, indicating that our network is able to distinguish the sporadically occurring abnormal events among the vast normal ones within a given video. Figure 6 shows how the anomaly score varies for the normal and abnormal events occurring in test video 02 of the UCSD Ped2 dataset. The first two frames show only walking pedestrians, whereas the remaining two frames contain a bicycle rider among the pedestrians. Notice that the anomaly score increases dramatically when the rider appears within the frame and remains high until he disappears.

Fig. 6

Anomaly score of the test video 02 in the UCSD Ped2 dataset. The red rectangles denote the abnormal objects (bicycle riding) in the frames. Note that the anomaly score, drawn as a blue line, increases as the abnormal object appears

Figure 7 visualizes how the anomaly score changes as a running man appears in front of a building in test video 02 of the CUHK Avenue dataset. Three abnormal events are shown: the first two record the running man, and the third comes from the shaking of the camera. The anomaly score rises steeply when the man appears and then decreases sharply when he steps out of the frame. A noticeable fact is that the third event shows the highest anomaly score, presumably because the camera shake affects the whole frame.

Fig. 7

Anomaly score of the test video 02 in the CUHK Avenue dataset. The three pink areas indicate the ground truth intervals of the anomalous events. The red rectangles denote the abnormal running object. The anomaly score is drawn as a blue line that increases whenever the abnormal object is moving in front of the building. Note that the third abnormal event comes from the shaking camera

The anomaly score and some key frames of test video 01_0063 of the ShanghaiTech dataset are visualized in Fig. 8. The two normal frames contain a few pedestrians on the walkway, while the abnormal event involves a bicycle rider. The anomaly score increases rapidly when the bicyclist comes into the scene and decreases as the rider disappears, confirming that our network is able to distinguish the abnormal frames from the vast normal frames.

Fig. 8

Anomaly score of the test video 01_0063 in the ShanghaiTech dataset. The pink area indicates the ground truth. The red rectangle denotes the abnormal riding object

4.5.2 Network visualization

In this section, the visualizationsFootnote 2 for the UCSD Ped2 and ShanghaiTech Campus datasets are shown in Figs. 9 and 10, respectively. Each figure visualizes a sample normal event in the left column and an abnormal event in the right column, with the ground truth frames on the first row. The features extracted by the DCNN are fed to the spatial and temporal branches, and the resulting spatio-temporal features are visualized on the second row. In addition, channel attention [31] is applied to focus on some objects among others, visualized as the attention maps on the third row. Given that the prediction error can be measured as the difference between the predicted frame and its ground truth frame, our network is designed to produce a small prediction error for a normal frame and a large prediction error for an abnormal frame. Following [36, 39], the prediction errors for the UCSD Ped2 and ShanghaiTech datasets are visualized on the last row of Figs. 9 and 10, respectively.

Fig. 9

Network visualization for the UCSD Ped2 dataset. The normal and abnormal samples are shown in the left and right columns, respectively. From top to bottom, the ground truth frames (a, b), the spatio-temporal maps (c, d), the attention maps (e, f), and the prediction errors (g, h) are shown. Note that the attention map has half the resolution of the input video

Fig. 10

Network visualization for the ShanghaiTech Campus dataset. The normal case is shown in the left column and the abnormal one in the right column. From top to bottom, the ground truth frames (a, b), the spatio-temporal maps (c, d), the attention maps (e, f) and the prediction errors (g, h) are shown. Note that the attention map has half the resolution of the input video

In Fig. 9, the ground truth samples of a normal and an abnormal event (a, b) are shown on the first row and the corresponding spatio-temporal maps (c, d) are visualized on the second row. In the attention map (f), the cyclist appears to be salient among the pedestrians. On the bottom row, the prediction error around the cyclist is larger than that around the pedestrians in the abnormal case (h), since the model has been trained with normal frames only.

A similar observation can be made for the ShanghaiTech dataset, as shown in Fig. 10, wherein the cyclist is again the abnormal object. The spatio-temporal maps (c, d) corresponding to the ground truth frames (a, b) of the normal and abnormal events are shown in the second row. On the third row, the attention of the network is distributed across the normal frame, whereas it is focused around the cyclist in the abnormal frame. The difference between the ground truth and the predicted frames is illustrated as the prediction error in the bottom row. Similar to the above case, the prediction error of the normal frame (g) is minimal, whereas the cyclist, as an abnormal event, produces a large prediction error around him, as shown in (h).

5 Conclusion and future work

This study presents a new video anomaly detection framework based on an attention-based residual autoencoder architecture. The proposed network is trained in an unsupervised fashion and exploits both spatial and temporal information in a unified network. The temporal shift is adopted for effective temporal feature extraction, and the channel attention mechanism is utilized to exploit the channel relationships of features, which significantly helps the model learn more effectively. Experiments on three anomaly detection benchmark datasets show that our network outperforms state-of-the-art methods. The ablation study shows that not only the spatio-temporal branches in the encoder but also the cascaded application of channel attention in the decoder are effective in improving the system performance. Moreover, the proposed network architecture works well on 2D data and may be generalizable to 3D data for real-world engineering applications [17, 18]. We look forward to applying this framework to practical surveillance systems.