1 Introduction

Anomalous event detection in videos refers to the identification of unexpected events or objects in specific scenarios, aiming to discover and locate abnormal events that may threaten public safety, such as robbery, fighting, and traffic accidents. The rarity of anomalous event samples in videos and the scene-dependent definition of abnormal events make efficient and accurate anomaly detection challenging. The emergence of deep learning techniques has provided new insights and greatly advanced the development of video anomalous event detection. Early methods mostly relied on traditional supervised learning techniques [1]. However, because anomalous event samples are scarce, training data are usually limited to normal events, which ultimately hampers the effectiveness of supervised learning. Therefore, contemporary researchers widely adopt unsupervised deep learning approaches that require only normal samples.

The key to unsupervised learning-based methods lies in extracting representative, discriminative, and accurate features that reflect the actual characteristics of video data. To address this issue, some researchers [2,3,4,5,6] have focused on extracting local features for anomaly detection, while others have concentrated on global information. Lee et al. [7] proposed a method based on a bidirectional multi-scale aggregation network, which learns the correspondence between appearance and motion patterns in video and uses bidirectional multi-scale feature aggregation together with joint appearance-motion detection for efficient spatiotemporal feature encoding. Nguyen et al. [8] designed a video anomaly detection method based on the correspondence between pattern appearance and motion. It consists of two streams sharing an encoder: the first reconstructs appearance using an autoencoder architecture, while the second predicts the motion of input video frames using a U-Net structure. Park et al. [9] also leveraged global information but overlooked the interference caused by the background, resulting in limited use of local target information.

Most of the aforementioned methods primarily use global or local information without considering regions of interest. In reality, anomalous events usually occur on or manifest through individuals or objects within video frames, making the moving entities in the foreground crucial for anomaly detection. As a result, many researchers have shifted their focus towards objects within the scene. Ionescu et al. [10] applied a single-shot detector [11] to every frame of the video. After isolating the objects, they used convolutional autoencoders to learn deep unsupervised features, directing the algorithm's attention towards the objects in the scene, and they reframed anomaly detection as a multi-class classification task rather than an imbalanced binary classification problem. Doshi et al. [12, 13] proposed a statistical framework for sequential anomaly detection, leveraging efficient object detectors to extract more meaningful features and improving training efficiency through transfer learning. Wang et al. [14] employed a self-supervised approach for deep outlier detection, addressing the limitations of supervised methods while allowing discriminative deep neural networks to be applied directly to deep outlier detection problems. Barbalau et al. [15] further explored the existing self-supervised multi-task learning (SSMTL) framework for video anomaly detection by introducing additional object-centric detection methods, aiming to improve the accuracy of object detectors within SSMTL. Although these methods mitigate, to some extent, the interference caused by complex background information, their excessive focus on objects often causes them to overlook the association between anomalous events and their contextual information.

To enhance the accuracy of anomaly event detection by fully utilizing the informative key region information with high anomaly occurrence, this paper proposes a Key Region Feature Enhancement Dual-channel Autoencoder (KRFE-DAE). Unlike existing methods, KRFE-DAE not only focuses on anomalous key regions for video anomaly event detection but also considers the interaction between foreground objects and global context, avoiding the oversight of anomalous events that may occur in the background. Specifically, we design a Key Region Extraction Network (KREN) to separate the foreground motion regions with high anomaly occurrence from the video background, reducing the interference of background redundancy. Furthermore, we introduce a dual-channel autoencoder with an attention mechanism to fuse the information from key region images and complete video frames, enhancing the key region features and improving the detection accuracy of the model.

In summary, our work makes the following contributions:

  • We propose a well-designed Key Region Extraction Network (KREN) that separates out background information and performs pixel-level segmentation of key regions where anomalies frequently occur, mitigating the interference from complex backgrounds.

  • We design a dual-channel autoencoder structure that highlights disturbances caused by anomalous data from both global context and key regions. The dual-channel structure preserves information that may trigger anomalies in the global context while enhancing the features of key regions.

  • We incorporate attention mechanisms into the decoder. During feature reconstruction, the attention decoder effectively exploits inter-channel correlations to suppress noise diffusion and focuses the model’s attention on anomalous key regions, thereby enlarging the reconstruction error of anomalous samples.

  • We conduct extensive experiments to demonstrate the generalization and effectiveness of the proposed dual-channel autoencoder network on three benchmark datasets.

The rest of this paper is organized as follows. Section 2 provides a brief overview of recent related work. Section 3 gives an overall and detailed description of the proposed KRFE-DAE. Section 4 presents the experimental results of our approach for anomalous event detection on benchmark datasets. Finally, the conclusion and a discussion of future work are given in Sect. 5.

2 Related Work

Currently, in the field of video anomalous event detection, numerous methods have been proposed. Among them, prediction-based methods and reconstruction-based methods are widely researched and applied as mainstream approaches.

2.1 Prediction-Based Methods

The principle of prediction-based models is to forecast future data using a training set and detect anomalies by analyzing the errors between predicted values and actual data. Previous studies [16,17,18] have demonstrated the effectiveness of prediction-based anomalous event detection methods. Liu et al. [19] argued that since anomalies can be seen as events that deviate from expectations, predicting future frames provides a more natural perspective. They employed a generator-discriminator structure similar to generative adversarial networks, with a U-Net architecture chosen as the generator for future frame prediction and a discriminator at the end of the network to determine whether the predicted frames are anomalous. Inspired by the predictive coding mechanism, Ye et al. [20] proposed a deep predictive coding network called AnoPCN to address the issue of narrow regularity score intervals in anomaly detection. The network consists of a predictive coding module and an error refinement module. The predictive coding module uses a convolutional recurrent neural network for prediction and incorporates explicit motion information for improved prediction performance. However, this model has limitations in modeling temporal information and utilizing adversarial techniques, resulting in poor training effectiveness.

To address this issue, Wang et al. [21] employed a multi-path structure and noise-tolerant loss to enhance the performance of anomaly detection in surveillance videos, avoiding the need for complex variational methods and additional loss functions. This approach effectively improves the training efficiency and robustness of the model. In another study [22], improvements were made to the LSTM model to achieve higher prediction accuracy for time series datasets with different distributions. They also proposed a pruning algorithm to dynamically determine the prediction error threshold for identifying anomalies, reducing false positives, and avoiding reliance on scarce anomaly labels, thereby further enhancing anomaly detection performance. Additionally, Li et al. [23] proposed an unsupervised traffic video anomaly detection method based on future object localization. They improved upon the traditional adversarial generative network by introducing a single encoder-dual decoder architecture with multiple fully convolutional layers. This architecture enables the network to predict future skeletal trajectories while simultaneously reconstructing past input trajectories. Although the aforementioned prediction-based video anomaly detection methods have made significant progress in detection results, their drawbacks should not be overlooked. Firstly, they excessively emphasize the unpredictability of anomalous events, and secondly, they overlook the fact that many normal events are also unpredictable, leading to high false alarm rates.

2.2 Reconstruction-Based Methods

Many researchers [24, 25] have chosen to address the challenge of video anomalous event detection using reconstruction-based methods. Reconstruction-based methods can more accurately restore the original video content and are less susceptible to detection performance degradation caused by prediction errors. Lu et al. [26] recognized the limitations of traditional autoencoders and introduced convolutional autoencoders to reduce the loss of spatial information. They proposed a sparse combination learning framework that decomposes complex problems into several easily solvable sub-problems, significantly improving detection speed without compromising detection quality. However, sparse coding requires substantial computational power when dealing with large-scale data. Building on this, Shi et al. [27] proposed a Conv-LSTM network based on reconstruction error for video anomalous event detection. This method uses Long Short-Term Memory (LSTM), a neural network capable of learning long-term dependencies in data, to construct an encoder-decoder structured Conv-LSTM network. Chong et al. [28] integrated the two approaches above: they designed a convolutional autoencoder with ConvLSTM layers to preserve temporal information in frame sequences during model training, addressing the issue of temporal information loss. Furthermore, many recent studies [29, 30] have explored additional possibilities of reconstruction-based methods.

In general, reconstruction-based methods for anomalous event detection have been proven effective, exhibiting good generalization capability and scalability. However, most of these methods only utilize complete video frames to learn normal patterns. These models often suffer from a lack of focus, as they do not prioritize the learning and reconstruction of complex regions that pose challenges during training. Consequently, their performance in detecting anomalous events is compromised when confronted with complex background interferences. To address this issue, this paper proposes KRFE-DAE, which employs a reconstruction-based anomaly discrimination approach. By incorporating key region feature enhancement, the network is directed to focus on regions that are more likely to exhibit anomalies, thereby improving the accuracy of anomalous event detection.

3 Methodology

3.1 Overall Architecture

In this paper, we propose a KRFE-DAE to detect anomalous events in videos. Figure 1 describes the architecture of our algorithm, which consists of three main components: KREN, dual-channel encoders, and attention decoder.

Fig. 1

The overall framework of the video anomalous event detection algorithm based on KRFE-DAE. Firstly, key region extraction is performed on the video frame sequence. Next, the key regions and original frames are fed into two separate encoders of a dual-channel autoencoder, extracting features from both the key regions and the original video frames. The fused features are then inputted into a self-attention decoder for frame reconstruction. Finally, the reconstruction error between the original and reconstructed video frames is computed to accomplish anomalous event detection in the video

We propose a video anomalous event detection algorithm based on KRFE-DAE. The algorithm shifts the focus of detection from the entire frame to key regions that are prone to anomalies. It adopts a dual-stream architecture, utilizing two encoders to process both the original images and the key region images. After feature extraction, feature fusion is employed to enhance the key region features, enabling the network to learn the anomaly-prone regions while preserving the global contextual information. To emphasize important features during decoding, an attention decoder is introduced, incorporating channel attention modules between each layer of the decoder network to enhance reconstruction performance. As the training data consists only of normal samples, the model struggles to reconstruct anomalous samples effectively. For anomalous samples, there exist substantial disparities between the reconstructed images and their corresponding original images, thereby facilitating anomaly detection by evaluating the reconstruction error.

3.2 Key Region Extraction

Anomalous events often occur on rapidly moving foreground objects, such as suddenly running individuals or vehicles abruptly entering the scene. Therefore, foreground motion targets serve as crucial regions for video anomaly detection. The purpose of key region extraction is to eliminate complex background information and retain only the highly anomalous regions. This aims to enhance the model’s sensitivity and accuracy in detecting abnormal events while reducing the possibilities of false positives and false negatives. The process of KREN proposed in this paper for extracting abnormal key regions is illustrated in Fig. 2.

Fig. 2

The process of KREN. When video frames are input to the KREN, the process begins with object detection and foreground detection to extract collections of human and object targets and foreground motion objects from the video frames, respectively. Subsequently, a comparison is performed between the objects in the two collections, eliminating background regions and objects with low anomaly occurrence rates. This results in the identification of key regions that significantly impact video anomaly detection. Finally, random occlusion is applied to the key regions to reduce interference from redundant targets

First, the KREN performs object detection on the original video frames using Mask R-CNN [31], enabling precise localization of targets and pixel-level segmentation from the background. To avoid the loss of anomalous targets caused by missed detections, this study reasonably lowers the threshold for object detection, aiming to capture as many potential targets as possible. Although the obtained object detection results have already removed a significant amount of background information and narrowed down the scope of key regions, there still exist numerous static redundant objects. To further pinpoint the key regions in the video frames, this paper employs ViBe [32] to extract foreground motion information. Subsequently, a one-to-one comparison is conducted between the objects obtained from object detection and foreground detection. The target box to be compared is denoted as \(B_{o}\), and the corresponding region in the foreground motion target image is denoted as \(B_{f}\).

$$\begin{aligned} P_{k}=\frac{\left| B_{o}\cap B_{f} \right| }{\left| B_{o}\cup B_{f} \right| } \end{aligned}$$
(1)

The key region determination score \(P_{k}\), the intersection over union of \(B_{o}\) and \(B_{f}\), is calculated according to Eq. (1). When \(P_{k}\) exceeds the detection threshold, the target box \(B_{o}\) is classified as a key region. This comparison is repeated until all target boxes have been processed. Since anomalous targets form a small proportion of all targets, and to better focus detection on potential anomalous targets, this study employs random occlusion to convert a certain proportion of the compared foreground targets into background. In this way, KREN can effectively mitigate the interference caused by redundant information and extract pixel-level key regions that have the most significant impact on anomaly detection.
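To make the comparison step concrete, the following is a minimal sketch of key region selection, assuming the pixel masks produced by Mask R-CNN and the binary foreground mask produced by ViBe are already available as NumPy arrays of the frame size. The function names, the 0.5 IoU threshold, and the data layout are illustrative assumptions; only the occlusion rate of 0.2 comes from the ablation in Sect. 4.2.1.

```python
import random
import numpy as np

def key_region_score(obj_mask, fg_mask):
    """P_k of Eq. (1): intersection over union between an object's pixel mask
    (from Mask R-CNN) and the foreground-motion mask of the frame (from ViBe)."""
    inter = np.logical_and(obj_mask, fg_mask).sum()
    union = np.logical_or(obj_mask, fg_mask).sum()
    return inter / union if union > 0 else 0.0

def extract_key_regions(object_masks, fg_mask, threshold=0.5, occlusion_rate=0.2):
    """Keep objects whose P_k exceeds the detection threshold, then randomly
    occlude a fraction of them (occlusion_rate = 0.2 performed best in Table 4)."""
    key = [m for m in object_masks if key_region_score(m, fg_mask) > threshold]
    kept = [m for m in key if random.random() > occlusion_rate]
    if not kept:
        return np.zeros_like(fg_mask, dtype=np.uint8)
    # Compose the retained masks into one pixel-level key-region image
    return (np.sum(kept, axis=0) > 0).astype(np.uint8)
```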

3.3 Dual-Channel Autoencoder

Due to the close association between the definition of abnormal event and the environment, it is not feasible to simply remove the background completely when considering how to mitigate the interference caused by complex environmental factors in the detection process. In order to enhance the foreground information while preserving the global contextual information, we propose a dual-channel autoencoder that effectively utilizes both global information and key region information.

The proposed dual-channel autoencoder network architecture is illustrated in Fig. 3. Both the global encoder and the key region encoder in the network employ the same structure of 3D-CNN, where each layer consists of a convolutional layer, a Batch Normalization layer, and a Leaky ReLU activation function. After extracting global and key region features using the two branch encoders, the features are fused and inputted into the decoder. The decoder consists of 3D deconvolutional layers, Batch Normalization layers, and Leaky ReLU activations, except for the last layer. The detailed configuration of the proposed dual-channel autoencoder is presented in Table 1, where \(C_{in}\) and \(C_{out}\) represent the input and output channels of each layer, respectively. \(N_{k}\), \(N_{s}\), and \(N_{p}\) denote the kernel size, stride, and padding size, respectively. \(SHAPE_{out}\) represents the output size.
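The sketch below illustrates the dual-channel structure described above in PyTorch. The layer counts, channel widths, strides, and fusion weight are illustrative assumptions rather than the exact configuration of Table 1, the channel attention modules inserted into the decoder (Sect. 3.4) are omitted for brevity, and fusion is written as the weighted sum later formalized in Eq. (7).

```python
import torch
import torch.nn as nn

def conv3d_block(c_in, c_out, k=3, s=(1, 2, 2), p=1):
    """One encoder layer: 3D convolution + BatchNorm + Leaky ReLU."""
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, s, p),
                         nn.BatchNorm3d(c_out), nn.LeakyReLU(0.2, inplace=True))

def deconv3d_block(c_in, c_out, k=3, s=(1, 2, 2), p=1, op=(0, 1, 1), last=False):
    """One decoder layer: 3D deconvolution (+ BatchNorm + Leaky ReLU except last)."""
    layers = [nn.ConvTranspose3d(c_in, c_out, k, s, p, output_padding=op)]
    if not last:
        layers += [nn.BatchNorm3d(c_out), nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)

class DualChannelAE(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128), alpha=0.5):
        super().__init__()
        make_encoder = lambda: nn.Sequential(
            *[conv3d_block(channels[i], channels[i + 1])
              for i in range(len(channels) - 1)])
        self.global_encoder = make_encoder()   # full video frames
        self.key_encoder = make_encoder()      # key-region images from KREN
        self.alpha = alpha                     # fusion weight, Eq. (7)/(8)
        rev = channels[::-1]
        self.decoder = nn.Sequential(
            *[deconv3d_block(rev[i], rev[i + 1], last=(i == len(rev) - 2))
              for i in range(len(rev) - 1)])

    def forward(self, frames, key_regions):
        f_global = self.global_encoder(frames)
        f_key = self.key_encoder(key_regions)
        fused = self.alpha * f_key + (1 - self.alpha) * f_global
        return self.decoder(fused)
```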

Table 1 The structural details of dual-channel autoencoder
Fig. 3

The framework of dual-channel autoencoder

3.4 Decoder Based on Attention Mechanism

To enhance the network’s focus on important features, we introduce a channel attention module [33] to calculate the importance of each feature map. A channel attention module is inserted between consecutive deconvolutional layers of the decoder, aiming to enhance the model’s perception and discriminative ability for the target task and thereby improve its generalization capability. The structure of this module is illustrated in Fig. 4.

Fig. 4

The framework of attention-based decoder

The purpose of incorporating channel attention is to improve the model’s focus on different channels by learning attention weights, with the core being the Squeeze-and-Excitation (SE) module. The SE module is a computational unit that can be built upon a transformation convolution operator, denoted as \(F_{tr}\), which maps an input \(X\ (X\in R^{H'\times W'\times C'})\) to a feature map \(U\ (U\in R^{H\times W\times C} )\). In Eq. (2), \(V=[v_{1},v_{2},\cdots ,v_{c} ]\) represents the collection of filter kernels, where \(v_{c}\) denotes the parameters of the c-th filter. The output is denoted as \(u_{c}\).

$$\begin{aligned} u_{c}=v_{c}\times X=\sum _{s=1}^{C^{'}}v^{s}_{c} \times x^{s} \end{aligned}$$
(2)

In Eq. (2), \(\times \) denotes convolution, and \(v^{s}_{c}\) is a two-dimensional spatial kernel acting on the corresponding channel of X. The SE module consists of two steps, squeeze and excitation, to calibrate the filter responses and provide global information. Squeeze is performed by global average pooling, which compresses the input feature map into a one-dimensional vector. This vector contains global information of the input feature map across different channels. Formally, the statistic \(z \ (z\in R^{C})\) is generated by compressing the spatial dimensions \(H \times W\) of U. Each \(z_{c}\) is computed using Eq. (3):

$$\begin{aligned} z_{c}=F_{sq}(u_{c})=\frac{\sum _{i=1}^{H}\sum _{j=1}^{W}u_{c}(i,j)}{H\times W} \end{aligned}$$
(3)

To utilize the aggregated information after sufficient squeezing, an excitation operation is performed. It maps the one-dimensional vector to a lower-dimensional vector through a fully connected layer and uses the Sigmoid activation function to generate channel attention weights, as shown in Eq. (4):

$$\begin{aligned} s=F_{ex}(z,W)=\sigma (g(z,W))=\sigma (W_{2}\delta (W_{1}z)) \end{aligned}$$
(4)

In Eq. (4), \(\sigma \) denotes the Sigmoid function, \(\delta \) denotes the ReLU [34] function, and \(W_{1}\in R^{\frac{C}{r} \times C}\), \(W_{2}\in R^{C \times \frac{C}{r}}\). To limit the model’s complexity and improve its generalization, this study employs a dimension reduction layer with a reduction ratio r, ReLU, and an expansion layer for parameterized gating. The final output of the SE module is obtained by re-scaling U with the activated s.
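A compact sketch of the SE-style channel attention used between decoder layers is given below, written for 5D (batch, channel, time, height, width) feature maps; the reduction ratio r = 16 is an assumed value, as the paper does not report it.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention for 3D decoder feature maps.
    Squeeze: global average pooling over (T, H, W), one statistic per channel, Eq. (3).
    Excitation: bottleneck FC layers with ReLU and Sigmoid giving weights s, Eq. (4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: dimension reduction
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2: dimension expansion
            nn.Sigmoid())                                # sigma

    def forward(self, u):
        b, c = u.shape[:2]
        z = u.mean(dim=(2, 3, 4))          # squeeze: (B, C)
        s = self.fc(z)                     # excitation: (B, C)
        return u * s.view(b, c, 1, 1, 1)   # re-scale U channel-wise
```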

3.5 Anomaly Discrimination

This paper employs reconstruction error for anomaly discrimination. The basic idea of reconstruction error is that normal samples, when input into a well-trained model, will produce outputs that are closer to the inputs, resulting in lower reconstruction error because they are more similar to the training data. On the other hand, anomalous samples, due to their significant differences from the training data, will exhibit higher reconstruction error.

Let \(x_{t}\) denote the video segment or frame at time t, let g represent the neural network that reconstructs the input, and let \(g_{t}(x)\) denote the reconstructed video segment or frame at time t. The reconstruction error is defined as the function R between the original input \(x_{t}\) and its reconstruction \(g_{t}(x)\), as shown in Eq. (5).

$$\begin{aligned} R(x_{t},g_{t}(x))=\left\| x_{t}-g_{t}(x) \right\| ^{2}_{2} \end{aligned}$$
(5)

The reconstruction error formula represents the L2 norm error between the original and reconstructed inputs. During the testing phase, when the model reconstructs frames of anomalous events, it typically exhibits larger reconstruction errors. If the error value exceeds a predefined threshold, the current video frame is considered to contain an anomalous event. To facilitate further computations, after obtaining the reconstruction scores for all frames of the video, this paper normalizes the reconstruction scores of all frames. Let \(R_{u}\) denote the score of the u-th frame, where \(min(R_{u})\) represents the minimum score among all scores, and \(max(R_{u})\) represents the maximum score among all scores. The normalized reconstruction score for the u-th frame is calculated as shown in Eq. (6).

$$\begin{aligned} p_{u}(R_{u})=1-\frac{R_{u}-min(R_{u})}{max(R_{u})-min(R_{u})} \end{aligned}$$
(6)
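The scoring procedure of Eqs. (5) and (6) can be sketched as follows; the model interface and data layout follow the hypothetical dual-channel autoencoder sketch above and are assumptions rather than the exact implementation.

```python
import numpy as np
import torch

def frame_scores(model, test_clips):
    """Per-frame reconstruction error of Eq. (5), followed by the min-max
    normalized score p_u of Eq. (6); values near 0 suggest anomalous frames."""
    errors = []
    model.eval()
    with torch.no_grad():
        for frames, key_regions in test_clips:   # tensors of shape (B, C, T, H, W)
            recon = model(frames, key_regions)
            # squared L2 error per frame (summed over channels and pixels)
            e = ((frames - recon) ** 2).sum(dim=(1, 3, 4))
            errors.append(e.flatten().cpu().numpy())
    r = np.concatenate(errors)
    return 1.0 - (r - r.min()) / (r.max() - r.min())   # Eq. (6)

# Frames are flagged as anomalous when their score falls below a chosen threshold:
# anomalies = frame_scores(model, test_clips) < tau
```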

4 Experiments

4.1 Comparison with State-of-the-Art Methods

To validate the effectiveness of the proposed KRFE-DAE, experiments were conducted on the UCSD ped2, CUHK Avenue, and SHTech Campus datasets for video anomaly event detection. The performance of KRFE-DAE was compared with handcrafted feature-based methods [45,46,47] and deep learning-based methods [35,36,37,38,39,40,41,42,43,44] in terms of frame-level AUC. The experimental results are presented in Table 2.

Table 2 AUC performance of KRFE-DAE compared with other handcrafted feature-based methods and deep learning-based methods on the UCSD ped2, CUHK Avenue, and SHTech Campus datasets

Existing methods can be broadly categorized into handcrafted feature-based methods and deep learning-based methods. As shown in Table 2, our proposed method achieves significantly higher AUC scores on the UCSD Ped2, CUHK Avenue, and SHTech Campus datasets, with values of 97.1%, 82.9%, and 73.8%, respectively, surpassing the performance of handcrafted feature-based methods. Among the deep learning-based methods, compared to the traditional 3D convolutional autoencoder method [35], KRFE-DAE improves the accuracy by 5.9%, 5.5%, and 4.1% on the UCSD Ped2, CUHK Avenue, and SHTech Campus datasets, respectively, and achieves the best performance on the UCSD Ped2 and SHTech Campus datasets.

In the CUHK Avenue dataset, our method does not perform the best, primarily due to the presence of various abnormal behaviors such as throwing backpacks and papers. These throwing actions result in blurred objects that are challenging to extract using the proposed KREN in this paper. In future work, we will further optimize KREN to enhance its ability to capture rapidly moving objects. In fact, as shown in Table 2, most methods do not achieve the best performance across all three datasets. For example, ME [40] outperforms our method in the CUHK Avenue dataset, but its accuracy is lower than ours in the other two datasets. Overall, our method has significantly improved accuracy compared to traditional deep learning-based approaches, providing sufficient evidence for the effectiveness of KRFE-DAE.

Table 3 Time complexity comparisons between KRFE-DAE and other existing methods

We empirically study the computational complexity of the proposed KRFE-DAE on an NVIDIA GTX 1080 Ti GPU. As shown in Table 3, KRFE-DAE runs at 26.2 frames per second (FPS), demonstrating better real-time feasibility than other existing methods. This further validates the effectiveness of our proposed approach.

4.2 Ablation Study

A series of ablation experiments were conducted in this study to investigate the effects of selecting different parameters and modules, aiming to validate the effectiveness of the proposed method.

4.2.1 Selection of Parameters

The selection of the key region occlusion rate determines the number of preserved targets in the key region, which significantly affects the effectiveness of key region feature enhancement and the final results of anomaly detection. Therefore, in this paper, we conducted occlusion rate selection experiments on the UCSD ped2 dataset using a single autoencoder. The experiments tested the model’s anomaly detection performance with different values of occlusion rate ranging from 0 to 1. The experimental results are shown in Table 4.

Table 4 Comparison of accuracy of KREN on the UCSD ped2 Dataset under different random occlusion rates

According to Table 4, it can be observed that the KREN performs the best on the UCSD ped2 dataset when the occlusion rate is set to 0.2. Moreover, the performance of KREN with an occlusion rate of 0.2 surpasses that of the model without any occlusion (occlusion rate of 0), thereby demonstrating the effectiveness of the proposed random occlusion module.

For the task of video anomaly event detection, in addition to improving the accuracy of anomaly detection, it is also necessary to fully consider the real-time requirements of detection and strive to enhance the speed of detection. In order to maximize the real-time potential of the proposed method while ensuring accuracy, this paper adopts a linear fusion approach during feature fusion to reduce the computational cost of the model and thereby improve the detection speed.

$$\begin{aligned} F_{fusion}=\alpha \cdot F_{1}+\beta \cdot F_{2} \end{aligned}$$
(7)
$$\begin{aligned} \beta =1-\alpha \end{aligned}$$
(8)

In the dual-channel autoencoder, the feature fusion method is shown in Eq. (7), where \(F_{1}\) and \(F_{2}\) represent the extracted key region feature and global contextual feature from two encoders, respectively. \(\alpha \) and \(\beta \) are weight coefficients used to adjust the relative importance of the two features, with a linear relationship between them as depicted in Eq. (8). In this paper, experiments were conducted to select the value of \(\alpha \) on the UCSD ped2 and CUHK Avenue datasets, and the experimental results are shown in Fig. 5. Due to the linear relationship between \(\alpha \) and \(\beta \), as shown in Eq. (8), when \(\alpha \) is 0.5, \(\beta \) is also 0.5. As can be seen from Fig. 5, the dual-channel autoencoder achieves optimal accuracy on both datasets at this point. Therefore, in the experiments, we assign equal weights to the key region feature and global context feature, achieving a good balance and improving the accuracy of detection.
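The weight selection experiment can be organized as a simple sweep over \(\alpha \), as sketched below; the helper train_and_score and the frame-level labels are hypothetical placeholders for the training and evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sweep_alpha(train_and_score, frame_labels,
                alphas=np.round(np.arange(0.1, 1.0, 0.1), 1)):
    """For each fusion weight alpha (beta = 1 - alpha, Eq. (8)), retrain and
    evaluate the dual-channel autoencoder, reporting the frame-level AUC."""
    aucs = {}
    for alpha in alphas:
        scores = train_and_score(alpha)            # normalized scores p_u per frame
        aucs[alpha] = roc_auc_score(frame_labels, 1 - scores)  # anomaly score = 1 - p_u
    return aucs
```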

Fig. 5

Accuracy of anomaly detection in dual-channel autoencoder at different values of \(\alpha \)

In future work, we will continue to refine the feature fusion method and employ a learnable approach to further improve the model’s performance.

4.2.2 Analysis of the Effectiveness of Different Modules

We have also examined the impact of different components of KRFE-DAE in terms of AUC, as shown in Table 5.

Table 5 Impacts of different components on KRFE-DAE

Compared with the traditional approach that only uses an autoencoder, Table 5 reveals a slight performance improvement from adding the attention module, resulting in a 1.3% increase in accuracy on the UCSD ped2 dataset. Moreover, incorporating KREN on top of the autoencoder leads to a significant performance boost, with a 4.7% increase in accuracy on UCSD ped2. This demonstrates the importance and high detection value of key region information for anomaly detection. Integrating both KREN and the attention module into a single autoencoder further improves the accuracy on UCSD ped2 by 5.3%. The combination of a dual-channel autoencoder with the key region extraction network and the attention mechanism (KRFE-DAE) achieves the best detection performance with an accuracy of 97.1%, a 5.9% improvement over the baseline autoencoder. These experimental results validate the effectiveness of the proposed approach.

4.3 Analysis of Visual Results

This paper presents a visual analysis of the experimental results of each module of KRFE-DAE on the abnormal event detection datasets.

4.3.1 Visual Analysis of KREN’s Results

Figure 6 illustrates the key region extraction performance of KREN on different datasets.

Fig. 6

Key region extraction results of KREN on normal and abnormal video frames in the UCSD ped2 dataset, CUHK Avenue dataset, and SHTech Campus dataset

As depicted in Fig. 6, KREN demonstrates efficient extraction of pixel-level key regions for video abnormal events, which helps mitigate background interference and extract more discriminative features.

4.3.2 Visual Analysis of Attention-Based Autoencoder

In order to further validate the effectiveness of the attention-based autoencoder, this study conducted experiments on the UCSD ped2 and CUHK Avenue datasets using both a traditional autoencoder without an attention mechanism and an attention-based autoencoder. As shown in Fig. 7, the attention-based autoencoder converges faster and achieves a lower loss value than the traditional autoencoder, providing evidence for the effectiveness of the attention module.

Fig. 7

Comparison of loss functions during training of the attention-based autoencoder and the traditional autoencoder without an attention mechanism on different datasets

4.3.3 Visual Analysis of KRFE-DAE’s Results

We conducted video anomaly event detection on the UCSD ped2 and CUHK Avenue datasets using both a traditional autoencoder and the proposed KRFE-DAE. The detection results were visualized and analyzed. Figure 8a and b show the test results of the same video segment from UCSD ped2 using the traditional autoencoder method and the KRFE-DAE, respectively. A comparison between Fig. 8a and b reveals that the traditional method exhibits numerous false negatives, while the KRFE-DAE successfully detects all anomaly events with a true positive rate (TPR) of 100%.

Fig. 8

Visualization images of detection results obtained using both the traditional autoencoder and KRFE-DAE on a specific testing video from the UCSD ped2 dataset. The images include reconstructed score curves for detection, as well as annotations for false positives and false negatives

Comparing Fig. 9a and b, it can be observed that in the video segment of the CUHK Avenue dataset, although the traditional method has no false negatives, it suffers from significant false positives. In the real world, false alarms caused by false positives can lead to substantial resource wastage and have detrimental effects. In contrast, the KRFE-DAE achieves a false positive rate (FPR) of 0 and a TPR of 96.8% for this video segment, with only a minimal number of missed abnormal frames. Overall, the proposed method demonstrates effectiveness and practicality, outperforming the traditional method.

Fig. 9

Visualization images of detection results obtained using both the traditional autoencoder and KRFE-DAE on a specific testing video from the CUHK Avenue dataset. The images include reconstructed score curves for detection, as well as annotations for false positives and false negatives

5 Conclusion

In traditional video anomaly event detection based on autoencoders, the autoencoder lacks focus during feature extraction and overlooks the importance of key regions where anomaly events are likely to occur. To address this issue, we propose a video anomaly event detection algorithm based on KRFE-DAE. KRFE-DAE precisely extracts pixel-level key regions using KREN, reducing interference from redundant information. To avoid missing contextual information, KRFE-DAE adopts a dual-stream structure to simultaneously extract key region information and global contextual information, achieving enhanced key region feature fusion. Additionally, we incorporate an attention mechanism into the decoder to improve the reconstruction performance of the network. Extensive experiments on three video anomaly event detection datasets, namely UCSD ped2, CUHK Avenue, and SHTech Campus, validate the effectiveness of the proposed KRFE-DAE in this paper. In future work, we will further investigate how to better apply video anomaly event detection in real-world scenarios and improve our method in terms of extracting motion features and enhancing real-time feasibility.